Week 5: Neural Networks
Instructor: Sergey Levine

1 Neural Networks Summary

In the previous lecture, we saw how we can construct neural networks by extending logistic regression. Neural networks consist of multiple layers of computation. At each layer, we have a vector of values called activations, which we'll denote $h^{(l)}$ for the activations at layer $l$. By convention, we'll say that $h^{(0)} = x$ (the input), and for the last layer $L$, we have $h^{(L)}$ be the output, which could be the probability of a label (we'll discuss regression and other types of outputs later). We can express the activations in the neural net recursively as

$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}),$$

where $\sigma(z)$ is a nonlinearity. Last lecture, we saw a common choice of nonlinearity, the sigmoid (or logistic function):

$$\sigma(z) = \frac{1}{1 + \exp(-z)}.$$

In a neural network, we must choose the number of layers $L$ and the size of each layer (the size of $h^{(0)}$ is the number of input attributes and the size of $h^{(L)}$ is the number of outputs; more on that later). If we have a network with three layers of weights (corresponding to two hidden layers), we can explicitly write it as:

$$h^{(3)} = \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} h^{(0)} + b^{(1)}) + b^{(2)}) + b^{(3)}).$$

Note that there is a little bit of confusion in terminology: this network has three layers of weights, given by $W^{(1)}$, $W^{(2)}$, and $W^{(3)}$, but two layers of hidden activations $h^{(1)}$ and $h^{(2)}$, since $h^{(0)}$ is simply $x$ (and therefore not "hidden"), and $h^{(3)}$ is the output. Traditionally, "hidden layer" in a neural network refers to a layer of hidden activations (or "neurons"), but more recent terminology typically refers to the weights as layers. They are off by one, so don't get confused!

Now, let's check our understanding:

What is the data?
Answer. Just like in logistic regression, we have an input vector $x$, which can be either continuous real-valued, binary, or categorical. Just like with logistic regression, categorical values are often converted into one-hot encodings: if some feature takes on $M$ values, we might add $M$ entries to $x$, where all but one is zero. This avoids the need to impose an ordering on that variable. The output $y$ for now is categorical (just like in logistic regression), though we'll discuss how we can have real-valued $y$ as well later.

What defines (parameterizes) the hypothesis?

Answer. To define a neural network, we have to first choose the number and size of the layers. This is a design decision (so technically, the number and size of layers are hyperparameters). We choose these the same way we choose the features $h(x)$ to use for logistic regression, the amount of regularization, etc.: we use our intuition and, when in doubt, validate against the validation set. Once we've figured out the number of layers and their size, the neural network is fully defined by the weights $\theta = \{W^{(1)}, b^{(1)}, \dots, W^{(L)}, b^{(L)}\}$. It is precisely these weights that our learning algorithm needs to optimize.

What is the objective?

Answer. Neural networks are conditional (or "discriminative") models. Just like logistic regression and linear regression, neural networks optimize the conditional log-likelihood, given by

$$L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta).$$

In the case of a binary label $y$ and a three-layer network (two hidden layers), this is given by

$$L(\theta) = \sum_{i=1}^{N} \log \begin{cases} 1 - \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} x_i + b^{(1)}) + b^{(2)}) + b^{(3)}) & \text{if } y_i = 0 \\ \sigma(W^{(3)} \sigma(W^{(2)} \sigma(W^{(1)} x_i + b^{(1)}) + b^{(2)}) + b^{(3)}) & \text{if } y_i = 1 \end{cases}$$

What is the algorithm?

Answer. Just like with logistic regression, we'll use gradient ascent to optimize the neural network parameters $\theta$. However, we have many more parameters now, and computing the gradient becomes a lot more difficult. Furthermore, unlike with logistic regression, the neural network objective is not convex, because the weights $W^{(l)}$ and $b^{(l)}$ have complex nonlinear effects on the output.
That means that gradient ascent can get stuck in local optima when optimizing neural networks, and we cannot in general guarantee a globally optimal solution. In practice, especially for smaller networks, it is often not hard to find good
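enough solutions with gradient ascent that are not globally optimal but still perform well. For very large and deep networks, we often have to think about more sophisticated optimization algorithms (more on this later). To summarize, the gradient ascent algorithm simply consists of repeatedly applying the following operation:

$$\theta^{(j+1)} \leftarrow \theta^{(j)} + \alpha \nabla L(\theta^{(j)}).$$

Note that $\nabla L(\theta^{(j)})$ here is a huge vector that consists of the concatenation of the gradient with respect to each weight matrix and bias vector. Assume that each weight matrix $W^{(l)}$ has $M_l$ rows and $M_{l-1}$ columns, then $\nabla L(\theta)$ can be written as:

$$\nabla L(\theta) = \begin{bmatrix} \frac{dL}{dW^{(1)}_{1,1}} \\ \frac{dL}{dW^{(1)}_{2,1}} \\ \vdots \\ \frac{dL}{dW^{(1)}_{M_1,1}} \\ \frac{dL}{dW^{(1)}_{1,2}} \\ \vdots \\ \frac{dL}{dW^{(1)}_{M_1,2}} \\ \vdots \\ \frac{dL}{dW^{(1)}_{M_1,M_0}} \\ \frac{dL}{db^{(1)}_{1}} \\ \vdots \\ \frac{dL}{db^{(1)}_{M_1}} \\ \frac{dL}{dW^{(2)}_{1,1}} \\ \vdots \\ \frac{dL}{dW^{(2)}_{M_2,1}} \\ \vdots \\ \frac{dL}{dW^{(2)}_{M_2,M_1}} \\ \frac{dL}{db^{(2)}_{1}} \\ \vdots \\ \frac{dL}{db^{(2)}_{M_2}} \\ \vdots \\ \frac{dL}{dW^{(L)}_{M_L,M_{L-1}}} \\ \vdots \\ \frac{dL}{db^{(L)}_{M_L}} \end{bmatrix}$$

In practice, it can often be more convenient when implementing gradient ascent for neural networks to simply compute the gradient with respect to each matrix $W^{(l)}$ and vector $b^{(l)}$ and increment them individually according to the gradient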
ascent rule (which has exactly the same effect). For example, in an object-oriented framework, each matrix and vector can be its own object that knows how to compute its own gradient and apply gradient ascent, or else concatenate its gradient to the huge full gradient vector. In the next section, we'll discuss how we can compute the gradient with respect to each weight matrix $W^{(l)}$.

2 Backpropagation: base case

To evaluate the output of a neural network, we recursively evaluate each layer as follows:

$$h^{(l)} = \sigma(W^{(l)} h^{(l-1)} + b^{(l)}).$$

We will also introduce $z^{(l)} = W^{(l)} h^{(l-1)} + b^{(l)}$, such that $h^{(l)} = \sigma(z^{(l)})$. Sometimes, you'll see $h$ referred to as "post-synaptic" and $z$ as "pre-synaptic." This recursive evaluation of a neural network is referred to as forward propagation, because we are propagating the activations from the input forward to the output. To compute derivatives, we use the chain rule to differentiate the objective with respect to each layer of the neural network. This algorithm proceeds from the end of the neural network back down to the first layer, and is therefore referred to as backward propagation or backpropagation.

First, let's revisit the chain rule. If we have the composition of two functions $f(g(x))$, and we want $\frac{df}{dx}$, we can evaluate it as:

$$\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx}.$$

For example, if $f(y) = ay$ and $g(x) = bx$, then $\frac{df}{dx} = ab$. The same idea applies in the multivariate case. Let's say that we have a vector $x$, and $f(y) = Ay$, and $g(x) = Bx$. Then

$$\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx} = AB, \qquad \left(\frac{df}{dx}\right)_{i,j} = \sum_k A_{i,k} B_{k,j}.$$

Next, let's see how we can compute the gradient for a single-layer neural network, which simply corresponds to logistic regression. Although the output of this network is 1D in the binary case, we'll still do the math for the multivariate case, assuming there are $M_1$ outputs (it just happens that $M_1 = 1$). This will make things more convenient later. The log-likelihood is given by

$$L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i, W^{(1)}, b^{(1)}).$$

In the binary case, we simply have $\log p(y \mid x, W^{(1)}, b^{(1)}) = \log(h^{(1)})$ if $y = 1$ and $\log p(y \mid x, W^{(1)}, b^{(1)}) = \log(1 - h^{(1)})$ otherwise.
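The forward propagation recursion and the binary log-likelihood described above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: it uses plain Python lists to stay dependency-free, and the function names (`sigmoid`, `forward`, `log_likelihood`) and the example weights are assumptions made for this sketch.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Forward propagation: returns the activations [h0, h1, ..., hL].

    weights[l] is a list of rows (one row per output unit) and biases[l]
    is a list with one entry per output unit. Note that the Python index
    l is 0-based, so weights[0] plays the role of W^(1) in the notes.
    """
    hs = [list(x)]  # h^(0) = x
    for W, b in zip(weights, biases):
        h_prev = hs[-1]
        # z = W h_prev + b, computed row by row
        z = [sum(w_jk * h_k for w_jk, h_k in zip(row, h_prev)) + b_j
             for row, b_j in zip(W, b)]
        hs.append([sigmoid(z_j) for z_j in z])
    return hs

def log_likelihood(weights, biases, x, y):
    """log p(y | x, theta) for a binary label y and a single output unit."""
    p = forward(weights, biases, x)[-1][0]
    return math.log(p) if y == 1 else math.log(1.0 - p)

# Hypothetical network: 2 inputs, one hidden layer of 2 units, 1 output.
weights = [[[0.5, -0.5], [0.25, 0.75]], [[1.0, -1.0]]]
biases = [[0.0, 0.1], [0.2]]
hs = forward(weights, biases, [1.0, 2.0])
print("output probability:", hs[-1][0])
print("log-likelihood for y = 1:", log_likelihood(weights, biases, [1.0, 2.0], 1))
```

Keeping every intermediate activation in `hs`, rather than only the output, is a deliberate choice: the derivative computations that follow need the activations of each layer.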
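If we assume for now that there is only one datapoint (we'll see what happens with multiple datapoints later), we just have

$$\frac{dL}{dh^{(1)}} = \begin{cases} -\dfrac{1}{1 - h^{(1)}} & \text{if } y = 0 \\ \dfrac{1}{h^{(1)}} & \text{if } y = 1 \end{cases}$$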
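We know that $L(h^{(1)}) = L(\sigma(z^{(1)})) = L(\sigma(W^{(1)} x + b^{(1)}))$, and therefore we can use the chain rule to get

$$\frac{dL}{dz^{(1)}} = \frac{dh^{(1)}}{dz^{(1)}} \frac{dL}{dh^{(1)}}.$$

We already know that $\frac{dL}{dh^{(1)}}$ is simply $-\frac{1}{1 - h^{(1)}}$ or $\frac{1}{h^{(1)}}$. We know that

$$h^{(1)} = \sigma(z^{(1)}) = \frac{1}{1 + \exp(-z^{(1)})},$$

and therefore

$$\frac{dh^{(1)}}{dz^{(1)}} = \frac{\exp(-z^{(1)})}{(1 + \exp(-z^{(1)}))^2} = \sigma(z^{(1)})(1 - \sigma(z^{(1)})).$$

To see why this is true, note that

$$1 - \sigma(z) = \frac{1 + \exp(-z)}{1 + \exp(-z)} - \frac{1}{1 + \exp(-z)} = \frac{\exp(-z)}{1 + \exp(-z)}.$$

Note that $\frac{dh^{(1)}_j}{dz^{(1)}_k} = 0$ if $j \neq k$, so we only need to pointwise multiply each entry in $\frac{dL}{dh^{(1)}}$ by $\sigma(z^{(1)}_j)(1 - \sigma(z^{(1)}_j))$.

Now we just need to evaluate

$$\frac{dL}{dW^{(1)}} = \frac{dz^{(1)}}{dW^{(1)}} \frac{dL}{dz^{(1)}}, \qquad \frac{dL}{db^{(1)}} = \frac{dz^{(1)}}{db^{(1)}} \frac{dL}{dz^{(1)}}.$$

We know that $z^{(1)} = W^{(1)} x + b^{(1)}$. Let's start with the bias, where we simply have

$$\frac{dL}{db^{(1)}_j} = \sum_{k=1}^{M_1} \frac{dz^{(1)}_k}{db^{(1)}_j} \frac{dL}{dz^{(1)}_k} = \frac{dL}{dz^{(1)}_j},$$

since $b^{(1)}_j$ only affects $z^{(1)}_j$, and not any other $z^{(1)}_k$ where $k \neq j$. Evaluating the derivative with respect to the weight matrix $W^{(1)}$ is a bit more complex. Writing $x_k$ for the $k$-th entry of $x$, we have

$$z^{(1)}_j = \sum_{k=1}^{M_0} W^{(1)}_{j,k} x_k + b^{(1)}_j,$$

and therefore

$$\frac{dL}{dW^{(1)}_{j,k}} = \sum_{i=1}^{M_1} \frac{dz^{(1)}_i}{dW^{(1)}_{j,k}} \frac{dL}{dz^{(1)}_i} = \frac{dL}{dz^{(1)}_j} x_k,$$

since we can see in the sum that $z^{(1)}_j$ only depends on $W^{(1)}_{j,k}$ for $k$ from $1$ to $M_0$. We can also express this in matrix notation as

$$\frac{dL}{dW^{(1)}} = \frac{dL}{dz^{(1)}} x^T.$$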
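As a sanity check on the base case, the sketch below computes $\frac{dL}{dW^{(1)}}$ and $\frac{dL}{db^{(1)}}$ for a single sigmoid layer with one output ($M_1 = 1$) and compares one entry against a central finite-difference estimate. The helper names (`log_lik`, `single_layer_grads`) and the example values are assumptions chosen for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(W, b, x, y):
    """log p(y | x, W, b) for a single sigmoid output (M_1 = 1)."""
    h = sigmoid(sum(w * x_k for w, x_k in zip(W[0], x)) + b[0])
    return math.log(h) if y == 1 else math.log(1.0 - h)

def single_layer_grads(W, b, x, y):
    """Analytic gradients of log p(y | x) for one sigmoid layer."""
    z = [sum(w * x_k for w, x_k in zip(row, x)) + b_j
         for row, b_j in zip(W, b)]
    h = [sigmoid(z_j) for z_j in z]
    # dL/dh is 1/h when y = 1 and -1/(1 - h) when y = 0
    dL_dh = [1.0 / h_j if y == 1 else -1.0 / (1.0 - h_j) for h_j in h]
    # dL/dz: pointwise multiply by sigma(z) * (1 - sigma(z))
    dL_dz = [g * h_j * (1.0 - h_j) for g, h_j in zip(dL_dh, h)]
    dL_db = list(dL_dz)                              # dL/db = dL/dz
    dL_dW = [[g * x_k for x_k in x] for g in dL_dz]  # dL/dW = dL/dz x^T
    return dL_dW, dL_db

# Hypothetical example values.
W, b, x, y = [[0.3, -0.2]], [0.1], [1.0, 2.0], 1
dL_dW, dL_db = single_layer_grads(W, b, x, y)

# Central finite-difference estimate of dL/dW[0][0].
eps = 1e-6
numeric = (log_lik([[0.3 + eps, -0.2]], b, x, y)
           - log_lik([[0.3 - eps, -0.2]], b, x, y)) / (2 * eps)
print("analytic:", dL_dW[0][0], "numeric:", numeric)
```

Comparing against finite differences like this is a common way to catch sign and indexing mistakes before moving on to deeper networks.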
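3 Backpropagation: recursive case intro

Now, let's say that we have a multilayer neural network. The last layer simply looks like logistic regression, except that now instead of $x$, we have the activations in the second-to-last layer $h^{(L-1)}$. We therefore have:

$$\frac{dL}{db^{(L)}} = \frac{dL}{dz^{(L)}}, \qquad \frac{dL}{dW^{(L)}} = \frac{dL}{dz^{(L)}} (h^{(L-1)})^T.$$

But how can we get the derivatives with respect to the weights in the preceding layers? Well, we note that the previous layers only affect $L$ via $h^{(L-1)}$, so we simply need to know $\frac{dL}{dh^{(L-1)}}$. Since $z^{(L)} = W^{(L)} h^{(L-1)} + b^{(L)}$, we know that

$$z^{(L)}_j = \sum_{k=1}^{M_{L-1}} W^{(L)}_{j,k} h^{(L-1)}_k + b^{(L)}_j,$$

and therefore we can derive

$$\frac{dL}{dh^{(L-1)}_k} = \sum_{j=1}^{M_L} \frac{dz^{(L)}_j}{dh^{(L-1)}_k} \frac{dL}{dz^{(L)}_j} = \sum_{j=1}^{M_L} W^{(L)}_{j,k} \frac{dL}{dz^{(L)}_j} = \sum_{j=1}^{M_L} \left((W^{(L)})^T\right)_{k,j} \frac{dL}{dz^{(L)}_j}.$$

In matrix notation, this simply becomes

$$\frac{dL}{dh^{(L-1)}} = (W^{(L)})^T \frac{dL}{dz^{(L)}}.$$

Now, we can simply proceed recursively, and use $\frac{dL}{dh^{(L-1)}}$ in place of $\frac{dL}{dh^{(L)}}$ to compute the derivatives with respect to $W^{(L-1)}$ and $b^{(L-1)}$.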
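The recursion above can be sketched as a loop that walks from the last layer back to the first, followed by a per-matrix gradient ascent step. This is a minimal sketch for a single datapoint with a binary label and one output unit; the function names (`forward`, `backprop`, `gradient_ascent_step`), the step size `alpha`, and the example network are assumptions for illustration, and plain Python lists are used to avoid dependencies.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """Return all activations [h0, ..., hL]; weights[0] plays the role of W^(1)."""
    hs = [list(x)]
    for W, b in zip(weights, biases):
        hs.append([sigmoid(sum(w * h for w, h in zip(row, hs[-1])) + b_j)
                   for row, b_j in zip(W, b)])
    return hs

def backprop(weights, biases, x, y):
    """Per-layer gradients (dL/dW, dL/db) of log p(y | x, theta)."""
    hs = forward(weights, biases, x)
    out = hs[-1][0]
    # Base case: dL/dh at the output layer.
    dL_dh = [1.0 / out if y == 1 else -1.0 / (1.0 - out)]
    grads_W, grads_b = [], []
    for l in range(len(weights) - 1, -1, -1):
        h, h_prev, W = hs[l + 1], hs[l], weights[l]
        # dL/dz = dL/dh * sigma(z) * (1 - sigma(z)), pointwise
        dL_dz = [g * h_j * (1.0 - h_j) for g, h_j in zip(dL_dh, h)]
        grads_b.append(dL_dz)                                           # dL/db
        grads_W.append([[g * h_k for h_k in h_prev] for g in dL_dz])    # dL/dz h_prev^T
        # Recursive step: dL/dh_prev = W^T dL/dz
        dL_dh = [sum(W[j][k] * dL_dz[j] for j in range(len(W)))
                 for k in range(len(h_prev))]
    return grads_W[::-1], grads_b[::-1]

def gradient_ascent_step(weights, biases, x, y, alpha=0.1):
    """Increment each W and b in place along its own gradient."""
    gW, gb = backprop(weights, biases, x, y)
    for W, b, dW, db in zip(weights, biases, gW, gb):
        for j in range(len(W)):
            b[j] += alpha * db[j]
            for k in range(len(W[j])):
                W[j][k] += alpha * dW[j][k]

# Hypothetical network: 2 inputs, one hidden layer of 2 units, 1 output.
weights = [[[0.5, -0.5], [0.25, 0.75]], [[1.0, -1.0]]]
biases = [[0.0, 0.1], [0.2]]
x, y = [1.0, 2.0], 1
before = math.log(forward(weights, biases, x)[-1][0])
gradient_ascent_step(weights, biases, x, y)
after = math.log(forward(weights, biases, x)[-1][0])
print("log-likelihood before:", before, "after:", after)
```

Updating each matrix and vector individually, as done in `gradient_ascent_step`, has the same effect as forming the full concatenated gradient vector and applying a single update to $\theta$.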