CS224n: Natural Language Processing with Deep Learning, Lecture Notes: Part III, Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Amani Peddada, Richard Socher, Qiaojing Yan

Keyphrases: Neural networks, Forward computation, Backward propagation, Neuron Units, Max-margin Loss, Gradient checks, Xavier parameter initialization, Learning rates, Adagrad.

This set of notes introduces single and multilayer neural networks, and how they can be used for classification purposes. We then discuss how they can be trained using a distributed gradient descent technique known as backpropagation. We will see how the chain rule can be used to make parameter updates sequentially. After a rigorous mathematical discussion of neural networks, we will discuss some practical tips and tricks in training neural networks, involving: neuron units (non-linearities), gradient checks, Xavier parameter initialization, learning rates, Adagrad, etc. Lastly, we will motivate the use of recurrent neural networks as a language model.

1 Neural Networks: Foundations

We established in our previous discussions the need for non-linear classifiers, since most data are not linearly separable and thus our classification performance on them is limited. Neural networks are a family of classifiers with a non-linear decision boundary, as seen in Figure 1. Now that we know the sort of decision boundaries neural networks create, let us see how they manage to do so.

Figure 1: We see here how a non-linear decision boundary separates the data very well. This is the prowess of neural networks.

Fun Fact: Neural networks are biologically inspired classifiers, which is why they are often called "artificial neural networks" to distinguish them from the organic kind. However, in reality human neural networks are so much more capable and complex than artificial neural networks that it is usually better not to draw too many parallels between the two.

1.1 A Neuron

A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights). One of the most popular choices for neurons is the "sigmoid" or "binary logistic regression" unit. This unit takes an n-dimensional input vector x and produces the scalar activation (output) a. This neuron is also associated with an n-dimensional weight vector, w, and a bias scalar, b. The output of this neuron is then:

a = 1 / (1 + exp(-(w^T x + b)))

Neuron: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.

We can also combine the weights and bias term above to equivalently formulate:

a = 1 / (1 + exp(-[w^T  b] · [x; 1]))

where [x; 1] denotes the input vector x with a 1 appended. This formulation can be visualized in the manner shown in Figure 2.

Figure 2: This image captures how in a sigmoid neuron, the input vector x is first scaled, summed, added to a bias unit, and then passed to the squashing sigmoid function.

1.2 A Single Layer of Neurons

We extend the idea above to multiple neurons by considering the case where the input x is fed as an input to multiple such neurons, as shown in Figure 3. If we refer to the different neurons' weights as {w^(1), ..., w^(m)} and the biases as {b_1, ..., b_m}, we can say the respective activations are {a_1, ..., a_m}:

a_1 = 1 / (1 + exp(-(w^(1)T x + b_1)))
...
a_m = 1 / (1 + exp(-(w^(m)T x + b_m)))

Let us define the following abstractions to keep the notation simple and useful for more complex networks:

σ(z) = [1 / (1 + exp(-z_1)), ..., 1 / (1 + exp(-z_m))]^T

b = [b_1, ..., b_m]^T ∈ R^m

W = [w^(1)T; ...; w^(m)T] ∈ R^(m×n)   (row i of W is w^(i)T)

We can now write the output of scaling and biases as:

z = Wx + b

The activations of the sigmoid function can then be written as:

[a_1, ..., a_m]^T = σ(z) = σ(Wx + b)

Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.
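To make the layer abstraction concrete, the following is a minimal NumPy sketch (the dimensions, random values, and helper names are illustrative choices, not part of the notes) that computes a = σ(Wx + b) and checks it against the neuron-by-neuron formula a_i = σ(w^(i)T x + b_i):

import numpy as np

def sigmoid(z):
    # elementwise logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions: n = 3 input features, m = 4 neurons in the layer.
n, m = 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(n)        # input vector x in R^n
W = rng.standard_normal((m, n))   # row i of W is w^(i)T
b = rng.standard_normal(m)        # bias vector

# Vectorized layer: a = sigma(Wx + b)
a_vec = sigmoid(W @ x + b)

# Equivalent neuron-by-neuron computation: a_i = sigma(w^(i)T x + b_i)
a_loop = np.array([sigmoid(W[i] @ x + b[i]) for i in range(m)])

assert np.allclose(a_vec, a_loop)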

So what do these activations really tell us? Well, one can think of these activations as indicators of the presence of some weighted combination of features. We can then use a combination of these activations to perform classification tasks.

1.3 Feed-forward Computation

So far we have seen how an input vector x ∈ R^n can be fed to a layer of sigmoid units to create activations a ∈ R^m. But what is the intuition behind doing so? Let us consider the following named-entity recognition (NER) problem in NLP as an example:

"Museums in Paris are amazing"

Here, we want to classify whether or not the center word "Paris" is a named entity. In such cases, it is very likely that we would not just want to capture the presence of words in the window of word vectors but also some other interactions between the words in order to make the classification. For instance, maybe it should matter that "Museums" is the first word only if "in" is the second word. Such non-linear decisions can often not be captured by inputs fed directly to a Softmax function, but instead require the scoring of the intermediate layer discussed in Section 1.2. We can thus use another matrix U ∈ R^(m×1) to generate an unnormalized score for a classification task from the activations:

s = U^T a = U^T f(Wx + b)

where f is the activation function.

Analysis of Dimensions: If we represent each word using a 4-dimensional word vector and we use a 5-word window as input (as in the above example), then the input x ∈ R^20. If we use 8 sigmoid units in the hidden layer and generate a score output from the activations, then W ∈ R^(8×20), b ∈ R^8, U ∈ R^(8×1), s ∈ R. The stage-wise feed-forward computation is then:

z = Wx + b
a = σ(z)
s = U^T a

Figure 4: This image captures how a simple feed-forward network might compute its output.
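As a concrete illustration of the dimension analysis above, here is a minimal NumPy sketch that builds the 20-dimensional window input and computes the score s = U^T σ(Wx + b). The word vectors and parameters are random placeholders; only the window size, vector dimension, and hidden size follow the example in the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 5-word window, 4-dimensional word vectors (as in the example above).
window = ["Museums", "in", "Paris", "are", "amazing"]
word_vectors = {w: rng.standard_normal(4) for w in window}   # placeholder vectors

x = np.concatenate([word_vectors[w] for w in window])        # x in R^20

# Single hidden layer with 8 sigmoid units and a scalar score.
W = rng.standard_normal((8, 20))
b = rng.standard_normal(8)
U = rng.standard_normal((8, 1))

z = W @ x + b            # z in R^8
a = sigmoid(z)           # a in R^8
s = (U.T @ a).item()     # unnormalized score for the window
print(s)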

1.4 Maximum Margin Objective Function

Like most machine learning models, neural networks also need an optimization objective, a measure of error or goodness which we want to minimize or maximize respectively. Here, we will discuss a popular error metric known as the maximum margin objective. The idea behind using this objective is to ensure that the score computed for "true" labeled data points is higher than the score computed for "false" labeled data points.

Using the previous example, let us call the score computed for the "true" labeled window "Museums in Paris are amazing" s, and the score computed for the "false" labeled window "Not all museums in Paris" s_c (subscripted as c to signify that the window is "corrupt"). Then, our objective function would be to maximize (s - s_c) or to minimize (s_c - s). However, we modify our objective to ensure that error is only computed if s_c > s, i.e. (s_c - s) > 0. The intuition behind doing this is that we only care that the "true" data point has a higher score than the "false" data point; the rest does not matter. Thus, we want our error to be (s_c - s) if s_c > s, and 0 otherwise. Thus, our optimization objective is now:

minimize J = max(s_c - s, 0)

However, the above optimization objective is risky in the sense that it does not attempt to create a margin of safety. We would want the "true" labeled data point to score higher than the "false" labeled data point by some positive margin Δ. In other words, we would want error to be calculated if (s - s_c < Δ) and not just when (s - s_c < 0). Thus, we modify the optimization objective:

minimize J = max(Δ + s_c - s, 0)

We can scale this margin such that Δ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance. For more information on this, read about functional and geometric margins, a topic often covered in the study of Support Vector Machines. Finally, we define the following optimization objective which we optimize over all training windows:

minimize J = max(1 + s_c - s, 0)

The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).

In the above formulation, s_c = U^T f(Wx_c + b) and s = U^T f(Wx + b).
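A minimal sketch of this loss for a single (true window, corrupt window) pair, reusing the scoring function from the previous section; all parameter values and window vectors below are random placeholders for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    # s = U^T f(Wx + b) with f = sigmoid
    return (U.T @ sigmoid(W @ x + b)).item()

def max_margin_loss(x_true, x_corrupt, W, b, U, delta=1.0):
    # J = max(delta + s_c - s, 0)
    s = score(x_true, W, b, U)
    s_c = score(x_corrupt, W, b, U)
    return max(delta + s_c - s, 0.0)

rng = np.random.default_rng(0)
W, b, U = rng.standard_normal((8, 20)), rng.standard_normal(8), rng.standard_normal((8, 1))
x_true, x_corrupt = rng.standard_normal(20), rng.standard_normal(20)
print(max_margin_loss(x_true, x_corrupt, W, b, U))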

1.5 Training with Backpropagation - Elemental

In this section we discuss how we train the different parameters in the model when the cost J discussed in Section 1.4 is positive. No parameter updates are necessary if the cost is 0. Since we typically update parameters using gradient descent (or a variant such as SGD), we typically need the gradient information for any parameter as required in the update equation:

θ^(t+1) = θ^(t) - α ∇_θ^(t) J

Backpropagation is a technique that allows us to use the chain rule of differentiation to calculate loss gradients for any parameter used in the feed-forward computation of the model. To understand this further, let us look at the toy network shown in Figure 5, for which we will perform backpropagation.

Figure 5: This is a 4-2-1 neural network where neuron j on layer k receives input z_j^(k) and produces activation output a_j^(k).

Here, we use a neural network with a single hidden layer and a single unit output. Let us establish some notation that will make it easier to generalize this model later:

- x is an input to the neural network.
- s is the output of the neural network.
- Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).
- We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).
- Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).
- W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k+1)-th. Thus, W^(1) = W and W^(2) = U^T, to put this new generalized notation in perspective of Section 1.3.

Let us begin: suppose the cost J = (1 + s_c - s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation: backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of the score by multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = -∂J/∂s_c = -1

Therefore we will work with ∂s/∂W_ij^(1) here for simplicity. Thus,

∂s/∂W_ij^(1) = ∂(W^(2) a^(2))/∂W_ij^(1) = W_i^(2) ∂a_i^(2)/∂W_ij^(1) = W_i^(2) (∂a_i^(2)/∂z_i^(2)) (∂z_i^(2)/∂W_ij^(1)) = W_i^(2) f'(z_i^(2)) ∂z_i^(2)/∂W_ij^(1)

= W_i^(2) f'(z_i^(2)) ∂/∂W_ij^(1) (b_i^(1) + a_1^(1) W_i1^(1) + a_2^(1) W_i2^(1) + a_3^(1) W_i3^(1) + a_4^(1) W_i4^(1))
= W_i^(2) f'(z_i^(2)) ∂/∂W_ij^(1) (b_i^(1) + Σ_k a_k^(1) W_ik^(1))
= W_i^(2) f'(z_i^(2)) a_j^(1)
= δ_i^(2) · a_j^(1)

We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2, and a_j^(1) is the input fed to the i-th neuron in layer 2 when scaled by W_ij^(1).

Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

1. We start with an error signal of 1 propagating backwards from a_1^(3).
2. We then multiply this error by the local gradient of the neuron which maps z_1^(3) to a_1^(3). This happens to be 1 in this case and thus, the error is still 1. This is now known as δ_1^(3) = 1.
3. At this point, the error signal of 1 has reached z_1^(3). We now need to distribute the error signal so that the "fair share" of the error reaches a_1^(2).
4. This amount is (error signal at z_1^(3) = δ_1^(3)) × W_1^(2) = W_1^(2). Thus, the error at a_1^(2) = W_1^(2).
5. As we did in step 2, we need to move the error across the neuron which maps z_1^(2) to a_1^(2). We do this by multiplying the error signal at a_1^(2) by the local gradient of the neuron, which happens to be f'(z_1^(2)).
6. Thus, the error signal at z_1^(2) is f'(z_1^(2)) W_1^(2). This is known as δ_1^(2).
7. Finally, we need to distribute the "fair share" of the error to W_14^(1) by simply multiplying it by the input it was responsible for forwarding, which happens to be a_4^(1).
8. Thus, the gradient of the loss with respect to W_14^(1) is calculated to be a_4^(1) f'(z_1^(2)) W_1^(2).

Figure 6: This subnetwork shows the relevant parts of the network required to update W_14^(1).
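The following minimal sketch (a toy 4-2-1 network with random parameters; every name is illustrative) computes the gradient of the score with respect to W_14^(1) using exactly the error-sharing result above, a_4^(1) f'(z_1^(2)) W_1^(2), and confirms it against a centered numerical difference:

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
a1 = rng.standard_normal(4)        # a^(1) = x, the 4 input activations
W1 = rng.standard_normal((2, 4))   # W^(1): maps layer 1 to layer 2
b1 = rng.standard_normal(2)
W2 = rng.standard_normal((1, 2))   # W^(2) = U^T: maps layer 2 to the score

def score(W1_param):
    z2 = W1_param @ a1 + b1
    return (W2 @ f(z2)).item()     # s = a_1^(3)

# Error-sharing computation of ds/dW^(1)_14 (zero-based indices [0, 3]):
z2 = W1 @ a1 + b1
delta2_1 = f_prime(z2[0]) * W2[0, 0]      # delta^(2)_1 = f'(z^(2)_1) W^(2)_1
grad_analytic = delta2_1 * a1[3]          # times the forwarded input a^(1)_4

# Centered numerical difference on the same entry.
h = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 3] += h
W1_minus[0, 3] -= h
grad_numeric = (score(W1_plus) - score(W1_minus)) / (2 * h)

print(grad_analytic, grad_numeric)   # the two values should agree closely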

Notice that the result we arrive at using this approach is exactly the same as that we arrived at using explicit differentiation earlier. Thus, we can calculate error gradients with respect to a parameter in the network using either the chain rule of differentiation or an error sharing and distributed flow approach; both of these approaches happen to do the exact same thing, but it might be helpful to think about them one way or another.

Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)), as long as the input being forwarded is 1. As such, the bias gradient for neuron i on layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be f'(z_1^(2)) W_1^(2).

Generalized steps to propagate δ^(k) to δ^(k-1):

1. We have error δ_i^(k) propagating backwards from z_i^(k), i.e. neuron i at layer k. See Figure 7.
2. We propagate this error backwards to a_j^(k-1) by multiplying δ_i^(k) by the path weight W_ij^(k-1).
3. Thus, the error received at a_j^(k-1) is δ_i^(k) W_ij^(k-1).
4. However, a_j^(k-1) may have been forwarded to multiple nodes in the next layer, as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.
5. Thus, the error received at a_j^(k-1) is δ_i^(k) W_ij^(k-1) + δ_m^(k) W_mj^(k-1).
6. In fact, we can generalize this to be Σ_i δ_i^(k) W_ij^(k-1).
7. Now that we have the correct error at a_j^(k-1), we move it across neuron j at layer k-1 by multiplying it with the local gradient f'(z_j^(k-1)).
8. Thus, the error that reaches z_j^(k-1), called δ_j^(k-1), is f'(z_j^(k-1)) Σ_i δ_i^(k) W_ij^(k-1).

Figure 7: Propagating error from δ^(k) to δ^(k-1).
Figure 8: Propagating error from δ^(k) to δ^(k-1).

1.6 Training with Backpropagation - Vectorized

So far, we discussed how to calculate gradients for a given parameter in the model. Here we will generalize the approach above so that we update weight matrices and bias vectors all at once. Note that these are simply extensions of the above model that will help build intuition for the way error propagation can be done at a matrix-vector level.

For a given parameter W_ij^(k), we identified that the error gradient is simply δ_i^(k+1) · a_j^(k). As a reminder, W^(k) is the matrix that maps a^(k) to z^(k+1). We can thus establish that the error gradient for the entire matrix W^(k) is:

∇_W^(k) = [ δ_1^(k+1) a_1^(k)   δ_1^(k+1) a_2^(k)   ... ;  δ_2^(k+1) a_1^(k)   δ_2^(k+1) a_2^(k)   ... ;  ... ] = δ^(k+1) a^(k)T

Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector δ^(k). We established earlier, using Figure 8, that δ_j^(k) = f'(z_j^(k)) Σ_i δ_i^(k+1) W_ij^(k). This can easily generalize to matrices such that:

δ^(k) = f'(z^(k)) ∘ (W^(k)T δ^(k+1))

In the above formulation, the ∘ operator corresponds to an element-wise product between elements of vectors (∘ : R^N × R^N → R^N). Error thus propagates from layer (k+1) to layer (k) in this manner. Of course, this assumes that in the forward propagation the signal z^(k) first goes through activation neurons f to generate activations a^(k), which are then linearly combined to yield z^(k+1) via the transfer matrix W^(k).

Computational efficiency: Having explored element-wise updates as well as vector-wise updates, we must realize that the vectorized implementations run substantially faster in scientific computing environments such as MATLAB or Python (using the NumPy/SciPy packages). Thus, we should use vectorized implementations in practice. Furthermore, we should also reduce redundant calculations in backpropagation - for instance, notice that δ^(k) depends directly on δ^(k+1). Thus, we should ensure that when we update W^(k) using δ^(k+1), we save δ^(k+1) to later derive δ^(k), and we then repeat this for (k-1), ..., (1). Such a recursive procedure is what makes backpropagation a computationally affordable procedure.
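A minimal NumPy sketch of these two identities for a small network with a sigmoid hidden layer and a linear scalar output (sizes and parameter values are illustrative): the weight gradient is the outer product δ^(k+1) a^(k)T and the error recursion is δ^(k) = f'(z^(k)) ∘ (W^(k)T δ^(k+1)).

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                  # a^(1)
W1 = rng.standard_normal((3, 4))            # maps a^(1) to z^(2)
b1 = rng.standard_normal(3)
W2 = rng.standard_normal((1, 3))            # maps a^(2) to the score z^(3) = s

# Forward pass.
z2 = W1 @ x + b1
a2 = f(z2)
s = (W2 @ a2).item()

# Backward pass: start with ds/dz^(3) = 1 at the linear output.
delta3 = np.array([1.0])
grad_W2 = np.outer(delta3, a2)              # = delta^(3) a^(2)T
delta2 = f_prime(z2) * (W2.T @ delta3)      # = f'(z^(2)) o (W^(2)T delta^(3))
grad_W1 = np.outer(delta2, x)               # = delta^(2) a^(1)T
grad_b1 = delta2                            # bias gradients are just the deltas

# Numerical check of one entry of grad_W1.
h = 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[2, 1] += h
W1m[2, 1] -= h
num = ((W2 @ f(W1p @ x + b1)).item() - (W2 @ f(W1m @ x + b1)).item()) / (2 * h)
print(grad_W1[2, 1], num)                   # should agree closely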

2 Neural Networks: Tips and Tricks

Having discussed the mathematical foundations of neural networks, we will now dive into some tips and tricks commonly employed when using neural networks in practice.

2.1 Gradient Check

In the last section, we discussed in detail how to calculate error gradients/updates for parameters in a neural network model via calculus-based (analytic) methods. Here we now introduce a technique of numerically approximating these gradients. Though too computationally inefficient to be used directly for training the networks, this method will allow us to very precisely estimate the derivative with respect to any parameter; it can thus serve as a useful sanity check on the correctness of our analytic derivatives. Given a model with parameter vector θ and loss function J, the numerical gradient around θ_i is simply given by the centered difference formula:

f'(θ) ≈ (J(θ^(i+)) - J(θ^(i-))) / (2ɛ)

where ɛ is a small number (usually around 1e-5). The term J(θ^(i+)) is simply the error calculated on a forward pass for a given input when we perturb the parameter θ's i-th element by +ɛ. Similarly, the term J(θ^(i-)) is the error calculated on a forward pass for the same input when we perturb the parameter θ's i-th element by -ɛ. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter element in the model.

Gradient checks are a great way to compare analytical and numerical gradients; the two should be close. J(θ^(i+)) and J(θ^(i-)) can be evaluated using two forward passes, and an implementation of this can be seen in Snippet 2.1.

We note that this definition of the numerical gradient follows very naturally from the definition of the derivative, where, in the scalar case,

f'(x) ≈ (f(x + ɛ) - f(x)) / ɛ

Of course, there is a slight difference: the definition above only perturbs x in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function f's behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to ɛ², which is quite small, whereas the one-sided definition is more error-prone.

Now, a natural question you might ask is: if this method is so precise, why do we not use it to compute all of our network gradients instead of applying backpropagation? The simple answer, as hinted earlier, is inefficiency. Recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network, which will be computationally expensive.

Furthermore, many large-scale neural networks can contain millions of parameters, and computing two passes per parameter is clearly not optimal. And, since in optimization techniques such as SGD we must compute the gradients once per iteration for several thousands of iterations, it is obvious that this method quickly grows intractable. This inefficiency is why we only use gradient check to verify the correctness of our analytic gradients, which are much quicker to compute. A standard implementation of gradient check is shown below:

Snippet 2.1

import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate function at x+h and x-h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h       # increment by h
        fxh_left = f(x)             # evaluate f(x + h)
        x[ix] = old_value - h       # decrement by h
        fxh_right = f(x)            # evaluate f(x - h)
        x[ix] = old_value           # restore to previous value (very important!)

        # compute the partial derivative
        grad[ix] = (fxh_left - fxh_right) / (2 * h)  # the slope
        it.iternext()               # step to next dimension

    return grad
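As a quick illustration of how one might use this function, we can compare a known analytic gradient against the numerical one; the quadratic objective below is made up purely for this example, and it relies on eval_numerical_gradient from Snippet 2.1:

import numpy as np

# Toy objective J(x) = sum(x^2), whose analytic gradient is 2x.
J = lambda x: np.sum(x ** 2)

x = np.random.randn(3, 4)
numerical = eval_numerical_gradient(J, x)
analytic = 2 * x

# The relative error should be tiny if the analytic gradient is correct.
rel_error = np.max(np.abs(numerical - analytic) /
                   (np.abs(numerical) + np.abs(analytic) + 1e-12))
print(rel_error)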

2.2 Regularization

As with many machine learning models, neural networks are highly prone to overfitting, where a model is able to obtain near perfect performance on the training dataset, but loses the ability to generalize to unseen data. A common technique used to address overfitting (an issue also known as the "high-variance problem") is the incorporation of an L2 regularization penalty. The idea is that we will simply append an extra term to our loss function J, so that the overall cost is now calculated as:

J_R = J + λ Σ_{i=1}^{L} ||W^(i)||_F

In the above formulation, ||W^(i)||_F is the Frobenius norm of the matrix W^(i) (the i-th weight matrix in the network) and λ is the hyperparameter controlling how much weight the regularization term has relative to the original cost function.

Since we are trying to minimize J_R, what regularization is essentially doing is penalizing weights for being too large while optimizing over the original cost function. Due to the quadratic nature of the Frobenius norm (which computes the sum of the squared elements of a matrix), L2 regularization effectively reduces the flexibility of the model and thereby reduces the overfitting phenomenon. Imposing such a constraint can also be interpreted as the prior Bayesian belief that the optimal weights are close to zero; how close depends on the value of λ. Choosing the right value of λ is critical, and must be chosen via hyperparameter tuning. Too high a value of λ causes most of the weights to be set too close to 0, and the model does not learn anything meaningful from the training data, often obtaining poor accuracy on training, validation, and testing sets. Too low a value, and we fall into the domain of overfitting once again. It must be noted that the bias terms are not regularized and do not contribute to the cost term above; try thinking about why this is the case!

There are indeed other types of regularization that are sometimes used, such as L1 regularization, which sums over the absolute values (rather than squares) of parameter elements; however, this is less commonly applied in practice since it leads to sparsity of parameter weights. In the next section, we discuss dropout, which effectively acts as another form of regularization by randomly dropping (i.e. setting to zero) neurons in the forward pass.

The Frobenius norm of a matrix U is defined as follows: ||U||_F = sqrt(Σ_i Σ_j U_ij²)
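A minimal sketch of adding such a penalty to a loss, using the squared Frobenius norm variant that is common in practice (its gradient with respect to a weight matrix is simply 2λW); the model, λ value, weight shapes, and the placeholder data loss are all illustrative:

import numpy as np

def l2_penalty(weights, lam):
    # lam * sum_i ||W^(i)||_F^2  (squared-norm variant)
    return lam * sum(np.sum(W ** 2) for W in weights)

def l2_grad(W, lam):
    # gradient of the penalty with respect to one weight matrix
    return 2 * lam * W

# Illustrative usage: regularize two weight matrices but not the biases.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 20)), rng.standard_normal((1, 8))
lam = 1e-4
data_loss = 0.37                        # placeholder value for J on a minibatch
J_R = data_loss + l2_penalty([W1, W2], lam)
grad_W1_reg = l2_grad(W1, lam)          # added to the backpropagated gradient of W1
print(J_R, grad_W1_reg.shape)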

2.3 Dropout

Dropout is a powerful technique for regularization, first introduced by Srivastava et al. in "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The idea is simple yet effective: during training, we will randomly "drop" with some probability (1 - p) a subset of neurons during each forward/backward pass (or equivalently, we will keep alive each neuron with a probability p). Then, during testing, we will use the full network to compute our predictions. The result is that the network typically learns more meaningful information from the data, is less likely to overfit, and usually obtains higher performance overall on the task at hand. One intuitive reason why this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.

In practice, the way we introduce dropout is that we take the output h of each layer of neurons, keep each neuron with probability p, and else set it to 0. Then, during back-propagation, we only pass gradients through neurons that were kept alive during the forward pass. Finally, during testing, we compute the forward pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training; else the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-defined. Thus, we must typically divide the outputs of each neuron during testing by a certain value; it is left as an exercise to the reader to determine what this value should be in order for the expected outputs during training and testing to be equivalent.

Figure: Dropout applied to an artificial neural network. Image credits to Srivastava et al.
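The sketch below shows one common way to implement this idea, the "inverted dropout" variant, in which the rescaling is applied at training time so that expected activations already match at test time; the layer size and keep probability are illustrative, and this is not necessarily the exact formulation intended by the exercise above:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # probability of keeping a neuron alive
h = rng.standard_normal(8)                # activations of some hidden layer

# Training time: drop each neuron with probability (1 - p),
# and rescale the survivors by 1/p.
mask = (rng.random(h.shape) < p) / p
h_train = h * mask
# During backpropagation, gradients are multiplied by the same mask,
# so dropped neurons receive no gradient.

# Test time: use the full network; no mask and no extra scaling is needed here
# because the 1/p factor above already matches expected activations.
h_test = h

print(h_train, h_test)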

2.4 Neuron Units

So far we have discussed neural networks that contain sigmoidal neurons to introduce nonlinearities; however, in many applications better networks can be designed using other activation functions. Some common choices are listed here with their function and gradient definitions, and these can be substituted for the sigmoidal functions discussed above.

Sigmoid: This is the default choice we have discussed; the activation function σ is given by:

σ(z) = 1 / (1 + exp(-z)),  where σ(z) ∈ (0, 1)

The gradient of σ(z) is:

σ'(z) = exp(-z) / (1 + exp(-z))² = σ(z)(1 - σ(z))

Figure 9: The response of a sigmoid nonlinearity.

Tanh: The tanh function is an alternative to the sigmoid function that is often found to converge faster in practice. The primary difference between tanh and sigmoid is that the tanh output ranges from -1 to 1 while the sigmoid ranges from 0 to 1.

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)) = 2σ(2z) - 1,  where tanh(z) ∈ (-1, 1)

The gradient of tanh(z) is:

tanh'(z) = 1 - ((exp(z) - exp(-z)) / (exp(z) + exp(-z)))² = 1 - tanh²(z)

Figure 10: The response of a tanh nonlinearity.

Hard tanh: The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1. The activation of the hard tanh is:

hardtanh(z) = -1 if z < -1;  z if -1 ≤ z ≤ 1;  1 if z > 1

The derivative can also be expressed in a piecewise functional form:

hardtanh'(z) = 1 if -1 ≤ z ≤ 1;  0 otherwise

Figure 11: The response of a hard tanh nonlinearity.

Soft sign: The soft sign function is another nonlinearity which can be considered an alternative to tanh, since it too does not saturate as easily as hard-clipped functions:

softsign(z) = z / (1 + |z|)

The derivative is then expressed as:

softsign'(z) = 1 / (1 + |z|)²

Figure 12: The response of a soft sign nonlinearity.

ReLU: The ReLU (Rectified Linear Unit) function is a popular choice of activation since it does not saturate even for larger values of z and has found much success in computer vision applications:

rect(z) = max(z, 0)

The derivative is then the piecewise function:

rect'(z) = 1 if z > 0;  0 otherwise

Figure 13: The response of a ReLU nonlinearity.

Leaky ReLU: Traditional ReLU units by design do not propagate any error for non-positive z; the leaky ReLU modifies this such that a small error is allowed to propagate backwards even when z is negative:

leaky(z) = max(z, k·z),  where 0 < k < 1

This way, the derivative is representable as:

leaky'(z) = 1 if z > 0;  k otherwise

Figure 14: The response of a leaky ReLU nonlinearity.
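The following minimal NumPy sketch implements several of these nonlinearities and their gradients so that they can be dropped into the forward and backward passes above; the function names and the default leaky slope k = 0.01 are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    return ((z >= -1.0) & (z <= 1.0)).astype(float)

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0.0).astype(float)

def leaky_relu(z, k=0.01):
    return np.maximum(z, k * z)

def leaky_relu_grad(z, k=0.01):
    return np.where(z > 0.0, 1.0, k)

z = np.linspace(-2, 2, 5)
print(relu(z), relu_grad(z))   # activations and their local gradients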

2.5 Data Preprocessing

As is the case with machine learning models generally, a key step to ensuring that your model obtains reasonable performance on the task at hand is to perform basic preprocessing on your data. Some common techniques are outlined below.

Mean Subtraction: Given a set of input data X, it is customary to zero-center the data by subtracting the mean feature vector of X from X. An important point is that in practice, the mean is calculated only across the training set, and this mean is subtracted from the training, validation, and testing sets.

Normalization: Another frequently used technique (though perhaps less so than mean subtraction) is to scale every input feature dimension to have similar ranges of magnitudes. This is useful since input features are often measured in different units, but we often want to initially consider all features as equally important. The way we accomplish this is by simply dividing the features by their respective standard deviation calculated across the training set.

Whitening: Not as commonly used as mean subtraction + normalization, whitening essentially converts the data to have an identity covariance matrix; that is, features become uncorrelated and have a variance of 1. This is done by first mean-subtracting the data, as usual, to get X'. We can then take the Singular Value Decomposition (SVD) of X' to get matrices U, S, V. We then compute UX' to project X' into the basis defined by the columns of U. We finally divide each dimension of the result by the corresponding singular value in S to scale our data appropriately (if a singular value is zero, we can just divide by a small number instead).
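A minimal sketch of these three steps on a toy data matrix (rows are examples, columns are features; the whitening step follows one common SVD-of-the-covariance recipe rather than the exact factorization described above, and all names and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 0.5, 2.0, 3.0])
X_train = rng.standard_normal((100, 5)) * scales
X_test = rng.standard_normal((20, 5)) * scales

# Mean subtraction: statistics come from the training set only.
mean = X_train.mean(axis=0)
X_train_c = X_train - mean
X_test_c = X_test - mean              # the same training mean is used for val/test data

# Normalization: divide by the per-feature standard deviation of the training set.
std = X_train_c.std(axis=0)
X_train_n = X_train_c / std
X_test_n = X_test_c / std

# Whitening: SVD of the covariance of the centered data, rotate into that basis,
# then rescale each dimension (guarding against zero singular values).
cov = X_train_c.T @ X_train_c / X_train_c.shape[0]
U, S, Vt = np.linalg.svd(cov)
X_white = (X_train_c @ U) / np.sqrt(S + 1e-5)

print(X_train_n.mean(axis=0).round(3), X_white.std(axis=0).round(3))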

2.6 Parameter Initialization

A key step towards achieving superlative performance with a neural network is initializing the parameters in a reasonable way. A good starting strategy is to initialize the weights to small random numbers normally distributed around 0, and in practice this often works acceptably well. However, in "Understanding the difficulty of training deep feedforward neural networks" (2010), Xavier Glorot et al. study the effect of different weight and bias initialization schemes on training dynamics. The empirical findings suggest that for sigmoid and tanh activation units, faster convergence and lower error rates are achieved when the weights of a matrix W ∈ R^(n^(l+1) × n^(l)) are initialized randomly with a uniform distribution as follows:

W ~ U[ -sqrt(6 / (n^(l) + n^(l+1))),  sqrt(6 / (n^(l) + n^(l+1))) ]

where n^(l) is the number of input units to W (fan-in) and n^(l+1) is the number of output units from W (fan-out). In this parameter initialization scheme, bias units are initialized to 0. This approach attempts to maintain activation variances as well as backpropagated gradient variances across layers. Without such initialization, the gradient variances (which are a proxy for information) generally decrease with backpropagation across layers.
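A minimal sketch of this initialization for one weight matrix; the helper name and the layer sizes (taken from the earlier window-scoring example) are illustrative:

import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # W ~ U[-sqrt(6/(fan_in + fan_out)), +sqrt(6/(fan_in + fan_out))]; biases start at 0.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

rng = np.random.default_rng(0)
W1, b1 = xavier_uniform(fan_in=20, fan_out=8, rng=rng)
print(W1.shape, W1.min().round(3), W1.max().round(3), b1)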

2.7 Learning Strategies

The rate/magnitude of model parameter updates during training can be controlled using the learning rate. In the following naive Gradient Descent formulation, α is the learning rate:

θ_new = θ_old - α ∇_θ J_t(θ)

You might think that for fast convergence rates, we should set α to larger values; however, faster convergence is not guaranteed with larger learning rates. In fact, with very large learning rates, we might experience that the loss function actually diverges because the parameter update causes the model to overshoot the convex minima, as shown in Figure 15. In non-convex models (most of those we work with), the outcome of a large learning rate is unpredictable, but the chances of diverging loss functions are very high.

Figure 15: Here we see that updating parameter w_2 with a large learning rate can lead to divergence of the error.

The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space; of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima. Thus, as with any other hyperparameter, the learning rate must be tuned effectively.

Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight W_ij (where W ∈ R^(n^(l+1) × n^(l))) by the inverse square root of the fan-in of the neuron (n^(l)).

There are several other techniques that have proven to be effective as well. One such method is annealing, where, after several iterations, the learning rate is reduced in some way; this method ensures that we start off with a high learning rate and approach a minimum quickly, and as we get closer to the minimum, we start lowering our learning rate so that we can find the optimum under a more fine-grained scope. A common way to perform annealing is to reduce the learning rate α by a factor x after every n iterations of learning. Exponential decay is also common, where the learning rate α at iteration t is given by α(t) = α_0 e^(-kt), where α_0 is the initial learning rate and k is a hyperparameter. Another approach is to allow the learning rate to decrease over time such that:

α(t) = α_0 τ / max(t, τ)

In the above scheme, α_0 is a tunable parameter and represents the starting learning rate, and τ is also a tunable parameter and represents the time at which the learning rate should start reducing. In practice, this method has been found to work quite well. In the next section we discuss another method for adaptive gradient descent which does not require hand-set learning rates.
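A minimal sketch of the annealing schedules mentioned above (step decay, exponential decay, and the τ/max(t, τ) decay); all constants here are illustrative:

import numpy as np

alpha_0 = 0.1    # starting learning rate (illustrative)

def step_decay(t, factor=0.5, every=1000):
    # reduce alpha by a fixed factor after every `every` iterations
    return alpha_0 * (factor ** (t // every))

def exponential_decay(t, k=1e-3):
    # alpha(t) = alpha_0 * exp(-k t)
    return alpha_0 * np.exp(-k * t)

def inverse_time_decay(t, tau=1000):
    # alpha(t) = alpha_0 * tau / max(t, tau): constant until t = tau, then ~1/t
    return alpha_0 * tau / max(t, tau)

for t in [0, 500, 1000, 5000]:
    print(t, step_decay(t), round(exponential_decay(t), 4), inverse_time_decay(t))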

2.8 Momentum Updates

Momentum methods, a variant of gradient descent inspired by the study of dynamics and motion in physics, attempt to use the velocity of updates as a more effective update scheme. Pseudocode for momentum updates is shown below:

Snippet 2.2

# Computes a standard momentum update
# on parameters x
v = mu*v - alpha*grad_x
x += v

2.9 Adaptive Optimization Methods

AdaGrad is an implementation of standard stochastic gradient descent (SGD) with one key difference: the learning rate can vary for each parameter. The learning rate for each parameter depends on the history of gradient updates of that parameter, in a way such that parameters with a scarce history of updates are updated faster using a larger learning rate. In other words, parameters that have not been updated much in the past are likelier to have higher learning rates now. Formally:

θ_{t,i} = θ_{t-1,i} - (α / sqrt(Σ_{τ=1}^{t} g_{τ,i}²)) g_{t,i}   where   g_{t,i} = ∂J_t(θ)/∂θ_i

In this technique, we see that if the RMS of the history of gradients is extremely low, the learning rate is very high. A simple implementation of this technique is:

Snippet 2.3

# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)

Other common adaptive methods are RMSProp and Adam, whose update rules are shown below (courtesy of Andrej Karpathy):

Snippet 2.4

# Update rule for RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Snippet 2.5

# Update rule for Adam
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

RMSProp is a variant of AdaGrad that utilizes a moving average of squared gradients; in particular, unlike AdaGrad, its updates do not become monotonically smaller. The Adam update rule is in turn a variant of RMSProp, but with the addition of momentum-like updates. We refer the reader to the respective sources of these methods for more detailed analyses of their behavior.


NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Fisher Linear Discriminant Analysis

Fisher Linear Discriminant Analysis Fsher Lnear Dscrmnant Analyss Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan Fsher lnear

More information

Multigradient for Neural Networks for Equalizers 1

Multigradient for Neural Networks for Equalizers 1 Multgradent for Neural Netorks for Equalzers 1 Chulhee ee, Jnook Go and Heeyoung Km Department of Electrcal and Electronc Engneerng Yonse Unversty 134 Shnchon-Dong, Seodaemun-Ku, Seoul 1-749, Korea ABSTRACT

More information

Support Vector Machines

Support Vector Machines /14/018 Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Topic 5: Non-Linear Regression

Topic 5: Non-Linear Regression Topc 5: Non-Lnear Regresson The models we ve worked wth so far have been lnear n the parameters. They ve been of the form: y = Xβ + ε Many models based on economc theory are actually non-lnear n the parameters.

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Formulas for the Determinant

Formulas for the Determinant page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information