CS224n: Natural Language Processing with Deep Learning, Lecture Notes: Part III, Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Rohit Mundra, Amani Peddada, Richard Socher, Qiaojing Yan

Keyphrases: Neural networks, Forward computation, Backward propagation, Neuron Units, Max-margin Loss, Gradient checks, Xavier parameter initialization, Learning rates, Adagrad.

This set of notes introduces single and multilayer neural networks, and how they can be used for classification purposes. We then discuss how they can be trained using a distributed gradient descent technique known as backpropagation. We will see how the chain rule can be used to make parameter updates sequentially. After a rigorous mathematical discussion of neural networks, we will discuss some practical tips and tricks in training neural networks, involving: neuron units (non-linearities), gradient checks, Xavier parameter initialization, learning rates, Adagrad, etc. Lastly, we will motivate the use of recurrent neural networks as a language model.

1 Neural Networks: Foundations

We established in our previous discussions the need for non-linear classifiers, since most data are not linearly separable and thus our classification performance on them is limited. Neural networks are a family of classifiers with a non-linear decision boundary, as seen in Figure 1. Now that we know the sort of decision boundaries neural networks create, let us see how they manage to do so.

Figure 1: We see here how a non-linear decision boundary separates the data very well. This is the prowess of neural networks.

Fun Fact: Neural networks are biologically inspired classifiers, which is why they are often called "artificial neural networks" to distinguish them from the organic kind. However, in reality human neural networks are so much more capable and complex than artificial neural networks that it is usually better not to draw too many parallels between the two.

1.1 A Neuron

A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights). One of the most popular choices for neurons is the "sigmoid" or "binary logistic regression" unit. This unit takes an n-dimensional input vector x and produces the scalar activation (output) a. This neuron is also associated with an n-dimensional weight vector, w, and a bias scalar, b. The output of this neuron is then:

a = 1 / (1 + exp(-(w^T x + b)))

Neuron: A neuron is the fundamental building block of neural networks. We will see that a neuron can be one of many functions that allows for non-linearities to accrue in the network.

We can also combine the weights and bias term above to equivalently formulate:

a = 1 / (1 + exp(-[w^T  b] · [x; 1]))

where [x; 1] denotes the input vector x with a 1 appended. This formulation can be visualized in the manner shown in Figure 2.

Figure 2: This image captures how in a sigmoid neuron, the input vector x is first scaled, summed, added to a bias unit, and then passed to the squashing sigmoid function.

1.2 A Single Layer of Neurons

We extend the idea above to multiple neurons by considering the case where the input x is fed as an input to multiple such neurons, as shown in Figure 3. If we refer to the different neurons' weights as {w^(1), ..., w^(m)} and the biases as {b_1, ..., b_m}, we can say the respective activations are {a_1, ..., a_m}:

a_1 = 1 / (1 + exp(-(w^(1)T x + b_1)))
...
a_m = 1 / (1 + exp(-(w^(m)T x + b_m)))

Let us define the following abstractions to keep the notation simple and useful for more complex networks:

σ(z) = [1 / (1 + exp(-z_1)), ..., 1 / (1 + exp(-z_m))]^T

b = [b_1, ..., b_m]^T ∈ R^m

W = [w^(1)T; ...; w^(m)T] ∈ R^(m×n)   (row i of W is w^(i)T)

We can now write the output of scaling and biases as:

z = Wx + b

The activations of the sigmoid function can then be written as:

[a_1, ..., a_m]^T = σ(z) = σ(Wx + b)

Figure 3: This image captures how multiple sigmoid units are stacked on the right, all of which receive the same input x.
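To make the layer abstraction concrete, the following is a minimal NumPy sketch (the dimensions, random values, and helper names are illustrative choices, not part of the notes) that computes a = σ(Wx + b) and checks it against the neuron-by-neuron formula a_i = σ(w^(i)T x + b_i):

import numpy as np

def sigmoid(z):
    # elementwise logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions: n = 3 input features, m = 4 neurons in the layer.
n, m = 3, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(n)        # input vector x in R^n
W = rng.standard_normal((m, n))   # row i of W is w^(i)T
b = rng.standard_normal(m)        # bias vector

# Vectorized layer: a = sigma(Wx + b)
a_vec = sigmoid(W @ x + b)

# Equivalent neuron-by-neuron computation: a_i = sigma(w^(i)T x + b_i)
a_loop = np.array([sigmoid(W[i] @ x + b[i]) for i in range(m)])

assert np.allclose(a_vec, a_loop)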

So what do these activations really tell us? Well, one can think of these activations as indicators of the presence of some weighted combination of features. We can then use a combination of these activations to perform classification tasks.

1.3 Feed-forward Computation

So far we have seen how an input vector x ∈ R^n can be fed to a layer of sigmoid units to create activations a ∈ R^m. But what is the intuition behind doing so? Let us consider the following named-entity recognition (NER) problem in NLP as an example:

"Museums in Paris are amazing"

Here, we want to classify whether or not the center word "Paris" is a named entity. In such cases, it is very likely that we would not just want to capture the presence of words in the window of word vectors but also some other interactions between the words in order to make the classification. For instance, maybe it should matter that "Museums" is the first word only if "in" is the second word. Such non-linear decisions can often not be captured by inputs fed directly to a Softmax function, but instead require the scoring of the intermediate layer discussed in Section 1.2. We can thus use another matrix U ∈ R^(m×1) to generate an unnormalized score for a classification task from the activations:

s = U^T a = U^T f(Wx + b)

where f is the activation function.

Analysis of Dimensions: If we represent each word using a 4-dimensional word vector and we use a 5-word window as input (as in the above example), then the input x ∈ R^20. If we use 8 sigmoid units in the hidden layer and generate a score output from the activations, then W ∈ R^(8×20), b ∈ R^8, U ∈ R^(8×1), s ∈ R. The stage-wise feed-forward computation is then:

z = Wx + b
a = σ(z)
s = U^T a

Figure 4: This image captures how a simple feed-forward network might compute its output.
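As a concrete illustration of the dimension analysis above, here is a minimal NumPy sketch that builds the 20-dimensional window input and computes the score s = U^T σ(Wx + b). The word vectors and parameters are random placeholders; only the window size, vector dimension, and hidden size follow the example in the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# 5-word window, 4-dimensional word vectors (as in the example above).
window = ["Museums", "in", "Paris", "are", "amazing"]
word_vectors = {w: rng.standard_normal(4) for w in window}   # placeholder vectors

x = np.concatenate([word_vectors[w] for w in window])        # x in R^20

# Single hidden layer with 8 sigmoid units and a scalar score.
W = rng.standard_normal((8, 20))
b = rng.standard_normal(8)
U = rng.standard_normal((8, 1))

z = W @ x + b            # z in R^8
a = sigmoid(z)           # a in R^8
s = (U.T @ a).item()     # unnormalized score for the window
print(s)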

1.4 Maximum Margin Objective Function

Like most machine learning models, neural networks also need an optimization objective, a measure of error or goodness which we want to minimize or maximize respectively. Here, we will discuss a popular error metric known as the maximum margin objective. The idea behind using this objective is to ensure that the score computed for "true" labeled data points is higher than the score computed for "false" labeled data points.

Using the previous example, let us call the score computed for the "true" labeled window "Museums in Paris are amazing" s, and the score computed for the "false" labeled window "Not all museums in Paris" s_c (subscripted as c to signify that the window is "corrupt"). Then, our objective function would be to maximize (s - s_c) or to minimize (s_c - s). However, we modify our objective to ensure that error is only computed if s_c > s, i.e. (s_c - s) > 0. The intuition behind doing this is that we only care that the "true" data point has a higher score than the "false" data point; the rest does not matter. Thus, we want our error to be (s_c - s) if s_c > s, and 0 otherwise. Thus, our optimization objective is now:

minimize J = max(s_c - s, 0)

However, the above optimization objective is risky in the sense that it does not attempt to create a margin of safety. We would want the "true" labeled data point to score higher than the "false" labeled data point by some positive margin Δ. In other words, we would want error to be calculated if (s - s_c < Δ) and not just when (s - s_c < 0). Thus, we modify the optimization objective:

minimize J = max(Δ + s_c - s, 0)

We can scale this margin such that Δ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance. For more information on this, read about functional and geometric margins, a topic often covered in the study of Support Vector Machines. Finally, we define the following optimization objective which we optimize over all training windows:

minimize J = max(1 + s_c - s, 0)

The max-margin objective function is most commonly associated with Support Vector Machines (SVMs).

In the above formulation, s_c = U^T f(Wx_c + b) and s = U^T f(Wx + b).
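A minimal sketch of this loss for a single (true window, corrupt window) pair, reusing the scoring function from the previous section; all parameter values and window vectors below are random placeholders for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    # s = U^T f(Wx + b) with f = sigmoid
    return (U.T @ sigmoid(W @ x + b)).item()

def max_margin_loss(x_true, x_corrupt, W, b, U, delta=1.0):
    # J = max(delta + s_c - s, 0)
    s = score(x_true, W, b, U)
    s_c = score(x_corrupt, W, b, U)
    return max(delta + s_c - s, 0.0)

rng = np.random.default_rng(0)
W, b, U = rng.standard_normal((8, 20)), rng.standard_normal(8), rng.standard_normal((8, 1))
x_true, x_corrupt = rng.standard_normal(20), rng.standard_normal(20)
print(max_margin_loss(x_true, x_corrupt, W, b, U))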

1.5 Training with Backpropagation - Elemental

In this section we discuss how we train the different parameters in the model when the cost J discussed in Section 1.4 is positive. No parameter updates are necessary if the cost is 0. Since we typically update parameters using gradient descent (or a variant such as SGD), we typically need the gradient information for any parameter as required in the update equation:

θ^(t+1) = θ^(t) - α ∇_θ^(t) J

Backpropagation is a technique that allows us to use the chain rule of differentiation to calculate loss gradients for any parameter used in the feed-forward computation of the model. To understand this further, let us look at the toy network shown in Figure 5, for which we will perform backpropagation.

Figure 5: This is a 4-2-1 neural network where neuron j on layer k receives input z_j^(k) and produces activation output a_j^(k).

Here, we use a neural network with a single hidden layer and a single unit output. Let us establish some notation that will make it easier to generalize this model later:

- x is an input to the neural network.
- s is the output of the neural network.
- Each layer (including the input and output layers) has neurons which receive an input and produce an output. The j-th neuron of layer k receives the scalar input z_j^(k) and produces the scalar activation output a_j^(k).
- We will call the backpropagated error calculated at z_j^(k) as δ_j^(k).
- Layer 1 refers to the input layer and not the first hidden layer. For the input layer, x_j = z_j^(1) = a_j^(1).
- W^(k) is the transfer matrix that maps the output from the k-th layer to the input to the (k+1)-th. Thus, W^(1) = W and W^(2) = U^T, to put this new generalized notation in perspective of Section 1.3.

Let us begin: suppose the cost J = (1 + s_c - s) is positive and we want to perform the update of parameter W_14^(1) (in Figure 5 and Figure 6). We must realize that W_14^(1) only contributes to z_1^(2) and thus a_1^(2). This fact is crucial to understanding backpropagation: backpropagated gradients are only affected by values they contribute to. a_1^(2) is consequently used in the forward computation of the score by multiplication with W_1^(2). We can see from the max-margin loss that:

∂J/∂s = -∂J/∂s_c = -1

Therefore we will work with ∂s/∂W_ij^(1) here for simplicity. Thus,

∂s/∂W_ij^(1) = ∂(W^(2) a^(2))/∂W_ij^(1) = W_i^(2) ∂a_i^(2)/∂W_ij^(1) = W_i^(2) (∂a_i^(2)/∂z_i^(2)) (∂z_i^(2)/∂W_ij^(1)) = W_i^(2) f'(z_i^(2)) ∂z_i^(2)/∂W_ij^(1)

= W_i^(2) f'(z_i^(2)) ∂/∂W_ij^(1) (b_i^(1) + a_1^(1) W_i1^(1) + a_2^(1) W_i2^(1) + a_3^(1) W_i3^(1) + a_4^(1) W_i4^(1))
= W_i^(2) f'(z_i^(2)) ∂/∂W_ij^(1) (b_i^(1) + Σ_k a_k^(1) W_ik^(1))
= W_i^(2) f'(z_i^(2)) a_j^(1)
= δ_i^(2) · a_j^(1)

We see above that the gradient reduces to the product δ_i^(2) · a_j^(1), where δ_i^(2) is essentially the error propagating backwards from the i-th neuron in layer 2, and a_j^(1) is the input fed to the i-th neuron in layer 2 when scaled by W_ij^(1).

Let us discuss the "error sharing/distribution" interpretation of backpropagation better using Figure 6 as an example. Say we were to update W_14^(1):

1. We start with an error signal of 1 propagating backwards from a_1^(3).
2. We then multiply this error by the local gradient of the neuron which maps z_1^(3) to a_1^(3). This happens to be 1 in this case and thus, the error is still 1. This is now known as δ_1^(3) = 1.
3. At this point, the error signal of 1 has reached z_1^(3). We now need to distribute the error signal so that the "fair share" of the error reaches a_1^(2).
4. This amount is (error signal at z_1^(3) = δ_1^(3)) × W_1^(2) = W_1^(2). Thus, the error at a_1^(2) = W_1^(2).
5. As we did in step 2, we need to move the error across the neuron which maps z_1^(2) to a_1^(2). We do this by multiplying the error signal at a_1^(2) by the local gradient of the neuron, which happens to be f'(z_1^(2)).
6. Thus, the error signal at z_1^(2) is f'(z_1^(2)) W_1^(2). This is known as δ_1^(2).
7. Finally, we need to distribute the "fair share" of the error to W_14^(1) by simply multiplying it by the input it was responsible for forwarding, which happens to be a_4^(1).
8. Thus, the gradient of the loss with respect to W_14^(1) is calculated to be a_4^(1) f'(z_1^(2)) W_1^(2).

Figure 6: This subnetwork shows the relevant parts of the network required to update W_14^(1).
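The following minimal sketch (a toy 4-2-1 network with random parameters; every name is illustrative) computes the gradient of the score with respect to W_14^(1) using exactly the error-sharing result above, a_4^(1) f'(z_1^(2)) W_1^(2), and confirms it against a centered numerical difference:

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
a1 = rng.standard_normal(4)        # a^(1) = x, the 4 input activations
W1 = rng.standard_normal((2, 4))   # W^(1): maps layer 1 to layer 2
b1 = rng.standard_normal(2)
W2 = rng.standard_normal((1, 2))   # W^(2) = U^T: maps layer 2 to the score

def score(W1_param):
    z2 = W1_param @ a1 + b1
    return (W2 @ f(z2)).item()     # s = a_1^(3)

# Error-sharing computation of ds/dW^(1)_14 (zero-based indices [0, 3]):
z2 = W1 @ a1 + b1
delta2_1 = f_prime(z2[0]) * W2[0, 0]      # delta^(2)_1 = f'(z^(2)_1) W^(2)_1
grad_analytic = delta2_1 * a1[3]          # times the forwarded input a^(1)_4

# Centered numerical difference on the same entry.
h = 1e-5
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 3] += h
W1_minus[0, 3] -= h
grad_numeric = (score(W1_plus) - score(W1_minus)) / (2 * h)

print(grad_analytic, grad_numeric)   # the two values should agree closely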

Notice that the result we arrive at using this approach is exactly the same as that we arrived at using explicit differentiation earlier. Thus, we can calculate error gradients with respect to a parameter in the network using either the chain rule of differentiation or an error sharing and distributed flow approach; both of these approaches happen to do the exact same thing, but it might be helpful to think about them one way or another.

Bias Updates: Bias terms (such as b_1^(1)) are mathematically equivalent to other weights contributing to the neuron input (z_1^(2)), as long as the input being forwarded is 1. As such, the bias gradient for neuron i on layer k is simply δ_i^(k). For instance, if we were updating b_1^(1) instead of W_14^(1) above, the gradient would simply be f'(z_1^(2)) W_1^(2).

Generalized steps to propagate δ^(k) to δ^(k-1):

1. We have error δ_i^(k) propagating backwards from z_i^(k), i.e. neuron i at layer k. See Figure 7.
2. We propagate this error backwards to a_j^(k-1) by multiplying δ_i^(k) by the path weight W_ij^(k-1).
3. Thus, the error received at a_j^(k-1) is δ_i^(k) W_ij^(k-1).
4. However, a_j^(k-1) may have been forwarded to multiple nodes in the next layer, as shown in Figure 8. It should receive responsibility for errors propagating backward from node m in layer k too, using the exact same mechanism.
5. Thus, the error received at a_j^(k-1) is δ_i^(k) W_ij^(k-1) + δ_m^(k) W_mj^(k-1).
6. In fact, we can generalize this to be Σ_i δ_i^(k) W_ij^(k-1).
7. Now that we have the correct error at a_j^(k-1), we move it across neuron j at layer k-1 by multiplying it with the local gradient f'(z_j^(k-1)).
8. Thus, the error that reaches z_j^(k-1), called δ_j^(k-1), is f'(z_j^(k-1)) Σ_i δ_i^(k) W_ij^(k-1).

Figure 7: Propagating error from δ^(k) to δ^(k-1).
Figure 8: Propagating error from δ^(k) to δ^(k-1).

1.6 Training with Backpropagation - Vectorized

So far, we discussed how to calculate gradients for a given parameter in the model. Here we will generalize the approach above so that we update weight matrices and bias vectors all at once. Note that these are simply extensions of the above model that will help build intuition for the way error propagation can be done at a matrix-vector level.

For a given parameter W_ij^(k), we identified that the error gradient is simply δ_i^(k+1) · a_j^(k). As a reminder, W^(k) is the matrix that maps a^(k) to z^(k+1). We can thus establish that the error gradient for the entire matrix W^(k) is:

∇_W^(k) = [ δ_1^(k+1) a_1^(k)   δ_1^(k+1) a_2^(k)   ... ;  δ_2^(k+1) a_1^(k)   δ_2^(k+1) a_2^(k)   ... ;  ... ] = δ^(k+1) a^(k)T

Thus, we can write an entire matrix gradient using the outer product of the error vector propagating into the matrix and the activations forwarded by the matrix.

Now, we will see how we can calculate the error vector δ^(k). We established earlier, using Figure 8, that δ_j^(k) = f'(z_j^(k)) Σ_i δ_i^(k+1) W_ij^(k). This can easily generalize to matrices such that:

δ^(k) = f'(z^(k)) ∘ (W^(k)T δ^(k+1))

In the above formulation, the ∘ operator corresponds to an element-wise product between elements of vectors (∘ : R^N × R^N → R^N). Error thus propagates from layer (k+1) to layer (k) in this manner. Of course, this assumes that in the forward propagation the signal z^(k) first goes through activation neurons f to generate activations a^(k), which are then linearly combined to yield z^(k+1) via the transfer matrix W^(k).

Computational efficiency: Having explored element-wise updates as well as vector-wise updates, we must realize that the vectorized implementations run substantially faster in scientific computing environments such as MATLAB or Python (using the NumPy/SciPy packages). Thus, we should use vectorized implementations in practice. Furthermore, we should also reduce redundant calculations in backpropagation - for instance, notice that δ^(k) depends directly on δ^(k+1). Thus, we should ensure that when we update W^(k) using δ^(k+1), we save δ^(k+1) to later derive δ^(k), and we then repeat this for (k-1), ..., (1). Such a recursive procedure is what makes backpropagation a computationally affordable procedure.
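A minimal NumPy sketch of these two identities for a small network with a sigmoid hidden layer and a linear scalar output (sizes and parameter values are illustrative): the weight gradient is the outer product δ^(k+1) a^(k)T and the error recursion is δ^(k) = f'(z^(k)) ∘ (W^(k)T δ^(k+1)).

import numpy as np

def f(z):
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid

def f_prime(z):
    return f(z) * (1.0 - f(z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                  # a^(1)
W1 = rng.standard_normal((3, 4))            # maps a^(1) to z^(2)
b1 = rng.standard_normal(3)
W2 = rng.standard_normal((1, 3))            # maps a^(2) to the score z^(3) = s

# Forward pass.
z2 = W1 @ x + b1
a2 = f(z2)
s = (W2 @ a2).item()

# Backward pass: start with ds/dz^(3) = 1 at the linear output.
delta3 = np.array([1.0])
grad_W2 = np.outer(delta3, a2)              # = delta^(3) a^(2)T
delta2 = f_prime(z2) * (W2.T @ delta3)      # = f'(z^(2)) o (W^(2)T delta^(3))
grad_W1 = np.outer(delta2, x)               # = delta^(2) a^(1)T
grad_b1 = delta2                            # bias gradients are just the deltas

# Numerical check of one entry of grad_W1.
h = 1e-5
W1p, W1m = W1.copy(), W1.copy()
W1p[2, 1] += h
W1m[2, 1] -= h
num = ((W2 @ f(W1p @ x + b1)).item() - (W2 @ f(W1m @ x + b1)).item()) / (2 * h)
print(grad_W1[2, 1], num)                   # should agree closely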

2 Neural Networks: Tips and Tricks

Having discussed the mathematical foundations of neural networks, we will now dive into some tips and tricks commonly employed when using neural networks in practice.

2.1 Gradient Check

In the last section, we discussed in detail how to calculate error gradients/updates for parameters in a neural network model via calculus-based (analytic) methods. Here we now introduce a technique of numerically approximating these gradients. Though too computationally inefficient to be used directly for training the networks, this method will allow us to very precisely estimate the derivative with respect to any parameter; it can thus serve as a useful sanity check on the correctness of our analytic derivatives. Given a model with parameter vector θ and loss function J, the numerical gradient around θ_i is simply given by the centered difference formula:

f'(θ) ≈ (J(θ^(i+)) - J(θ^(i-))) / (2ɛ)

where ɛ is a small number (usually around 1e-5). The term J(θ^(i+)) is simply the error calculated on a forward pass for a given input when we perturb the parameter θ's i-th element by +ɛ. Similarly, the term J(θ^(i-)) is the error calculated on a forward pass for the same input when we perturb the parameter θ's i-th element by -ɛ. Thus, using two forward passes, we can approximate the gradient with respect to any given parameter element in the model.

Gradient checks are a great way to compare analytical and numerical gradients; the two should be close. J(θ^(i+)) and J(θ^(i-)) can be evaluated using two forward passes, and an implementation of this can be seen in Snippet 2.1.

We note that this definition of the numerical gradient follows very naturally from the definition of the derivative, where, in the scalar case,

f'(x) ≈ (f(x + ɛ) - f(x)) / ɛ

Of course, there is a slight difference: the definition above only perturbs x in the positive direction to compute the gradient. While it would have been perfectly acceptable to define the numerical gradient in this way, in practice it is often more precise and stable to use the centered difference formula, where we perturb a parameter in both directions. The intuition is that to get a better approximation of the derivative/slope around a point, we need to examine the function f's behavior both to the left and right of that point. It can also be shown using Taylor's theorem that the centered difference formula has an error proportional to ɛ², which is quite small, whereas the one-sided definition is more error-prone.

Now, a natural question you might ask is: if this method is so precise, why do we not use it to compute all of our network gradients instead of applying backpropagation? The simple answer, as hinted earlier, is inefficiency. Recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network, which will be computationally expensive.

Furthermore, many large-scale neural networks can contain millions of parameters, and computing two passes per parameter is clearly not optimal. And, since in optimization techniques such as SGD we must compute the gradients once per iteration for several thousands of iterations, it is obvious that this method quickly grows intractable. This inefficiency is why we only use gradient check to verify the correctness of our analytic gradients, which are much quicker to compute. A standard implementation of gradient check is shown below:

Snippet 2.1

import numpy as np

def eval_numerical_gradient(f, x):
    """
    a naive implementation of numerical gradient of f at x
    - f should be a function that takes a single argument
    - x is the point (numpy array) to evaluate the gradient at
    """
    fx = f(x)  # evaluate function value at original point
    grad = np.zeros(x.shape)
    h = 0.00001

    # iterate over all indexes in x
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        # evaluate function at x+h and x-h
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h       # increment by h
        fxh_left = f(x)             # evaluate f(x + h)
        x[ix] = old_value - h       # decrement by h
        fxh_right = f(x)            # evaluate f(x - h)
        x[ix] = old_value           # restore to previous value (very important!)

        # compute the partial derivative
        grad[ix] = (fxh_left - fxh_right) / (2 * h)  # the slope
        it.iternext()               # step to next dimension

    return grad
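As a quick illustration of how one might use this function, we can compare a known analytic gradient against the numerical one; the quadratic objective below is made up purely for this example, and it relies on eval_numerical_gradient from Snippet 2.1:

import numpy as np

# Toy objective J(x) = sum(x^2), whose analytic gradient is 2x.
J = lambda x: np.sum(x ** 2)

x = np.random.randn(3, 4)
numerical = eval_numerical_gradient(J, x)
analytic = 2 * x

# The relative error should be tiny if the analytic gradient is correct.
rel_error = np.max(np.abs(numerical - analytic) /
                   (np.abs(numerical) + np.abs(analytic) + 1e-12))
print(rel_error)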

2.2 Regularization

As with many machine learning models, neural networks are highly prone to overfitting, where a model is able to obtain near perfect performance on the training dataset, but loses the ability to generalize to unseen data. A common technique used to address overfitting (an issue also known as the "high-variance problem") is the incorporation of an L2 regularization penalty. The idea is that we will simply append an extra term to our loss function J, so that the overall cost is now calculated as:

J_R = J + λ Σ_{i=1}^{L} ||W^(i)||_F

In the above formulation, ||W^(i)||_F is the Frobenius norm of the matrix W^(i) (the i-th weight matrix in the network) and λ is the hyperparameter controlling how much weight the regularization term has relative to the original cost function.

Since we are trying to minimize J_R, what regularization is essentially doing is penalizing weights for being too large while optimizing over the original cost function. Due to the quadratic nature of the Frobenius norm (which computes the sum of the squared elements of a matrix), L2 regularization effectively reduces the flexibility of the model and thereby reduces the overfitting phenomenon. Imposing such a constraint can also be interpreted as the prior Bayesian belief that the optimal weights are close to zero; how close depends on the value of λ. Choosing the right value of λ is critical, and must be chosen via hyperparameter tuning. Too high a value of λ causes most of the weights to be set too close to 0, and the model does not learn anything meaningful from the training data, often obtaining poor accuracy on training, validation, and testing sets. Too low a value, and we fall into the domain of overfitting once again. It must be noted that the bias terms are not regularized and do not contribute to the cost term above; try thinking about why this is the case!

There are indeed other types of regularization that are sometimes used, such as L1 regularization, which sums over the absolute values (rather than squares) of parameter elements; however, this is less commonly applied in practice since it leads to sparsity of parameter weights. In the next section, we discuss dropout, which effectively acts as another form of regularization by randomly dropping (i.e. setting to zero) neurons in the forward pass.

The Frobenius norm of a matrix U is defined as follows: ||U||_F = sqrt(Σ_i Σ_j U_ij²)
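A minimal sketch of adding such a penalty to a loss, using the squared Frobenius norm variant that is common in practice (its gradient with respect to a weight matrix is simply 2λW); the model, λ value, weight shapes, and the placeholder data loss are all illustrative:

import numpy as np

def l2_penalty(weights, lam):
    # lam * sum_i ||W^(i)||_F^2  (squared-norm variant)
    return lam * sum(np.sum(W ** 2) for W in weights)

def l2_grad(W, lam):
    # gradient of the penalty with respect to one weight matrix
    return 2 * lam * W

# Illustrative usage: regularize two weight matrices but not the biases.
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((8, 20)), rng.standard_normal((1, 8))
lam = 1e-4
data_loss = 0.37                        # placeholder value for J on a minibatch
J_R = data_loss + l2_penalty([W1, W2], lam)
grad_W1_reg = l2_grad(W1, lam)          # added to the backpropagated gradient of W1
print(J_R, grad_W1_reg.shape)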

2.3 Dropout

Dropout is a powerful technique for regularization, first introduced by Srivastava et al. in "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The idea is simple yet effective: during training, we will randomly "drop" with some probability (1 - p) a subset of neurons during each forward/backward pass (or equivalently, we will keep alive each neuron with a probability p). Then, during testing, we will use the full network to compute our predictions. The result is that the network typically learns more meaningful information from the data, is less likely to overfit, and usually obtains higher performance overall on the task at hand. One intuitive reason why this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.

In practice, the way we introduce dropout is that we take the output h of each layer of neurons, keep each neuron with probability p, and else set it to 0. Then, during back-propagation, we only pass gradients through neurons that were kept alive during the forward pass. Finally, during testing, we compute the forward pass using all of the neurons in the network. However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training; else the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-defined. Thus, we must typically divide the outputs of each neuron during testing by a certain value; it is left as an exercise to the reader to determine what this value should be in order for the expected outputs during training and testing to be equivalent.

Figure: Dropout applied to an artificial neural network. Image credits to Srivastava et al.
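The sketch below shows one common way to implement this idea, the "inverted dropout" variant, in which the rescaling is applied at training time so that expected activations already match at test time; the layer size and keep probability are illustrative, and this is not necessarily the exact formulation intended by the exercise above:

import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # probability of keeping a neuron alive
h = rng.standard_normal(8)                # activations of some hidden layer

# Training time: drop each neuron with probability (1 - p),
# and rescale the survivors by 1/p.
mask = (rng.random(h.shape) < p) / p
h_train = h * mask
# During backpropagation, gradients are multiplied by the same mask,
# so dropped neurons receive no gradient.

# Test time: use the full network; no mask and no extra scaling is needed here
# because the 1/p factor above already matches expected activations.
h_test = h

print(h_train, h_test)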

2.4 Neuron Units

So far we have discussed neural networks that contain sigmoidal neurons to introduce nonlinearities; however, in many applications better networks can be designed using other activation functions. Some common choices are listed here with their function and gradient definitions, and these can be substituted for the sigmoidal functions discussed above.

Sigmoid: This is the default choice we have discussed; the activation function σ is given by:

σ(z) = 1 / (1 + exp(-z)),  where σ(z) ∈ (0, 1)

The gradient of σ(z) is:

σ'(z) = exp(-z) / (1 + exp(-z))² = σ(z)(1 - σ(z))

Figure 9: The response of a sigmoid nonlinearity.

Tanh: The tanh function is an alternative to the sigmoid function that is often found to converge faster in practice. The primary difference between tanh and sigmoid is that the tanh output ranges from -1 to 1 while the sigmoid ranges from 0 to 1.

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)) = 2σ(2z) - 1,  where tanh(z) ∈ (-1, 1)

The gradient of tanh(z) is:

tanh'(z) = 1 - ((exp(z) - exp(-z)) / (exp(z) + exp(-z)))² = 1 - tanh²(z)

Figure 10: The response of a tanh nonlinearity.

Hard tanh: The hard tanh function is sometimes preferred over the tanh function since it is computationally cheaper. It does however saturate for magnitudes of z greater than 1. The activation of the hard tanh is:

hardtanh(z) = -1 if z < -1;  z if -1 ≤ z ≤ 1;  1 if z > 1

The derivative can also be expressed in a piecewise functional form:

hardtanh'(z) = 1 if -1 ≤ z ≤ 1;  0 otherwise

Figure 11: The response of a hard tanh nonlinearity.

Soft sign: The soft sign function is another nonlinearity which can be considered an alternative to tanh, since it too does not saturate as easily as hard-clipped functions:

softsign(z) = z / (1 + |z|)

The derivative is then expressed as:

softsign'(z) = 1 / (1 + |z|)²

Figure 12: The response of a soft sign nonlinearity.

ReLU: The ReLU (Rectified Linear Unit) function is a popular choice of activation since it does not saturate even for larger values of z and has found much success in computer vision applications:

rect(z) = max(z, 0)

The derivative is then the piecewise function:

rect'(z) = 1 if z > 0;  0 otherwise

Figure 13: The response of a ReLU nonlinearity.

Leaky ReLU: Traditional ReLU units by design do not propagate any error for non-positive z; the leaky ReLU modifies this such that a small error is allowed to propagate backwards even when z is negative:

leaky(z) = max(z, k·z),  where 0 < k < 1

This way, the derivative is representable as:

leaky'(z) = 1 if z > 0;  k otherwise

Figure 14: The response of a leaky ReLU nonlinearity.
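The following minimal NumPy sketch implements several of these nonlinearities and their gradients so that they can be dropped into the forward and backward passes above; the function names and the default leaky slope k = 0.01 are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def hardtanh(z):
    return np.clip(z, -1.0, 1.0)

def hardtanh_grad(z):
    return ((z >= -1.0) & (z <= 1.0)).astype(float)

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0.0).astype(float)

def leaky_relu(z, k=0.01):
    return np.maximum(z, k * z)

def leaky_relu_grad(z, k=0.01):
    return np.where(z > 0.0, 1.0, k)

z = np.linspace(-2, 2, 5)
print(relu(z), relu_grad(z))   # activations and their local gradients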

2.5 Data Preprocessing

As is the case with machine learning models generally, a key step to ensuring that your model obtains reasonable performance on the task at hand is to perform basic preprocessing on your data. Some common techniques are outlined below.

Mean Subtraction: Given a set of input data X, it is customary to zero-center the data by subtracting the mean feature vector of X from X. An important point is that in practice, the mean is calculated only across the training set, and this mean is subtracted from the training, validation, and testing sets.

Normalization: Another frequently used technique (though perhaps less so than mean subtraction) is to scale every input feature dimension to have similar ranges of magnitudes. This is useful since input features are often measured in different units, but we often want to initially consider all features as equally important. The way we accomplish this is by simply dividing the features by their respective standard deviation calculated across the training set.

Whitening: Not as commonly used as mean subtraction + normalization, whitening essentially converts the data to have an identity covariance matrix; that is, features become uncorrelated and have a variance of 1. This is done by first mean-subtracting the data, as usual, to get X'. We can then take the Singular Value Decomposition (SVD) of X' to get matrices U, S, V. We then compute UX' to project X' into the basis defined by the columns of U. We finally divide each dimension of the result by the corresponding singular value in S to scale our data appropriately (if a singular value is zero, we can just divide by a small number instead).
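A minimal sketch of these three steps on a toy data matrix (rows are examples, columns are features; the whitening step follows one common SVD-of-the-covariance recipe rather than the exact factorization described above, and all names and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
scales = np.array([1.0, 5.0, 0.5, 2.0, 3.0])
X_train = rng.standard_normal((100, 5)) * scales
X_test = rng.standard_normal((20, 5)) * scales

# Mean subtraction: statistics come from the training set only.
mean = X_train.mean(axis=0)
X_train_c = X_train - mean
X_test_c = X_test - mean              # the same training mean is used for val/test data

# Normalization: divide by the per-feature standard deviation of the training set.
std = X_train_c.std(axis=0)
X_train_n = X_train_c / std
X_test_n = X_test_c / std

# Whitening: SVD of the covariance of the centered data, rotate into that basis,
# then rescale each dimension (guarding against zero singular values).
cov = X_train_c.T @ X_train_c / X_train_c.shape[0]
U, S, Vt = np.linalg.svd(cov)
X_white = (X_train_c @ U) / np.sqrt(S + 1e-5)

print(X_train_n.mean(axis=0).round(3), X_white.std(axis=0).round(3))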

2.6 Parameter Initialization

A key step towards achieving superlative performance with a neural network is initializing the parameters in a reasonable way. A good starting strategy is to initialize the weights to small random numbers normally distributed around 0, and in practice this often works acceptably well. However, in "Understanding the difficulty of training deep feedforward neural networks" (2010), Xavier Glorot et al. study the effect of different weight and bias initialization schemes on training dynamics. The empirical findings suggest that for sigmoid and tanh activation units, faster convergence and lower error rates are achieved when the weights of a matrix W ∈ R^(n^(l+1) × n^(l)) are initialized randomly with a uniform distribution as follows:

W ~ U[ -sqrt(6 / (n^(l) + n^(l+1))),  sqrt(6 / (n^(l) + n^(l+1))) ]

where n^(l) is the number of input units to W (fan-in) and n^(l+1) is the number of output units from W (fan-out). In this parameter initialization scheme, bias units are initialized to 0. This approach attempts to maintain activation variances as well as backpropagated gradient variances across layers. Without such initialization, the gradient variances (which are a proxy for information) generally decrease with backpropagation across layers.
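A minimal sketch of this initialization for one weight matrix; the helper name and the layer sizes (taken from the earlier window-scoring example) are illustrative:

import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    # W ~ U[-sqrt(6/(fan_in + fan_out)), +sqrt(6/(fan_in + fan_out))]; biases start at 0.
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    W = rng.uniform(-bound, bound, size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

rng = np.random.default_rng(0)
W1, b1 = xavier_uniform(fan_in=20, fan_out=8, rng=rng)
print(W1.shape, W1.min().round(3), W1.max().round(3), b1)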

2.7 Learning Strategies

The rate/magnitude of model parameter updates during training can be controlled using the learning rate. In the following naive Gradient Descent formulation, α is the learning rate:

θ_new = θ_old - α ∇_θ J_t(θ)

You might think that for fast convergence rates, we should set α to larger values; however, faster convergence is not guaranteed with larger learning rates. In fact, with very large learning rates, we might experience that the loss function actually diverges because the parameter update causes the model to overshoot the convex minima, as shown in Figure 15. In non-convex models (most of those we work with), the outcome of a large learning rate is unpredictable, but the chances of diverging loss functions are very high.

Figure 15: Here we see that updating parameter w_2 with a large learning rate can lead to divergence of the error.

The simple solution to avoiding a diverging loss is to use a very small learning rate so that we carefully scan the parameter space; of course, if we use too small a learning rate, we might not converge in a reasonable amount of time, or might get caught in local minima. Thus, as with any other hyperparameter, the learning rate must be tuned effectively.

Since training is the most expensive phase in a deep learning system, some research has attempted to improve this naive approach to setting learning rates. For instance, Ronan Collobert scales the learning rate of a weight W_ij (where W ∈ R^(n^(l+1) × n^(l))) by the inverse square root of the fan-in of the neuron (n^(l)).

There are several other techniques that have proven to be effective as well. One such method is annealing, where, after several iterations, the learning rate is reduced in some way; this method ensures that we start off with a high learning rate and approach a minimum quickly, and as we get closer to the minimum, we start lowering our learning rate so that we can find the optimum under a more fine-grained scope. A common way to perform annealing is to reduce the learning rate α by a factor x after every n iterations of learning. Exponential decay is also common, where the learning rate α at iteration t is given by α(t) = α_0 e^(-kt), where α_0 is the initial learning rate and k is a hyperparameter. Another approach is to allow the learning rate to decrease over time such that:

α(t) = α_0 τ / max(t, τ)

In the above scheme, α_0 is a tunable parameter and represents the starting learning rate, and τ is also a tunable parameter and represents the time at which the learning rate should start reducing. In practice, this method has been found to work quite well. In the next section we discuss another method for adaptive gradient descent which does not require hand-set learning rates.
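A minimal sketch of the annealing schedules mentioned above (step decay, exponential decay, and the τ/max(t, τ) decay); all constants here are illustrative:

import numpy as np

alpha_0 = 0.1    # starting learning rate (illustrative)

def step_decay(t, factor=0.5, every=1000):
    # reduce alpha by a fixed factor after every `every` iterations
    return alpha_0 * (factor ** (t // every))

def exponential_decay(t, k=1e-3):
    # alpha(t) = alpha_0 * exp(-k t)
    return alpha_0 * np.exp(-k * t)

def inverse_time_decay(t, tau=1000):
    # alpha(t) = alpha_0 * tau / max(t, tau): constant until t = tau, then ~1/t
    return alpha_0 * tau / max(t, tau)

for t in [0, 500, 1000, 5000]:
    print(t, step_decay(t), round(exponential_decay(t), 4), inverse_time_decay(t))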

2.8 Momentum Updates

Momentum methods, a variant of gradient descent inspired by the study of dynamics and motion in physics, attempt to use the velocity of updates as a more effective update scheme. Pseudocode for momentum updates is shown below:

Snippet 2.2

# Computes a standard momentum update
# on parameters x
v = mu*v - alpha*grad_x
x += v

2.9 Adaptive Optimization Methods

AdaGrad is an implementation of standard stochastic gradient descent (SGD) with one key difference: the learning rate can vary for each parameter. The learning rate for each parameter depends on the history of gradient updates of that parameter, in a way such that parameters with a scarce history of updates are updated faster using a larger learning rate. In other words, parameters that have not been updated much in the past are likelier to have higher learning rates now. Formally:

θ_{t,i} = θ_{t-1,i} - (α / sqrt(Σ_{τ=1}^{t} g_{τ,i}²)) g_{t,i}   where   g_{t,i} = ∂J_t(θ)/∂θ_i

In this technique, we see that if the RMS of the history of gradients is extremely low, the learning rate is very high. A simple implementation of this technique is:

Snippet 2.3

# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / np.sqrt(cache + 1e-8)

Other common adaptive methods are RMSProp and Adam, whose update rules are shown below (courtesy of Andrej Karpathy):

Snippet 2.4

# Update rule for RMSProp
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)

Snippet 2.5

# Update rule for Adam
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)

RMSProp is a variant of AdaGrad that utilizes a moving average of squared gradients; in particular, unlike AdaGrad, its updates do not become monotonically smaller. The Adam update rule is in turn a variant of RMSProp, but with the addition of momentum-like updates. We refer the reader to the respective sources of these methods for more detailed analyses of their behavior.


NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

Notes on Frequency Estimation in Data Streams

Notes on Frequency Estimation in Data Streams Notes on Frequency Estmaton n Data Streams In (one of) the data streamng model(s), the data s a sequence of arrvals a 1, a 2,..., a m of the form a j = (, v) where s the dentty of the tem and belongs to

More information

Fisher Linear Discriminant Analysis

Fisher Linear Discriminant Analysis Fsher Lnear Dscrmnant Analyss Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan Fsher lnear

More information

Multigradient for Neural Networks for Equalizers 1

Multigradient for Neural Networks for Equalizers 1 Multgradent for Neural Netorks for Equalzers 1 Chulhee ee, Jnook Go and Heeyoung Km Department of Electrcal and Electronc Engneerng Yonse Unversty 134 Shnchon-Dong, Seodaemun-Ku, Seoul 1-749, Korea ABSTRACT

More information

Support Vector Machines

Support Vector Machines /14/018 Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x

More information

THE SUMMATION NOTATION Ʃ

THE SUMMATION NOTATION Ʃ Sngle Subscrpt otaton THE SUMMATIO OTATIO Ʃ Most of the calculatons we perform n statstcs are repettve operatons on lsts of numbers. For example, we compute the sum of a set of numbers, or the sum of the

More information

Limited Dependent Variables

Limited Dependent Variables Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations

1 Derivation of Rate Equations from Single-Cell Conductance (Hodgkin-Huxley-like) Equations Physcs 171/271 -Davd Klenfeld - Fall 2005 (revsed Wnter 2011) 1 Dervaton of Rate Equatons from Sngle-Cell Conductance (Hodgkn-Huxley-lke) Equatons We consder a network of many neurons, each of whch obeys

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

The Order Relation and Trace Inequalities for. Hermitian Operators

The Order Relation and Trace Inequalities for. Hermitian Operators Internatonal Mathematcal Forum, Vol 3, 08, no, 507-57 HIKARI Ltd, wwwm-hkarcom https://doorg/0988/mf088055 The Order Relaton and Trace Inequaltes for Hermtan Operators Y Huang School of Informaton Scence

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Why feed-forward networks are in a bad shape

Why feed-forward networks are in a bad shape Why feed-forward networks are n a bad shape Patrck van der Smagt, Gerd Hrznger Insttute of Robotcs and System Dynamcs German Aerospace Center (DLR Oberpfaffenhofen) 82230 Wesslng, GERMANY emal smagt@dlr.de

More information

1 Matrix representations of canonical matrices

1 Matrix representations of canonical matrices 1 Matrx representatons of canoncal matrces 2-d rotaton around the orgn: ( ) cos θ sn θ R 0 = sn θ cos θ 3-d rotaton around the x-axs: R x = 1 0 0 0 cos θ sn θ 0 sn θ cos θ 3-d rotaton around the y-axs:

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Topic 5: Non-Linear Regression

Topic 5: Non-Linear Regression Topc 5: Non-Lnear Regresson The models we ve worked wth so far have been lnear n the parameters. They ve been of the form: y = Xβ + ε Many models based on economc theory are actually non-lnear n the parameters.

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Formulas for the Determinant

Formulas for the Determinant page 224 224 CHAPTER 3 Determnants e t te t e 2t 38 A = e t 2te t e 2t e t te t 2e 2t 39 If 123 A = 345, 456 compute the matrx product A adj(a) What can you conclude about det(a)? For Problems 40 43, use

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information