Implementation and Optimization of Differentiable Neural Computers


Carol Hsin
Graduate Student in Computational & Mathematical Engineering
Stanford University
cshsin[at]stanford.edu

Abstract

We implemented and optimized Differentiable Neural Computers (DNCs) as described in the October 2016 DNC paper [1], on the bAbI dataset [25] and on the copy tasks described in the Neural Turing Machine paper [12]. This paper gives the reader a better understanding of this new and promising architecture through documentation of our DNC implementation approach and of the challenges we faced in optimizing DNCs. Given how recently the DNC paper came out, there was, other than the original paper, no explanation, implementation and experimentation at the level of detail this project has produced, which is why this project should be useful for others who want to experiment with DNCs: we successfully trained a high-performing DNC on the copy task, and our DNC's performance on the bAbI dataset was better than or equal to our LSTM baseline.

1. Introduction

One of the main challenges with LSTMs/GRUs is that they have difficulty learning long-term dependencies, because they must store their memories in hidden units that are essentially compressed, weight-selected input sequences. Hence the excitement when DeepMind released a paper this October documenting their results on a new deep learning architecture, dubbed the Differentiable Neural Computer (DNC), with read and write/erase access to an external memory matrix, allowing the model to learn over much longer time sequences than current LSTM models.

In this paper, we first give a brief overview of how the DNC fits into the deep learning historical framework, since we view DNCs as RNNs in the same way LSTMs are RNNs, as explained in Sec. 3. The rest of the paper focuses on the implementation and optimization of DNCs, which is essentially an unconstrained, non-convex optimization problem for which the KKT stationarity condition must hold at local optima (given that vanishing gradients are essentially not a problem for DNCs). A basic machine learning and linear algebra background is assumed, so common functions and concepts are used without introduction, with some explained in the appendix in Sec. 6.

We discovered that understanding and implementing a correct DNC is a non-trivial process, as documented in Sec. 3. We also discovered that DNCs are difficult to train, take a long time to converge, and are prone to over-fitting and instability throughout the learning process, which are the same challenges researchers have experienced in training NTMs, the DNC's direct predecessors [27]. Due to the computational cost, in both memory and time, of training DNCs, we were not able to train a DNC on the full joint bAbI tasks, but instead on a joint subset of tasks that was trainable within a reasonable time frame (3 days instead of weeks), as explained in Sec. 4.2. The DNC model succeeded in performing better than or equal to the LSTM baseline on these tasks. We also ran experiments on the copy task described in [12]; the results were highly successful and are documented in the appendix in Sec. 6.1, since this paper could not fit all of the copy task experiments and visualizations in addition to discussing the approach (which also serves as background on the theory of DNCs while documenting our implementation process) and the full bAbI experiments; even some of the bAbI plots and figures had to be placed in the appendix.

We want to emphasize that the reader should consider the copy task section in the appendix as a continuation of the paper (even though it was placed in the appendix to save space): given the simplicity and low RAM requirements of the copy task, more visualization with Tensorboard was possible, so Sec. 6.1 contains many plots that can help the reader better understand DNCs, such as the visualization of DNC memory matrices in Fig. 10. We implemented the DNC, with much unit-testing, in Python using Tensorflow 1.0. The experiments were run on a desktop CPU and GPU, and on an Azure VM with one GPU.

2. Background/related work

While neural networks (NNs) have a long history [20], they have gained popularity mainly recently because the availability of large datasets and massively parallel computing power, usually in the form of GPUs, has created an environment that allows the training of deeper and more complex architectures in a reasonable time frame. As such, deep learning techniques have received widespread attention by outperforming alternative methods (e.g. kernel machines and HMMs) on benchmark problems, first in computer vision [16] and speech recognition [8] [11], and now in natural language processing (NLP), the latter by outperforming on tasks including dependency parsing [4] [2], sentiment analysis [21] and machine translation [15]. Deep learning techniques have also performed well against benchmarks on tasks that combine computer vision and NLP, such as image captioning [24] and lip-reading [7]. With these successes, there seems to be a shift from traditional algorithms using human-engineered features and representations to deep learning algorithms that learn the representations from raw inputs (pixels, characters) to produce the desired output (class, sequence).

In the case of NLP, much of the improvement comes from the sequence modeling capabilities of recurrent NNs (RNNs) [10] [23], whose ability to model long-term dependencies is improved by using a gated activation function, as in the long short-term memory (LSTM) [14] and gated recurrent unit (GRU) [5] [6] RNN architectures. The RNN model extends conventional feedforward neural networks (FNNs) by reusing the weights at each time step (thus reducing the number of parameters to train) and by conditioning the output on all previous words through a hidden memory state (thus remembering the past). In an RNN, each input is a sequence (x_1, ..., x_t, ..., x_T) of vectors x_t ∈ R^X and each output is the prediction sequence (ŷ_1, ..., ŷ_t, ..., ŷ_T) of vectors ŷ_t ∈ R^Y, usually made into a probability distribution using the softmax function. The hidden state h_t ∈ R^H is updated at each time step as a function of a linear combination of the input x_t and the previous hidden state h_{t-1}. Below, f(·) is usually a smooth, bounded function such as the logistic sigmoid σ(·) or hyperbolic tangent tanh(·):

h_t = f(W [x_t; h_{t-1}])
ŷ_t = softmax(U h_t)

Theoretically, RNNs can capture any long-term dependencies in arbitrary input sequences, but in practice, training an RNN to do so is difficult since the gradients tend to vanish (to zero) or explode (to NaN, though this is solved by clipping gradients) [18] [13] [14]. The LSTM and GRU RNN models both address this by using gating units to control the flow of information into and out of the memories; the LSTM, in particular, has dedicated memory cells and is the baseline architecture used in our experiments. There are many LSTM formulations, and this project uses the variant from [1] without the multiple layers:

i_t = σ(W_i [x_t; h_{t-1}] + b_i)
f_t = σ(W_f [x_t; h_{t-1}] + b_f)
s_t = f_t ∘ s_{t-1} + i_t ∘ tanh(W_s [x_t; h_{t-1}] + b_s)
o_t = σ(W_o [x_t; h_{t-1}] + b_o)
h_t = o_t ∘ tanh(s_t)

where i_t, f_t, s_t and o_t are the input gate, forget gate, state and output gate vectors, respectively, and the W's and b's are the weight matrices and biases to be learned. Note that [1] calls the memory cell (or state) s_t instead of the c_t generally used in the literature [5] [6] [11], since [1] uses the letter c to denote content vectors in the DNC equations; see Sec. 3.3.1.

With the LSTM and GRU models, longer-term dependencies can be learned, but in practice these models have limits on how long memories can persist, which becomes relevant in tasks such as question answering (QA), where the relevant information for an answer could be at the very start of the input sequence, which could consist of the vectorized words of several paragraphs [19]. This is the motivation for fully differentiable (thus trainable by gradient descent) models that contain a long-term, relatively isolated memory component that the model can learn to store inputs to and read from when computing the predicted output. Such memory-based models include Neural Turing Machines (NTMs) [12], Memory Networks (MemNets and MemN2Ns) [26] [22], Dynamic Memory Networks (DMNs) [17] and Differentiable Neural Computers (DNCs) [1], which can be viewed as a type of NTM since they were designed by the same researchers.
MemNets and DMNs are based on using multiple RNNs (LSTMs/GRUs) as modules, such as separate RNN modules for input and output processing as in [23]. The DMN uses a GRU as the memory component, as opposed to the addressable memory in MemNets/MemN2Ns, NTMs and DNCs. While MemNets/MemN2Ns have primarily been tested on NLP tasks, NTMs were tested mainly on algorithmic tasks, such as copy, recall and sort, while DNCs were tested on a wider variety of tasks, achieving high performance on NLP tasks on the bAbI dataset [25], on algorithmic tasks such as computing shortest paths on graphs, and on a reinforcement learning task on a block puzzle game. Thus, DNCs are powerful models that have a long-term memory isolated from computation, which RNNs/LSTMs lack, and they have the potential to replicate the capabilities of a modern-day computer while being fully differentiable and thus trainable using conventional gradient-based methods, which is why they are the main topic of interest of this project.

3. Approach: DNC overview & implementation

We implemented the DNC by inheriting from the Tensorflow RNNCell interface, which is an abstract object representing an RNN layer, even though it is called a cell. This implementation approach takes advantage of Tensorflow's built-in unrolling capabilities, so all that is needed is to program the DNC logic in (output, new_state) = self.__call__(inputs, state), in addition to the other, more trivial, required functions. As shown in Fig. 1, this project organizes the main DNC logic into three main modules, shown in grey boxes that also give the sections in which they are described.

Figure 1: Our approach to implementing the DNC.

3.1. DNC utility functions

Most of the DNC utility functions in Sec. 6.2 were easy to implement or are already provided in Tensorflow. Others were implemented based on recognizing how they could be vectorized, such as the content-similarity-based weighting C(M, k, β), which results in a probability distribution vector, i.e. in S_N. The equation in the original paper was written using the cosine similarity D(·,·) as

C(M, k, β)[i] = exp{D(k, M[i,·]) β} / Σ_j exp{D(k, M[j,·]) β},   i = [1 : N]

but it can be vectorized and implemented as

z = M̂ k̂
C(M, k, β) = softmax(β z)

where M̂ is M with l2-normalized rows and k̂ is the l2-normalized k. Notice that the normalization (used in the cosine similarity function, Sec. 6.2) will produce NaN errors due to division by zero, since the memory matrix M is initialized to zero. This can be solved by adding a factor of ε = 1e-8 to the denominator, or by inserting a condition to skip the computation when the denominator is zero; both were implemented before the discovery of Tensorflow's built-in normalize function, which was the route taken in the end. A short numerical sketch of this weighting is given after the memory-module overview below.

3.2. DNC NN Controller

We first concatenate the input vector of the current time step, x_t, and the read vectors from the previous time step, r^1_{t-1}, ..., r^R_{t-1}, to form the NN controller input

χ_t = [x_t; r^1_{t-1}; ...; r^R_{t-1}]

which is then fed into the NN cell of the NN controller, where the cell can be an LSTM or a FNN; this was coded using Tensorflow's default cell functions, with the mathematical definitions in Sec. 6.2. For consistency with the LSTM, call h_t the output of the NN cell at time step t; the final NN controller output is then a matrix product with h_t:

[ξ_t; ν_t] = W_h h_t

where W_h is among the weights θ learned by gradient descent. Thus, if we let N refer to the NN controller, then we can write the NN controller as

(ξ_t, ν_t) = N([χ_1; ...; χ_t]; θ)

where [χ_1; ...; χ_t] indicates dependence on previous elements of the current sequence. In this project, after the vector concatenation, the controller is implemented using Tensorflow's default functions rnn.BasicLSTMCell and layers.dense for the NN cell, followed by a final weight-matrix multiply and vector slicing to get the two vectors ξ_t and ν_t. As shown in Sec. 6.2, the interface vector ξ_t is split into the interface parameters, some of which are processed to the desired range by the utility functions in Sec. 6.2. The interface parameters are then used in the memory updates as explained in Sec. 3.3. The other NN controller output, ν_t, is used to compute the final DNC output y_t, as shown in Sec. 3.4, in a linear combination with the read vectors [r^1_t; ...; r^R_t] from the memory.

3.3. DNC memory module implementation

The interface parameters are used to update the memory matrix (denoted M_t ∈ R^{N×W}, where N is the number of memory locations, aka cells, and W is the memory word size, aka cell size) and to compute intermediate memory-related vectors that are used to compute the memory read vectors [r^1_t; ...; r^R_t], which are used both to compute the output of the current time step y_t and to feed as input into the NN controller as part of χ_{t+1} at the next time step. In this project, the memory interactions are organized by dividing them into writing to memory and reading from memory, and then further subdividing them into content-based address weightings and what this project calls "history"-based address weightings, since the latter are computed from previous read and write weightings. We use this terminology after observing similarities between content vs. history focusing in the DNC equations and content vs. location focusing in the NTM equations [12]. To make this part of the paper clearer, we created a diagram of this project's organization of the DNC memory implementation, shown in Fig. 2. Hexagons mark the interface variables computed from the NN controller. Blue diamonds are variables that must be kept and updated each time step. Dashed circles are intermediate computed values that are merely consumed and forgotten.
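
To make the vectorized content weighting from Sec. 3.1 concrete, the following is a minimal NumPy sketch of C(M, k, β) (math only; the actual project used Tensorflow's built-in normalize and softmax ops, and the function name and ε value here are ours):

```python
import numpy as np

def content_weighting(M, k, beta, eps=1e-8):
    """Content-based addressing C(M, k, beta) over the rows of M.

    M: (N, W) memory matrix, k: (W,) lookup key, beta: scalar key strength >= 0.
    Returns a probability distribution over the N memory slots.
    """
    # l2-normalize the memory rows and the key; eps guards the all-zero case.
    M_hat = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k) + eps)
    z = beta * (M_hat @ k_hat)            # scaled cosine similarity to each row
    z = z - z.max()                       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Example: a freshly zero-initialized memory gives a uniform weighting.
M = np.zeros((4, 6))
k = np.random.randn(6)
print(content_weighting(M, k, beta=5.0))  # ~[0.25, 0.25, 0.25, 0.25]
```
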
Figure 2: Diagram of our approach to organizing and implementing the DNC memory module.

The variables used in the figures are the same as defined in [1], with the DNC glossary from [1] reproduced in Sec. 6.2 for convenience, as are the memory equations from [1], which can be viewed in Sec. 6.2. Observe from Fig. 2 that the DNC is structurally similar to an LSTM, except that the DNC has multiple vector-storing memory cells (vs. just one in an LSTM) and the DNC has addressing mechanisms to choose from which memory cell(s) to write and read.

3.3.1. Content-based write and read weightings

An advantage of memory cells with vector values is that they allow a content-based addressing, or soft attention, mechanism that can select a memory cell based on the similarity (cosine for the DNC and NTM) of its contents to a specified key k [3] [9]. Recall that the DNC uses the same content-based focusing mechanism used in NTMs [12]; the main changes between the two papers were in notation, which in the DNC is

c_t = C(M, k, β)

as defined in Sec. 3.1 and implemented by this project as a softmax over the matrix product of the normed tensors (M and k) scaled by β ∈ [0, ∞), as shown in Sec. 3.1. As stated in Sec. 3.1, the result c_t can be viewed as a probability distribution over the rows of M based on the similarity of each row (cell) to k, as weighted by β. Thus, in Fig. 2, the box "Content-based weighting for write" is

c^w_t = C(M_{t-1}, k^w_t, β^w_t)

and the box "Content-based weighting for read" is

c^{r,i}_t = C(M_t, k^{r,i}_t, β^{r,i}_t)   ∀ i ∈ [1 : R]

From Fig. 2, observe that c^w_t is computed first in order to compute the final write weighting w^w_t used to update (erase and write to) the memory, so M_{t-1} → M_t, the update shown in the dashed lines. Then each c^{r,i}_t is used to compute each final read weighting w^{r,i}_t used to produce each read vector r^i_t from the updated memory M_t.

3.3.2. History-based write weighting

In Fig. 2, the box "History-based write weighting" is the process dubbed dynamic memory allocation by [1]:

ψ_t = ∏_{i=1}^R (1 - f^i_t w^{r,i}_{t-1})
u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t
φ_t = SortIndicesAscending(u_t)
a_t[φ_t[j]] = (1 - u_t[φ_t[j]]) ∏_{l=1}^{j-1} u_t[φ_t[l]]

As defined in Sec. 6.2, the inputs are the R free gates f^i_t ∈ [0, 1] from the interface vector, the R previous time-step read weightings w^{r,i}_{t-1} ∈ Δ_N, the previous time-step write weighting w^w_{t-1} ∈ Δ_N, and the previous time-step usage vector u_{t-1}. The output is the allocation ("history"-based) write weighting a_t ∈ Δ_N, which will be convexly combined with the content-based write weighting c^w_t to form the final write weighting w^w_t. The intermediate variables are the memory retention vector ψ_t ∈ [0, 1]^N and the indices φ_t ∈ N^N of slots sorted by usage.

The general idea behind these equations is to bias the selection of memory cells (to erase and write to), as determined by a_t, toward those that the DNC has recently read from (ψ_t) and away from cells the DNC has recently written to (w^w_{t-1}). Note that ψ_t will be lower for memory slots that have recently been read, which can be taken as indicating that a memory has been parsed or consumed and, if the corresponding free gate is high, deemed insignificant, so the DNC can forget those memories and write new ones to those slots; conversely, slots with high ψ_t should be retained (not erased). Intuitively, the usage vector u_t tracks memory cells that are still in use, i.e. cells that were used (u_{t-1}), have been written to (w^w_{t-1}), and are deemed significant (by ψ_t), and so are retained. A sketch of these allocation equations follows.
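
The allocation equations above translate almost line-for-line into array code. Below is a minimal NumPy sketch (the project itself implemented this in Tensorflow with top_k and a one-hot permutation matrix, as described next; the function and argument names here are ours):

```python
import numpy as np

def allocation_weighting(u_prev, w_write_prev, w_reads_prev, free_gates):
    """Dynamic memory allocation from the DNC paper, NumPy version.

    u_prev:        (N,)   previous usage vector
    w_write_prev:  (N,)   previous write weighting
    w_reads_prev:  (R, N) previous read weightings
    free_gates:    (R,)   free gates in [0, 1]
    Returns (a, u): allocation weighting and updated usage, both (N,).
    """
    # Retention: how much each slot should be kept, given what was just read.
    psi = np.prod(1.0 - free_gates[:, None] * w_reads_prev, axis=0)
    # Usage: was used, or was just written to, then scaled by retention.
    u = (u_prev + w_write_prev - u_prev * w_write_prev) * psi
    # Allocate to the least-used slots first.
    phi = np.argsort(u)                   # indices in ascending usage order
    a = np.zeros_like(u)
    cumprod = 1.0
    for j in phi:
        a[j] = (1.0 - u[j]) * cumprod
        cumprod *= u[j]
    return a, u

# Example with N = 4 slots and R = 1 read head: slot 1 was just read and freed,
# so nearly all allocation mass lands on it.
a, u = allocation_weighting(np.array([0.9, 0.1, 0.5, 0.0]),
                            np.zeros(4),
                            np.array([[0.0, 1.0, 0.0, 0.0]]),
                            np.array([1.0]))
print(a.round(2), u.round(2))
```
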

The only complication in the implementation was the sorting and rearrangement. We sorted using Tensorflow's top_k, which gives the indices along with a sorted tensor. Let û_t denote the sorted u_t and â_t be a_t in the arrangement given by φ_t. Then

(-û_t, φ_t) = top_k(-u_t, k = N)

and â_t is trivially computed from û_t. The main difficulty is in the rearrangement: since Tensorflow tensors are immutable, one cannot preallocate a tensor and then index into it to change its value. Our solution was to construct a permutation matrix by turning the indices into one-hot vectors, so if we had â_t = [a; b; c] with φ_t = [2; 3; 1], then

a_t = [c; a; b] = P(φ_t) â_t

where P(·) turns its vector input into a permutation matrix (the matrix whose j-th column is the one-hot vector for index φ_t[j]). Note that this method may not be the most space-efficient and may have added to the RAM and computational time of the DNC training process, but we could not think of a better and cleaner way of implementing this.

3.3.3. Final write weighting and memory writes

In Fig. 2, the box "Final write weighting" is

w^w_t = g^w_t [g^a_t a_t + (1 - g^a_t) c^w_t]

Since g^a_t ∈ [0, 1], observe that g^a_t a_t + (1 - g^a_t) c^w_t is a convex combination of the "history"-based write weighting a_t and the content-based weighting c^w_t, so if g^a_t ≈ 1, the DNC ignores the content-based weighting (and vice versa) in choosing which memory slots to erase/write. The write gate g^w_t ∈ [0, 1] dictates whether to update at all, so if g^w_t ≈ 0, then no slots will be erased/written. As shown in Fig. 2, the write weighting w^w_t is used to update the memory matrix and to compute the "history"-based read weightings, and is then kept as the new w^w_{t-1} for the next time step. The box "Write to memory", which is

M_t = M_{t-1} ∘ (E - w^w_t e^T_t) + w^w_t v^T_t

completes the writing portion of the memory module. The erase vector e_t ∈ [0, 1]^W and write vector v_t ∈ R^W were unpacked from the interface vector (see Sec. 6.2), and E = {1}^{N×W}, so the equation can be written without E as

M_t = M_{t-1} - M_{t-1} ∘ (w^w_t e^T_t) + w^w_t v^T_t

where the middle term erases and the last term writes: row j of w^w_t e^T_t is the erase vector scaled by the write weighting of memory cell j, and similarly for w^w_t v^T_t. Note that the Hadamard product with M_{t-1} isolates erasing from writing, since without it, empty memory slots would get written to by the erase term instead of leaving a cleaner slot for the write. These implementations are trivial, so they are not discussed.

3.3.4. History-based read weightings

Recall from Sec. 3.3.2 that the idea behind the history-based write weighting was to bias the selection of memory cells (to be erased/written to) toward those that were a combination of most recently read from, least recently written to, or deemed inconsequential by the free gates. In contrast, the R history-based read weightings select memory cells (to read from) based on the order in which the cells were written to, relative to the writing time of the cells read from in the previous time step: the preference is for cells written to right before (as measured by the backward weightings b^i_t ∈ Δ_N) or right after (as measured by the forward weightings f^i_t ∈ Δ_N) the time at which the cells the DNC just read from were written to. This is evident from the equations for the Fig. 2 box "History-based read weightings":

p_t = (1 - Σ_{i=1}^N w^w_t[i]) p_{t-1} + w^w_t
L_t[i, j] = (1 - w^w_t[i] - w^w_t[j]) L_{t-1}[i, j] + w^w_t[i] p_{t-1}[j]
L_t[i, i] = 0   ∀ i ∈ [1 : N]
f^i_t = L_t w^{r,i}_{t-1}
b^i_t = L^T_t w^{r,i}_{t-1}

where the precedence weighting p_t ∈ Δ_N keeps track of the degree to which each memory slot was most recently written to. Observe that the DNC updates p_t based on how much writing occurred at the current time step, as measured by the current write weighting w^w_t ∈ Δ_N: if w^w_t ≈ 0, then barely any writing happened at this time step, so p_t ≈ p_{t-1}, indicating that the write history is carried over; and if 1^T w^w_t ≈ 1, the previous precedence is nearly replaced, so p_t ≈ w^w_t. Updating based on how much writing happened is also built into the recursive equations for the temporal memory link matrix L_t ∈ [0, 1]^{N×N}, which tracks the order in which memory locations were written to, such that each row and column of L_t is in Δ_N, so L_t 1 ≤ 1 and L^T_t 1 ≤ 1, where ≤ is applied element-wise. Observe that w^w_t[i] p_{t-1}[j] is the amount written to memory location i at this time step times the extent to which location j was written to recently, so L_t[i, j] is the extent to which memory slot i was written to just after memory slot j was written to.

Further observe that the more the DNC writes to either i or j, the more the DNC updates L_t[i, j], so if not much was written to those slots at the current time step, the previous time step's links are mostly carried over. Also observe that the link matrix recursion can be vectorized as

L̂_t = [E - w^w_t 1^T - 1 (w^w_t)^T] ∘ L_{t-1} + w^w_t (p_{t-1})^T
L_t = L̂_t ∘ (E - I)   (removes self-links)

where I is the usual identity matrix. Note that these equations are trivially implemented with Tensorflow broadcasting, so there is no need to actually compute the two outer products of w^w_t with 1. As explained earlier, time goes forward from columns to rows in the link matrix, so the equations for f^i_t (propagating forward once) and b^i_t (propagating backward once) are intuitive, as are the implementations, so neither is discussed. A short sketch of the write and link-matrix updates follows.
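
The erase/write update and the vectorized link-matrix recursion above can be sketched directly in NumPy (a sketch of the math only; the shapes and function names are ours, following the glossary in Sec. 6.2):

```python
import numpy as np

def write_memory(M_prev, w_write, erase_vec, write_vec):
    """M_t = M_{t-1} o (E - w e^T) + w v^T for a single write head."""
    erase = np.outer(w_write, erase_vec)          # (N, W), rows scaled by w
    add = np.outer(w_write, write_vec)            # (N, W)
    return M_prev * (1.0 - erase) + add

def update_links(L_prev, p_prev, w_write):
    """Vectorized temporal link matrix and precedence updates."""
    N = w_write.shape[0]
    # L_hat[i, j] = (1 - w[i] - w[j]) L_prev[i, j] + w[i] p_prev[j]
    scale = 1.0 - w_write[:, None] - w_write[None, :]
    L = scale * L_prev + np.outer(w_write, p_prev)
    L *= 1.0 - np.eye(N)                          # remove self-links
    p = (1.0 - w_write.sum()) * p_prev + w_write  # precedence update
    return L, p

# Forward/backward read weightings from a previous read weighting w_r_prev
# would then be: forward = L @ w_r_prev, backward = L.T @ w_r_prev.
```
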

3.3.5. Final read weighting and memory reads

Similar to the write weighting w^w_t, the i-th read weighting w^{r,i}_t is a convex combination of the corresponding content-based read weighting c^{r,i}_t and the history-based read weightings f^i_t and b^i_t, as is evident from the equation for the Fig. 2 box "Final read weighting":

w^{r,i}_t = π^i_t[1] b^i_t + π^i_t[2] c^{r,i}_t + π^i_t[3] f^i_t

where π^i_t ∈ S_3 is the read mode vector that governs the extent to which the DNC prioritizes reading from memory slots based on the reverse of the order in which they were written (π^i_t[1]), their content similarity (π^i_t[2]), or the order in which they were written (π^i_t[3]). As seen in Fig. 2, the i-th read weighting w^{r,i}_t is then passed to the "Read from memory" box

r^i_t = M^T_t w^{r,i}_t

producing the i-th read vector r^i_t ∈ R^W. These implementations are trivial, so they are not discussed.

3.4. DNC final output

As shown in Fig. 1, these R read vectors are then concatenated to compute the final DNC output y_t ∈ R^Y

y_t = W_r [r^1_t; ...; r^R_t] + ν_t

as explained in Sec. 6.2. The concatenated R read vectors are also concatenated with the next input x_{t+1} to produce χ_{t+1}, which is fed into the NN controller at the next time step, as explained in Sec. 3.2. These implementations are trivial, so they are not discussed.

4. Experiments and Results

As explained in Sec. 3, we implemented the DNC as a Tensorflow RNNCell object, so the DNC can be used as one would use Tensorflow's BasicLSTMCell, which is also an RNNCell, without needing to manually implement the sequence unrolling. This means the code used to run a DNC training session can be used to run a session for any RNNCell object simply by switching the RNNCell object, i.e. from the DNC to an LSTM, which was the baseline model used for all the experiments in this project. Thus, in all the experiments, an LSTM baseline model was run first, both as a reference for the DNC loss curve and as further correctness verification of the data processing and model training pipelines.

4.1. Copy experiments

At the start of the experimentation on the bAbI dataset, the results were rather poor and chaotic, so to debug and further verify that the DNC was correctly implemented, we decided to experiment with achieving high performance on the simpler copy tasks, which are described in both the DNC [1] and NTM [12] papers. The smaller and easier-to-visualize copy tasks left us enough RAM to visualize the memory and temporal link matrices, which helped in fixing bugs in the implementation. Since this project should be focused primarily on NLP, the experiments and results from the copy tasks, which were all highly successful, are in the appendix in Sec. 6.1 due to the page limits.

4.2. bAbI experiments

The original intent of this project was to reproduce the bAbI results of [1]; however, given the large and complex dataset, even when using the GPU, training the DNC on the full joint bAbI tasks was taking more than 15 hours to complete even one epoch, which made the full 20-task joint experiment inconceivable given the time frame and limited compute power. To understand the bottlenecks, we computed statistics over the bAbI dataset, displayed in Fig. 17, and decided that some tasks were simply too big (e.g. Task 3, with a maximum input sequence length of 1920) to be trained in a reasonable time frame. Therefore, we selected the smaller-sized tasks from the bAbI dataset, specifically the 6 tasks 1, 4, 9, 10, 11 and 14, for the joint training instead of the full 20 tasks.

For all of the experiments, unless otherwise specified, the models were trained with a batch size of 16 using RMSProp with learning rate 1e-4 and momentum 0.9; the DNC settings were N = 256, W = 64, R = 4 with an LSTM controller with H = 256. The baseline LSTM model had the same H. The gradients were clipped by the global norm.

4.2.1. Data and metrics

The bAbI task inputs consist of word sequences interspersed with questions within self-contained stories, and while the datasets also include supporting facts that could be used to strongly supervise the learning, we followed the DNC paper's settings and only considered the weakly supervised setting.

For each story k, we constructed input sequences x^(k) of one-hot vectors x^(k)_t ∈ [0, 1]^V, where V is the vocabulary size. We reserved a token '-' to be the signal that an answer is required and '*' to be the padding for sequences shorter than the maximum sequence length, which we call T. The target output consists of '*' for positions that do not require answers and the vectorized word answers for the positions with '-' in the input sequence, so the target sequence also has maximum length T, with y^(k)_t ∈ [0, 1]^V. For each sequence, a mask m^(k) ∈ [0, 1]^T such that m^(k)[t] = 1{answer required at t} was also computed and passed to the model along with the input and target outputs, in order to ignore the irrelevant predicted outputs at time steps where no answers were required. The loss was the average softmax cross-entropy-with-logits loss, so the final output of both the DNC cell and the LSTM cell was passed through the softmax function; if h_t was the cell output, then ŷ_t = softmax(h_t) as defined in Sec. 6.2. Thus, the loss for a single prediction is

L = (1/T) Σ_{t=1}^T m_t L(y_t, ŷ_t)
L(y_t, ŷ_t) = - Σ_{i=1}^V y_t[i] log(ŷ_t[i])
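
A minimal NumPy sketch of this masked, time-averaged cross-entropy (the project itself used Tensorflow's built-in softmax cross-entropy-with-logits op; the function name and array shapes here are illustrative assumptions):

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Average masked softmax cross-entropy over one sequence.

    logits:  (T, V) raw cell outputs
    targets: (T, V) one-hot targets
    mask:    (T,)   1 where an answer is required, 0 elsewhere
    """
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    per_step = -np.sum(targets * np.log(probs + 1e-12), axis=1)  # (T,)
    return np.sum(mask * per_step) / len(mask)

# Example: T = 2 steps, V = 3 words; only the second step is an answer.
logits = np.array([[0.1, 0.2, 0.3], [2.0, 0.1, 0.1]])
targets = np.array([[0, 0, 1], [1, 0, 0]])
mask = np.array([0.0, 1.0])
print(masked_cross_entropy(logits, targets, mask))
```
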

The accuracy is the average number of questions answered exactly correctly, so if a question has two words in the answer, the model has to get both words in the right order for the answer to be marked correct.

4.2.2. Single tasks

Before expending the computational power to train the DNC on the joint dataset, DNC models were trained separately on single tasks, both to verify that the hyper-parameters were reasonable and that the training pipeline was correctly implemented. Task 1 and Task 15 were chosen for these experiments. For each task, the dev set was a randomly chosen 10% of the training set reserved for tuning; this split ratio was also used for the joint training.

Figure 3: Plots showing the DNC overfits tiny datasets. (a) Task 1 DNC loss plots. (b) Task 15 DNC loss plots.

The DNC overfit both datasets, as shown in Fig. 19 and Fig. 3, but the LSTM also overfits these datasets (see Fig. 18 in the appendix), so it may be that the models are too complex for these smaller tasks. However, these experiments were still useful, since a common machine learning sanity check is to ensure a model can overfit a tiny subset of the full dataset. Since the full dataset consists of tasks 1, 4, 9, 10, 11 and 14, the size and variety of the joint dataset may prevent over-fitting in the full training step.

4.2.3. Joint task results

We trained two DNC models, each of which took over two days on the GPU. We trained one DNC model using Adam with learning rate 1e-3 instead of RMSProp, and one using the settings from the DNC paper [1], i.e. RMSProp with learning rate 1e-4 and momentum 0.9, clipping the gradients by value to [-10, 10] instead of by global norm. However, we kept the batch size of 16 instead of the batch size of 1 used in [1]. An LSTM baseline model was also trained. Please see the appendix, Fig. 20, for the train and dev comparison plots for all models. As can be seen in Fig. 4 and Fig. 5, the DNC does slightly better than the LSTM in terms of loss and accuracy (batch-averaged).

Figure 4: DNC vs LSTM joint-task trained models, loss plots.

Figure 5: DNC vs LSTM on joint tasks: train and dev loss (L) and accuracy (A) for the DNC and the LSTM, with A* the accuracy on one batch.

The joint-trained DNC and LSTM models were then run on the bAbI test sets for Tasks 1, 4, 9, 10, 11 and 14, and the results, displayed in Fig. 6, show that our DNC model performs better than or equal to our LSTM baseline on all the tasks. The mean accuracy ranges, as defined by the standard deviations from the DNC paper [1], for the DNC and the LSTM are also provided, in addition to the weakly supervised LSTM model results from the bAbI paper [25] for the tasks on which the experiment was run.

Figure 6: DNC vs LSTM joint-task trained models, per-task accuracy comparison on the test set.

Task                      | our DNC | our LSTM | [1] DNC range | [1] LSTM range | [25] LSTM
1: single-supporting-fact |         |          | [1.00, 0.78]  | [0.67, 0.53]   |
4: two-arg-relations      |         |          | [1.00, 0.99]  | [1.00, 0.99]   |
9: simple-negation        |         |          | [1.00, 0.84]  | [0.86, 0.83]   |
10: indefinite-knowledge  |         |          | [1.00, 0.79]  | [0.73, 0.70]   |
11: basic-coreference     |         |          | [1.00, 0.91]  | [0.91, 0.84]   |
14: time-reasoning        |         |          | [0.96, 0.81]  | [0.45, 0.43]   | 0.27

Recall that in all our bAbI experiments, we followed the DNC paper's settings in that all the models were weakly supervised, as the bAbI dataset paper [25] calls it: the models do not use the supporting facts to answer the questions in the bAbI tasks, so the models received no data other than the word sequences. The MemNN models that achieved near-perfect accuracy on the bAbI dataset used what [25] termed strong supervision, and no results on the weakly supervised task were provided, so we could not use their numbers.

Observe that while our LSTM baseline had higher performance than the LSTM baseline from the bAbI paper [25], quite a few of its results were not within the mean-and-standard-deviation range of the DNC paper results [1], and the same held for our DNC model. We believe this is because we only had time to train our models for about 30 epochs, while [1] had the computing resources to train all of their models to completion, in addition to having 20 randomized models per architecture type. The comparisons may also be difficult because our models were only jointly trained on 6 of the 20 bAbI tasks, while the models in the literature were jointly trained on all 20. As stated earlier, we had tried to train the DNC on all 20, but after running for over 15 hours, it had yet to complete even one epoch, let alone the low bar of 30 epochs we were aiming for.

5. Conclusion

Thus, we have implemented DNCs and conducted experiments with them on the copy and bAbI tasks. The copy tasks were highly successful and allowed for the visualization of the DNC throughout the learning process, thus also serving as a certificate of the correctness of the DNC implementation. The bAbI tasks were more complex and therefore required more computational power than we currently have, which is why a scaled-down experiment was conducted, consisting of joint training on 6 instead of the full 20 bAbI tasks, each chosen for being smaller, as shown in Fig. 17, and therefore faster to train.

Given more time and computing power, we would have liked to train DNC models on the full 20 tasks and to iterate over the models to get better hyper-parameters. Given more RAM, we would have liked to produce visualizations of the DNC learning process the way we did for the smaller copy tasks, as shown in Sec. 6.1. We would also have liked to write the code to visualize the temporal link matrix as the DNC passes through one input sequence, to get a better understanding of the mechanics of the DNC. We would also have liked to do a mathematical exercise on the link matrix equations, like the one in Sec. 6.3, to better understand the mechanics of its formulation rather than just the intuition. We would also have liked to do more experiments on the single tasks, such as the dropout experiment in the appendix.

In conclusion, we have been very thorough in documenting our approach to understanding and implementing DNCs, along with the challenges we faced in optimizing them. We hope this project will be useful for others interested in DNCs and/or machine learning architectures with external memory.

References

[1] A. Graves, G. Wayne, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.
[2] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. Globally normalized transition-based neural networks. CoRR.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR.
[4] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Empirical Methods in Natural Language Processing (EMNLP).
[5] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR.
[6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR.
[7] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. CoRR.
[8] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.
[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[10] A. Graves. Generating sequences with recurrent neural networks. CoRR.

[11] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR.
[12] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR.
[13] S. Hochreiter, Y. Bengio, and P. Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, A Field Guide to Dynamical Recurrent Networks. IEEE Press.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), Nov. 1997.
[15] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25. Curran Associates, Inc.
[17] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR.
[18] R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR.
[19] M. Richardson, C. J. C. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text.
[20] J. Schmidhuber. Deep learning in neural networks: An overview. CoRR.
[21] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
[22] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR.
[24] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CoRR.
[25] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR.
[26] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR.
[27] W. Zhang, Y. Yu, and B. Zhou. Structured memory for neural Turing machines. CoRR.

6. Appendix

6.1. Copy task

Since DNCs are the descendants of NTMs, they should have the same capabilities as NTMs. In the NTM paper, the researchers devised a copy task where, given inputs of random bit-vector sequences whose lengths varied between 2 and 20, the NTMs were tasked with outputting a copy of the input after receiving the entire input sequence and a delimiter flag indicating that the input had ended and the copy should start [12]. The trained NTMs were then given inputs of sequence length larger than the sequences on which they were trained, such as length 30, to see if the NTMs could generalize the copy algorithm; this experiment was not present in the DNC paper [1], even though the researchers used the copy task to verify memory allocation and the speed of sparse link matrices. We figured copy generalization would be an interesting experiment for the DNC. In addition, the structural simplicity of the copy task also allowed for verification of the DNC implementation on a tiny copy task where the sequence lengths were fixed to be 3. Throughout these experiments, an LSTM served as the baseline model with hidden layer size H = 100.
The DNC models had a FNN as the NN controller, as opposed to an LSTM, so that the DNC could only store the memories in its memory matrix. All models were trained using RMSProp with learning rate 1e-5 and momentum 0.9.

6.1.1. Data

We used settings from both the DNC paper [1] and the NTM paper [12], since the NTM paper had more extensive copy task experiments. As in the papers, the inputs consist of a sequence of length-6 random binary vectors, so the domain is {0, 1}^6, but since the last bit is reserved for the delimiter flag d, which tells the model to reproduce ("predict") the input sequence, the model actually receives bit vectors in {0, 1}^7. Call T the maximum sequence length; the model is then fed inputs of sequence length 2T+1, which is the same as the target output sequence length, as depicted in Fig. 7.

Figure 7: Example of T = 3 and b = 6 copy data, where x_4 is the delimiter, so x_{1:3}[1:6] is the relevant input and y_{5:7}[1:6] is the relevant output. (a) Input sequence x. (b) Target output sequence y.

Since the experiments include variable as well as fixed sequence lengths, call T_k the sequence length of input sequence k; then the input is a (2T+1)-length sequence

x^(k) = [x_1; ...; x_{T_k}; d; 0; ...; 0]

and the target output is a sequence of the same length

y^(k) = [0; ...; 0; x_1; ...; x_{T_k}; 0; ...; 0]

which is better depicted in Fig. 8. In both the fixed and variable sequence length cases, a mask vector m^(k) ∈ {0, 1}^{2T+1} such that m^(k)_t = 1{y^(k)_t is relevant} is also included, to be used in the loss and accuracy functions.

Figure 8: Example of T = 20 and T_k = 5 copy data, where x_6 is the delimiter flag. (a) Input sequence x. (b) Target output sequence y.
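
A minimal NumPy sketch of this data generation (the batch dimension is omitted and the function name is ours, not from the project code):

```python
import numpy as np

def make_copy_example(T_max, T_k, bits=6, rng=np.random):
    """Build one (input, target, mask) triple for the copy task.

    The input has bits+1 channels; the last channel is the delimiter flag.
    Sequence layout: [pattern (T_k), delimiter (1), silence], total 2*T_max + 1.
    """
    steps = 2 * T_max + 1
    x = np.zeros((steps, bits + 1))
    y = np.zeros((steps, bits))
    m = np.zeros(steps)
    pattern = rng.randint(0, 2, size=(T_k, bits))
    x[:T_k, :bits] = pattern                     # random bit vectors to copy
    x[T_k, bits] = 1.0                           # delimiter: "now reproduce the input"
    y[T_k + 1:T_k + 1 + T_k] = pattern           # target is the delayed copy
    m[T_k + 1:T_k + 1 + T_k] = 1.0               # only the copy window counts
    return x, y, m

x, y, m = make_copy_example(T_max=20, T_k=5)
print(x.shape, y.shape, int(m.sum()))            # (41, 7) (41, 6) 5
```
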

6.1.2. Loss function and metrics

We trained the model on a sigmoid cross-entropy loss where irrelevant outputs are masked, so the loss for prediction [ŷ_1; ...; ŷ_{2T+1}] and target [y_1; ...; y_{2T+1}] with mask m is

L = (1 / (1^T m)) Σ_{t=1}^{2T+1} m_t L(ŷ_t, y_t)
L(ŷ_t, y_t) = -(1/6) Σ_{i=1}^{6} [ y_t[i] log ŷ_t[i] + (1 - y_t[i]) log(1 - ŷ_t[i]) ]

The accuracy of a prediction [ŷ_1; ...; ŷ_{2T+1}] is calculated as the average number of bit matches with the target output [y_1; ...; y_{2T+1}], using the mask m to ignore the irrelevant portions:

A = (1 / (6 · 1^T m)) Σ_{t=1}^{2T+1} m_t Σ_{i=1}^{6} 1{ŷ_t[i] = y_t[i]}
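
A NumPy sketch of this masked sigmoid cross-entropy and bit-match accuracy (math only; the project used Tensorflow's built-in sigmoid cross-entropy op, and the function name here is ours):

```python
import numpy as np

def copy_loss_and_accuracy(logits, targets, mask):
    """Masked sigmoid cross-entropy and bit-match accuracy for the copy task.

    logits: (2T+1, 6) raw outputs, targets: (2T+1, 6) bits, mask: (2T+1,).
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(probs + 1e-12)
           + (1 - targets) * np.log(1 - probs + 1e-12)).mean(axis=1)   # (2T+1,)
    loss = np.sum(mask * ce) / mask.sum()
    # Bits are thresholded at 0.5 before matching against the targets.
    matches = ((probs > 0.5) == targets.astype(bool)).mean(axis=1)     # (2T+1,)
    acc = np.sum(mask * matches) / mask.sum()
    return loss, acc
```
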

6.1.3. Model checks and optimization on the tiny task

Due to the expensive computational requirements of training DNCs, as a further correctness check we trained the DNC with inputs of a fixed sequence length of T = 3 rather than the full task of inputs with variable sequence lengths between 2 and 20. We used this tinier problem as a sanity check on the DNC, since it should be able to get to zero loss on this problem very quickly; the tiny task was also used for hyper-parameter tuning and for testing out Tensorboard capabilities. For the tiny copy task, the DNC settings were N = 10, W = 12, R = 1.

Figure 9: The DNC with a 2-layer FNN converges much faster than the LSTM or the 1-layer FNN DNC.

As observed from the loss plots in Fig. 9, the DNC with a 1-layer FNN controller reached zero loss by step 6k and its loss curve closely follows the LSTM baseline, while a DNC with a 2-layer FNN controller reached zero loss by step 3k. Since it is good practice to sanity-check the starting loss, observe that, since this is basically binary classification, the starting loss should be -log(0.5) ≈ 0.7, which holds for all models.

Figure 10: Examples of memory matrices throughout the training of the DNC on the fixed sequence length T = 3.

See Fig. 10 for images of the memory matrix produced with Tensorboard. Note that these memory matrices are not in any particular order; they were chosen to show that the DNC is writing to specific memory cells (i.e. rows of M) and seems to be erasing others (see the ghost rows).

Figure 11: Three examples of perfect prediction results for the fully trained DNC with fixed sequence length T = 3, so only ŷ_{5:7}[1:6] is relevant.

See Fig. 11 for examples of sample predictions and targets produced with Tensorboard. Note that only the last three columns of each prediction sequence are relevant, since the DNC was receiving an input of sequence length three followed by a delimiter for the first four time steps. Recall that since training deep learning models is equivalent to solving a non-convex, unconstrained optimization problem, another way to check convergence (granted that we are not experiencing vanishing gradients) is to check that the gradient norms converge to zero, which by the KKT conditions indicates convergence to a local optimum.

Figure 12: The gradient norms vs. iteration for all three models on the copy task with T = 3.

See Fig. 12 for the Tensorboard plot checking that the models have converged. Observe that the DNC models show convergence, but are more unstable, with gradient spikes (over 2000 for the DNC) that need to be clipped to prevent the learning process from getting side-tracked by the plateaus described in [18] [13].

6.1.4. Full copy task

For the full copy task, the DNC settings were N = 20, W = 12, R = 1. Tensorboard was used to display images of the predictions and targets throughout the DNC learning process, which are shown in Fig. 14. For the same number of iterations, the learning process took 8 hours for the DNC and half an hour for the LSTM, training on the CPU (the GPU was already in use) using RMSProp with learning rate 1e-5 and momentum 0.9.

Figure 13: DNC vs LSTM loss plots for sequence lengths in [2, 20].

Figure 14: Sampling of prediction results from the DNC throughout training on varied sequence lengths in [2, 20]. (a) Early training. (b) Mid training. (c) Late training.

Figure 15: The gradient norms vs. iteration for the 2-layer DNC trained on the variable-length copy task.

The loss plots in Fig. 13 show that the DNC learns faster and better than the LSTM, but also that the DNC learning process is very unstable, as there tend to be huge spikes in the loss plots even though the gradients have been clipped by the global norm, as observed in Fig. 15, where the norms do not surpass 5.

Figure 16: DNC vs LSTM results: loss on T_k ∈ [2, 20] and accuracy on T_k = 30 for each model.

To test how well the models generalize the copy task, they were tested on a batch of 100 input sequences where the sequence length was T_k = 30. The results in Fig. 16 show that the DNC generalizes the copy task better than the LSTM, which confirmed our hypothesis.
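
Since clipping by the global norm comes up repeatedly above, here is a short sketch of how such clipping can be wired into a TF 1.x training step (the optimizer choice and the clip threshold shown are illustrative assumptions, not the exact values used in every experiment):

```python
import tensorflow as tf

def clipped_train_op(loss, learning_rate=1e-4, momentum=0.9, clip_norm=10.0):
    """Build a train op whose gradients are clipped by their global norm."""
    optimizer = tf.train.RMSPropOptimizer(learning_rate, momentum=momentum)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    # global_norm can be logged to Tensorboard to produce plots like Fig. 12/15.
    tf.summary.scalar("gradient_global_norm", global_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```
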

6.2. DNC equations and definitions from [1]

Glossary

Δ_N = {α ∈ R^N : α_i ∈ [0, 1], 1^T α ≤ 1}
S_N = {α ∈ R^N : α_i ∈ [0, 1], 1^T α = 1}

Definitions

σ(x) = 1 / (1 + e^{-x}) ∈ [0, 1]
oneplus(x) = 1 + log(1 + e^x) ∈ [1, ∞)
softmax(x)_i = e^{x_i} / Σ_j e^{x_j} ∈ [0, 1]
D(u, v) = (u · v) / (‖u‖ ‖v‖)   (cosine similarity)
C(M, k, β)[i] = exp{D(k, M[i,·]) β} / Σ_j exp{D(k, M[j,·]) β} ∈ S_N
(A ∘ B)[i, j] = A[i, j] · B[i, j]   (Hadamard product of matrices)
(x ∘ y)[i] = x[i] · y[i]   (Hadamard product of vectors)

Controller update

χ_t = [x_t; r^1_{t-1}; ...; r^R_{t-1}]
(ξ_t, ν_t) = N([χ_1; ...; χ_t]; θ)

where N(·) is the neural-network-based controller, which consists of a cell that in this project was either a FNN or an LSTM; the cell outputs h_t as a function of the input χ_t, and of h_{t-1} if it is an LSTM. If the cell is a FNN, it has the form

h_t = relu(W_χ χ_t + b_χ)

otherwise, the cell is the LSTM of the form

i_t = σ(W_i [χ_t; h_{t-1}] + b_i)
f_t = σ(W_f [χ_t; h_{t-1}] + b_f)
s_t = f_t ∘ s_{t-1} + i_t ∘ tanh(W_s [χ_t; h_{t-1}] + b_s)
o_t = σ(W_o [χ_t; h_{t-1}] + b_o)
h_t = o_t ∘ tanh(s_t)

A linear operation is used to get (ξ_t, ν_t):

[ξ_t; ν_t] = W_h h_t

Interface (ξ_t) unpacking

Split the vector ξ_t ∈ R^{(W·R)+3W+5R+3} into the following components, then use the utility functions above to preprocess some of the components:

ξ_t = [k^{r,1}_t; ...; k^{r,R}_t; β̂^{r,1}_t; ...; β̂^{r,R}_t; k^w_t; β̂^w_t; ê_t; v_t; f̂^1_t; ...; f̂^R_t; ĝ^a_t; ĝ^w_t; π̂^1_t; ...; π̂^R_t]

read keys k^{r,i}_t ∈ R^W
read strengths β^{r,i}_t = oneplus(β̂^{r,i}_t) ∈ R
write key k^w_t ∈ R^W
write strength β^w_t = oneplus(β̂^w_t) ∈ R
erase vector e_t = σ(ê_t) ∈ R^W
write vector v_t ∈ R^W
free gates f^i_t = σ(f̂^i_t) ∈ R
allocation gate g^a_t = σ(ĝ^a_t) ∈ R
write gate g^w_t = σ(ĝ^w_t) ∈ R
read modes π^i_t = softmax(π̂^i_t) ∈ R^3

Initial conditions

u_0 = 0;  p_0 = 0;  L_0 = 0;  L_t[i, i] = 0 ∀ i
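
A NumPy sketch of this unpacking, assuming the component ordering listed above (the helper names are ours):

```python
import numpy as np

def oneplus(x):
    return 1.0 + np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unpack_interface(xi, W, R):
    """Split the interface vector xi (length W*R + 3W + 5R + 3) into its parts."""
    def take(n):
        nonlocal xi
        out, xi = xi[:n], xi[n:]
        return out
    read_keys = take(W * R).reshape(R, W)
    read_strengths = oneplus(take(R))
    write_key = take(W)
    write_strength = oneplus(take(1))
    erase = sigmoid(take(W))
    write_vec = take(W)
    free_gates = sigmoid(take(R))
    alloc_gate = sigmoid(take(1))
    write_gate = sigmoid(take(1))
    read_modes = take(3 * R).reshape(R, 3)
    read_modes = np.exp(read_modes) / np.exp(read_modes).sum(axis=1, keepdims=True)
    assert xi.size == 0                      # every element was consumed
    return (read_keys, read_strengths, write_key, write_strength, erase,
            write_vec, free_gates, alloc_gate, write_gate, read_modes)

# Example with the project's bAbI settings W = 64, R = 4.
parts = unpack_interface(np.random.randn(64 * 4 + 3 * 64 + 5 * 4 + 3), W=64, R=4)
```
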

Memory updates

The original paper's equations:

ψ_t = ∏_{i=1}^R (1 - f^i_t w^{r,i}_{t-1})
u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t
φ_t = SortIndicesAscending(u_t)
a_t[φ_t[j]] = (1 - u_t[φ_t[j]]) ∏_{l=1}^{j-1} u_t[φ_t[l]]
c^w_t = C(M_{t-1}, k^w_t, β^w_t)
w^w_t = g^w_t [g^a_t a_t + (1 - g^a_t) c^w_t]
M_t = M_{t-1} ∘ (E - w^w_t e^T_t) + w^w_t v^T_t
p_t = (1 - Σ_{i=1}^N w^w_t[i]) p_{t-1} + w^w_t
L_t[i, j] = (1 - w^w_t[i] - w^w_t[j]) L_{t-1}[i, j] + w^w_t[i] p_{t-1}[j]
f^i_t = L_t w^{r,i}_{t-1}
b^i_t = L^T_t w^{r,i}_{t-1}
c^{r,i}_t = C(M_t, k^{r,i}_t, β^{r,i}_t)
w^{r,i}_t = π^i_t[1] b^i_t + π^i_t[2] c^{r,i}_t + π^i_t[3] f^i_t
r^i_t = M^T_t w^{r,i}_t

Output

The final output is a linear combination of the vector ν_t from the controller and the concatenated read vectors r^1_t, ..., r^R_t:

y_t = W_r [r^1_t; ...; r^R_t] + ν_t

6.3. DNC mathematical exercise

For most of the DNC equations, why the authors of the DNC paper [1] formulated the expressions the way they did was intuitive, but some of the equations were not quite so. In building our understanding of the DNC equations, we went through some proofs to ensure we understood the "why" behind the mathematical expressions, particularly for the usage vector equation, which, through this exercise, we think was formulated to ensure u_t ∈ [0, 1]^N.

To see that u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t ∈ [0, 1]^N, observe that a + b - ab ∈ [0, 1] if a, b ∈ [0, 1]:

a + b - ab = (1 - b)a + b = (1 - b)a - (1 - b) + 1 = (1 - b)(a - 1) + 1

Since a, b ∈ [0, 1], we have -1 ≤ a - 1 ≤ 0 and 0 ≤ 1 - b ≤ 1, so -1 ≤ (1 - b)(a - 1) ≤ 0, which implies 0 ≤ (1 - b)(a - 1) + 1 ≤ 1. Applying this element-wise, u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1}) ∈ [0, 1]^N, and since ψ_t ∈ [0, 1]^N, we must have u_t ∈ [0, 1]^N.

6.4. bAbI experiments, supplementary info

Figure 17: bAbI task statistics (vocab_size, maxlenX and max#Q per task) for tasks qa1_single-supporting-fact, qa2_two-supporting-facts, qa3_three-supporting-facts, qa4_two-arg-relations, qa5_three-arg-relations, qa6_yes-no-questions, qa7_counting, qa8_lists-sets, qa9_simple-negation, qa10_indefinite-knowledge, qa11_basic-coreference, qa12_conjunction, qa13_compound-coreference, qa14_time-reasoning, qa15_basic-deduction, qa16_basic-induction, qa17_positional-reasoning, qa18_size-reasoning, qa19_path-finding and qa20_agents-motivations.

Figure 18: Loss plots showing the LSTM overfitting Task 1 (dev accuracy and dev loss shown in the plots).

Figure 19: DNC overfitting in the single-task experiments: train loss, dev loss, train accuracy and dev accuracy for Task 1 and Task 15.

Figure 20: Joint-task models: training vs. dev set loss and accuracy plots. (a) DNC with the settings from [1]. (b) DNC using Adam: notice the over-fitting. (c) LSTM.

Dropout experiment (after the joint models)

While writing the conclusion, we had left-over GPU time, so, due to the over-fitting on the single tasks, we also ran a dropout experiment to understand what we would have done with more time; the dropout experiment was only run on a single task due to computational and temporal limitations. Our hypothesis was that introducing regularization would help the model generalize better to unseen data. The dropout experiment was only executed on Task 1, and the results, shown in Fig. 21, were somewhat promising: as in the experiment without dropout, the accuracy on the dev set starts flat-lining at about 0.5, but the difference between the dev and train loss plots was less drastic. However, the train loss for the DNC with dropout was higher than for the DNC alone at all iterations, and the DNC with dropout took more than twice as long to train, only to reach a higher loss than the DNC without dropout.

Figure 21: DNC trained on Task 1 with dropout.
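
The write-up above does not say exactly where dropout was inserted; purely as an illustration, one common choice is to drop units of the controller cell output before the final interface/output projection, e.g. in TF 1.x (the function and variable names are ours):

```python
import tensorflow as tf

def controller_output_with_dropout(h_t, W_h, keep_prob):
    """Apply dropout to the controller cell output before the final projection.

    h_t: [batch, H] cell output, W_h: [H, xi_size + Y] projection matrix,
    keep_prob: scalar tensor or float (use 1.0 at evaluation time).
    """
    h_dropped = tf.nn.dropout(h_t, keep_prob=keep_prob)
    return tf.matmul(h_dropped, W_h)
```
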


More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems.

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems. Mah 2250-004 Week 4 April 6-20 secions 7.-7.3 firs order sysems of linear differenial equaions; 7.4 mass-spring sysems. Mon Apr 6 7.-7.2 Sysems of differenial equaions (7.), and he vecor Calculus we need

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Topic Astable Circuits. Recall that an astable circuit has two unstable states;

Topic Astable Circuits. Recall that an astable circuit has two unstable states; Topic 2.2. Asable Circuis. Learning Objecives: A he end o his opic you will be able o; Recall ha an asable circui has wo unsable saes; Explain he operaion o a circui based on a Schmi inverer, and esimae

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Math From Scratch Lesson 34: Isolating Variables

Math From Scratch Lesson 34: Isolating Variables Mah From Scrach Lesson 34: Isolaing Variables W. Blaine Dowler July 25, 2013 Conens 1 Order of Operaions 1 1.1 Muliplicaion and Addiion..................... 1 1.2 Division and Subracion.......................

More information

More Digital Logic. t p output. Low-to-high and high-to-low transitions could have different t p. V in (t)

More Digital Logic. t p output. Low-to-high and high-to-low transitions could have different t p. V in (t) EECS 4 Spring 23 Lecure 2 EECS 4 Spring 23 Lecure 2 More igial Logic Gae delay and signal propagaion Clocked circui elemens (flip-flop) Wriing a word o memory Simplifying digial circuis: Karnaugh maps

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits EEE25 ircui Analysis I Se 4: apaciors, Inducors, and Firs-Order inear ircuis Shahriar Mirabbasi Deparmen of Elecrical and ompuer Engineering Universiy of Briish olumbia shahriar@ece.ubc.ca Overview Passive

More information

Object tracking: Using HMMs to estimate the geographical location of fish

Object tracking: Using HMMs to estimate the geographical location of fish Objec racking: Using HMMs o esimae he geographical locaion of fish 02433 - Hidden Markov Models Marin Wæver Pedersen, Henrik Madsen Course week 13 MWP, compiled June 8, 2011 Objecive: Locae fish from agging

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

From Complex Fourier Series to Fourier Transforms

From Complex Fourier Series to Fourier Transforms Topic From Complex Fourier Series o Fourier Transforms. Inroducion In he previous lecure you saw ha complex Fourier Series and is coeciens were dened by as f ( = n= C ne in! where C n = T T = T = f (e

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

Simplified Gating in Long Short-term Memory (LSTM) Recurrent Neural Networks

Simplified Gating in Long Short-term Memory (LSTM) Recurrent Neural Networks Simplified Gaing in Long Shor-erm Memory (LSTM) Recurren Neural Neworks Yuzhen Lu and Fahi M. Salem Circuis, Sysems, and Neural Neworks (CSANN) Lab Deparmen of Biosysems and Agriculural Engineering Deparmen

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Math 10B: Mock Mid II. April 13, 2016

Math 10B: Mock Mid II. April 13, 2016 Name: Soluions Mah 10B: Mock Mid II April 13, 016 1. ( poins) Sae, wih jusificaion, wheher he following saemens are rue or false. (a) If a 3 3 marix A saisfies A 3 A = 0, hen i canno be inverible. True.

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Seminar 4: Hotelling 2

Seminar 4: Hotelling 2 Seminar 4: Hoelling 2 November 3, 211 1 Exercise Par 1 Iso-elasic demand A non renewable resource of a known sock S can be exraced a zero cos. Demand for he resource is of he form: D(p ) = p ε ε > A a

More information

Removing Useless Productions of a Context Free Grammar through Petri Net

Removing Useless Productions of a Context Free Grammar through Petri Net Journal of Compuer Science 3 (7): 494-498, 2007 ISSN 1549-3636 2007 Science Publicaions Removing Useless Producions of a Conex Free Grammar hrough Peri Ne Mansoor Al-A'ali and Ali A Khan Deparmen of Compuer

More information

Distributed Deep Learning Parallel Sparse Autoencoder. 2 Serial Sparse Autoencoder. 1 Introduction. 2.1 Stochastic Gradient Descent

Distributed Deep Learning Parallel Sparse Autoencoder. 2 Serial Sparse Autoencoder. 1 Introduction. 2.1 Stochastic Gradient Descent Disribued Deep Learning Parallel Sparse Auoencoder Inroducion Abhik Lahiri Raghav Pasari Bobby Prochnow December 0, 00 Much of he bleeding edge research in he areas of compuer vision, naural language processing,

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9) CSE/NB 528 Lecure 14: Reinforcemen Learning Chaper 9 Image from hp://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg Lecure figures are from Dayan & Abbo s book hp://people.brandeis.edu/~abbo/book/index.hml

More information

!!"#"$%&#'()!"#&'(*%)+,&',-)./0)1-*23)

!!#$%&#'()!#&'(*%)+,&',-)./0)1-*23) "#"$%&#'()"#&'(*%)+,&',-)./)1-*) #$%&'()*+,&',-.%,/)*+,-&1*#$)()5*6$+$%*,7&*-'-&1*(,-&*6&,7.$%$+*&%'(*8$&',-,%'-&1*(,-&*6&,79*(&,%: ;..,*&1$&$.$%&'()*1$$.,'&',-9*(&,%)?%*,('&5

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Linear Time-invariant systems, Convolution, and Cross-correlation

Linear Time-invariant systems, Convolution, and Cross-correlation Linear Time-invarian sysems, Convoluion, and Cross-correlaion (1) Linear Time-invarian (LTI) sysem A sysem akes in an inpu funcion and reurns an oupu funcion. x() T y() Inpu Sysem Oupu y() = T[x()] An

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Spring Ammar Abu-Hudrouss Islamic University Gaza

Spring Ammar Abu-Hudrouss Islamic University Gaza Chaper 7 Reed-Solomon Code Spring 9 Ammar Abu-Hudrouss Islamic Universiy Gaza ١ Inroducion A Reed Solomon code is a special case of a BCH code in which he lengh of he code is one less han he size of he

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence CS 188 Fall 2018 Inroducion o Arificial Inelligence Wrien HW 9 Sol. Self-assessmen due: Tuesday 11/13/2018 a 11:59pm (submi via Gradescope) For he self assessmen, fill in he self assessmen boxes in your

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic

More information

Lab #2: Kinematics in 1-Dimension

Lab #2: Kinematics in 1-Dimension Reading Assignmen: Chaper 2, Secions 2-1 hrough 2-8 Lab #2: Kinemaics in 1-Dimension Inroducion: The sudy of moion is broken ino wo main areas of sudy kinemaics and dynamics. Kinemaics is he descripion

More information

The expectation value of the field operator.

The expectation value of the field operator. The expecaion value of he field operaor. Dan Solomon Universiy of Illinois Chicago, IL dsolom@uic.edu June, 04 Absrac. Much of he mahemaical developmen of quanum field heory has been in suppor of deermining

More information

Self assessment due: Monday 4/29/2019 at 11:59pm (submit via Gradescope)

Self assessment due: Monday 4/29/2019 at 11:59pm (submit via Gradescope) CS 188 Spring 2019 Inroducion o Arificial Inelligence Wrien HW 10 Due: Monday 4/22/2019 a 11:59pm (submi via Gradescope). Leave self assessmen boxes blank for his due dae. Self assessmen due: Monday 4/29/2019

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

Chapter 6. Systems of First Order Linear Differential Equations

Chapter 6. Systems of First Order Linear Differential Equations Chaper 6 Sysems of Firs Order Linear Differenial Equaions We will only discuss firs order sysems However higher order sysems may be made ino firs order sysems by a rick shown below We will have a sligh

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts)

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts) HW6: MRI Imaging Pulse Sequences (7 Problems for 100 ps) GOAL The overall goal of HW6 is o beer undersand pulse sequences for MRI image reconsrucion. OBJECTIVES 1) Design a spin echo pulse sequence o image

More information

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear In The name of God Lecure4: Percepron and AALIE r. Majid MjidGhoshunih Inroducion The Rosenbla s LMS algorihm for Percepron 958 is buil around a linear neuron a neuron ih a linear acivaion funcion. Hoever,

More information

Distributed Language Models Using RNNs

Distributed Language Models Using RNNs Disribued Language Models Using RNNs Ting-Po Lee ingpo@sanford.edu Taman Narayan amann@sanford.edu 1 Inroducion Language models are a fundamenal par of naural language processing. Given he prior words

More information

Energy Storage Benchmark Problems

Energy Storage Benchmark Problems Energy Sorage Benchmark Problems Daniel F. Salas 1,3, Warren B. Powell 2,3 1 Deparmen of Chemical & Biological Engineering 2 Deparmen of Operaions Research & Financial Engineering 3 Princeon Laboraory

More information