Implementation and Optimization of Differentiable Neural Computers


Carol Hsin
Graduate Student in Computational & Mathematical Engineering
Stanford University
cshsin[at]stanford.edu

Abstract

We implemented and optimized Differentiable Neural Computers (DNCs) as described in the October 2016 DNC paper [1], on the bAbI dataset [25] and on the copy tasks described in the Neural Turing Machine paper [12]. This paper gives the reader a better understanding of this new and promising architecture through documentation of our DNC implementation approach and of the challenges we faced in optimizing DNCs. Given how recently the DNC paper came out, there was, other than the original paper, no explanation, implementation and experimentation at the level of detail this project has produced, which is why this project should be useful for others who want to experiment with DNCs: we successfully trained a high-performing DNC on the copy task, and our DNC's performance on the bAbI dataset was better than or equal to our LSTM baseline.

1. Introduction

One of the main challenges with LSTMs/GRUs is that they have difficulty learning long-term dependencies, because they must store their memories in hidden units that are essentially compressed, weight-selected input sequences. Hence the excitement when DeepMind released a paper this October documenting their results on a new deep learning architecture, dubbed the Differentiable Neural Computer (DNC), with read and write/erase access to an external memory matrix, allowing the model to learn over much longer time sequences than current LSTM models.

In this paper, we first give a brief overview of how the DNC fits into the deep learning historical framework, since we view DNCs as RNNs in the same way LSTMs are RNNs, as explained in Sec. 3. The rest of the paper focuses on the implementation and optimization of DNCs, which is essentially an unconstrained, non-convex optimization problem for which the KKT stationarity condition must hold at local optima (given that vanishing gradients are essentially not a problem for DNCs). A basic machine learning and linear algebra background is assumed, so common functions and concepts are used without introduction, with some explained in the appendix in Sec. 6.

We discovered that understanding and implementing a correct DNC is a non-trivial process, as documented in Sec. 3. We also discovered that DNCs are difficult to train, take a long time to converge, and are prone to over-fitting and instability throughout the learning process, which are the same challenges researchers have experienced in training NTMs, the DNC's direct predecessors [27]. Due to the computational cost, in both memory and time, of training DNCs, we were not able to train a DNC on the full joint bAbI tasks, but instead on a joint subset of tasks that was trainable within a reasonable time frame (3 days instead of weeks), as explained in Sec. 4.2. The DNC model succeeded in performing better than or equal to the LSTM baseline on these tasks. We also ran experiments on the copy task described in [12]; the results were highly successful and are documented in the appendix in Sec. 6.1, since this paper could not fit all of the copy task experiments and visualizations in addition to discussing the approach (which also serves as background on the theory of DNCs while documenting our implementation process) and the full bAbI experiments; even some of the bAbI plots and figures had to be placed in the appendix.

We want to emphasize that the reader should consider the copy task section in the appendix as a continuation of the paper (even though it was placed in the appendix to save space): given the simplicity and low RAM requirements of the copy task, more visualization with Tensorboard was possible, so Sec. 6.1 contains many plots that can help the reader better understand DNCs, such as the visualization of DNC memory matrices in Fig. 10. We implemented the DNC, with much unit-testing, in Python using Tensorflow 1.0. The experiments were run on a desktop CPU and GPU, and on an Azure VM with one GPU.

2. Background/related work

While neural networks (NNs) have a long history [20], they have gained popularity mainly recently because the availability of large datasets and massively parallel computing power, usually in the form of GPUs, has created an environment that allows the training of deeper and more complex architectures in a reasonable time frame. As such, deep learning techniques have received widespread attention by outperforming alternative methods (e.g. kernel machines and HMMs) on benchmark problems, first in computer vision [16] and speech recognition [8] [11], and now in natural language processing (NLP), the latter by outperforming on tasks including dependency parsing [4] [2], sentiment analysis [21] and machine translation [15]. Deep learning techniques have also performed well against benchmarks on tasks that combine computer vision and NLP, such as image captioning [24] and lip-reading [7]. With these successes, there seems to be a shift from traditional algorithms using human-engineered features and representations to deep learning algorithms that learn the representations from raw inputs (pixels, characters) to produce the desired output (class, sequence).

In the case of NLP, much of the improvement comes from the sequence modeling capabilities of recurrent NNs (RNNs) [10] [23], whose ability to model long-term dependencies is improved by using a gated activation function, as in the long short-term memory (LSTM) [14] and gated recurrent unit (GRU) [5] [6] RNN architectures. The RNN model extends conventional feedforward neural networks (FNNs) by reusing the weights at each time step (thus reducing the number of parameters to train) and by conditioning the output on all previous words through a hidden memory state (thus remembering the past). In an RNN, each input is a sequence (x_1, ..., x_t, ..., x_T) of vectors x_t ∈ R^X and each output is the prediction sequence (ŷ_1, ..., ŷ_t, ..., ŷ_T) of vectors ŷ_t ∈ R^Y, usually made into a probability distribution using the softmax function. The hidden state h_t ∈ R^H is updated at each time step as a function of a linear combination of the input x_t and the previous hidden state h_{t-1}. Below, f(·) is usually a smooth, bounded function such as the logistic sigmoid σ(·) or hyperbolic tangent tanh(·):

h_t = f(W [x_t; h_{t-1}])
ŷ_t = softmax(U h_t)

Theoretically, RNNs can capture any long-term dependencies in arbitrary input sequences, but in practice, training an RNN to do so is difficult since the gradients tend to vanish (to zero) or explode (to NaN, though this is solved by clipping gradients) [18] [13] [14]. The LSTM and GRU RNN models both address this by using gating units to control the flow of information into and out of the memories; the LSTM, in particular, has dedicated memory cells and is the baseline architecture used in our experiments. There are many LSTM formulations, and this project uses the variant from [1] without the multiple layers:

i_t = σ(W_i [x_t; h_{t-1}] + b_i)
f_t = σ(W_f [x_t; h_{t-1}] + b_f)
s_t = f_t ∘ s_{t-1} + i_t ∘ tanh(W_s [x_t; h_{t-1}] + b_s)
o_t = σ(W_o [x_t; h_{t-1}] + b_o)
h_t = o_t ∘ tanh(s_t)

where i_t, f_t, s_t and o_t are the input gate, forget gate, state and output gate vectors, respectively, and the W's and b's are the weight matrices and biases to be learned. Note that [1] calls the memory cell (or state) s_t instead of the c_t generally used in the literature [5] [6] [11], since [1] uses the letter c to denote content vectors in the DNC equations; see Sec. 3.3.1.

With the LSTM and GRU models, longer-term dependencies can be learned, but in practice these models have limits on how long memories can persist, which becomes relevant in tasks such as question answering (QA), where the relevant information for an answer could be at the very start of the input sequence, which could consist of the vectorized words of several paragraphs [19]. This is the motivation for fully differentiable (thus trainable by gradient descent) models that contain a long-term, relatively isolated memory component that the model can learn to store inputs to and read from when computing the predicted output. Such memory-based models include Neural Turing Machines (NTMs) [12], Memory Networks (MemNets and MemN2Ns) [26] [22], Dynamic Memory Networks (DMNs) [17] and Differentiable Neural Computers (DNCs) [1], which can be viewed as a type of NTM since they were designed by the same researchers.
MemNets and DMNs are based on using multiple RNNs (LSTMs/GRUs) as modules, such as separate RNN modules for input and output processing as in [23]. The DMN uses a GRU as the memory component, as opposed to the addressable memory in MemNets/MemN2Ns, NTMs and DNCs. While MemNets/MemN2Ns have primarily been tested on NLP tasks, NTMs were tested mainly on algorithmic tasks, such as copy, recall and sort, while DNCs were tested on a wider variety of tasks, achieving high performance on NLP tasks on the bAbI dataset [25], on algorithmic tasks such as computing shortest paths on graphs, and on a reinforcement learning task on a block puzzle game. Thus, DNCs are powerful models that have a long-term memory isolated from computation, which RNNs/LSTMs lack, and they have the potential to replicate the capabilities of a modern-day computer while being fully differentiable and thus trainable using conventional gradient-based methods, which is why they are the main topic of interest of this project.

3. Approach: DNC overview & implementation

We implemented the DNC by inheriting from the Tensorflow RNNCell interface, which is an abstract object representing an RNN layer, even though it is called a cell. This implementation approach takes advantage of Tensorflow's built-in unrolling capabilities, so all that is needed is to program the DNC logic in (output, new_state) = self.__call__(inputs, state), in addition to the other, more trivial, required functions. As shown in Fig. 1, this project organizes the main DNC logic into three main modules, shown in grey boxes that also give the sections in which they are described.

Figure 1: Our approach to implementing the DNC.

3.1. DNC utility functions

Most of the DNC utility functions in Sec. 6.2 were easy to implement or are already provided in Tensorflow. Others were implemented based on recognizing how they could be vectorized, such as the content-similarity-based weighting C(M, k, β), which results in a probability distribution vector, i.e. in S_N. The equation in the original paper was written using the cosine similarity D(·,·) as

C(M, k, β)[i] = exp{D(k, M[i,·]) β} / Σ_j exp{D(k, M[j,·]) β},   i = [1 : N]

but it can be vectorized and implemented as

z = M̂ k̂
C(M, k, β) = softmax(β z)

where M̂ is M with l2-normalized rows and k̂ is the l2-normalized k. Notice that the normalization (used in the cosine similarity function, Sec. 6.2) will produce NaN errors due to division by zero, since the memory matrix M is initialized to zero. This can be solved by adding a factor of ε = 1e-8 to the denominator, or by inserting a condition to skip the computation when the denominator is zero; both were implemented before the discovery of Tensorflow's built-in normalize function, which was the route taken in the end. A short numerical sketch of this weighting is given after the memory-module overview below.

3.2. DNC NN Controller

We first concatenate the input vector of the current time step, x_t, and the read vectors from the previous time step, r^1_{t-1}, ..., r^R_{t-1}, to form the NN controller input

χ_t = [x_t; r^1_{t-1}; ...; r^R_{t-1}]

which is then fed into the NN cell of the NN controller, where the cell can be an LSTM or a FNN; this was coded using Tensorflow's default cell functions, with the mathematical definitions in Sec. 6.2. For consistency with the LSTM, call h_t the output of the NN cell at time step t; the final NN controller output is then a matrix product with h_t:

[ξ_t; ν_t] = W_h h_t

where W_h is among the weights θ learned by gradient descent. Thus, if we let N refer to the NN controller, then we can write the NN controller as

(ξ_t, ν_t) = N([χ_1; ...; χ_t]; θ)

where [χ_1; ...; χ_t] indicates dependence on previous elements of the current sequence. In this project, after the vector concatenation, the controller is implemented using Tensorflow's default functions rnn.BasicLSTMCell and layers.dense for the NN cell, followed by a final weight-matrix multiply and vector slicing to get the two vectors ξ_t and ν_t. As shown in Sec. 6.2, the interface vector ξ_t is split into the interface parameters, some of which are processed to the desired range by the utility functions in Sec. 6.2. The interface parameters are then used in the memory updates as explained in Sec. 3.3. The other NN controller output, ν_t, is used to compute the final DNC output y_t, as shown in Sec. 3.4, in a linear combination with the read vectors [r^1_t; ...; r^R_t] from the memory.

3.3. DNC memory module implementation

The interface parameters are used to update the memory matrix (denoted M_t ∈ R^{N×W}, where N is the number of memory locations, aka cells, and W is the memory word size, aka cell size) and to compute intermediate memory-related vectors that are used to compute the memory read vectors [r^1_t; ...; r^R_t], which are used both to compute the output of the current time step y_t and to feed as input into the NN controller as part of χ_{t+1} at the next time step. In this project, the memory interactions are organized by dividing them into writing to memory and reading from memory, and then further subdividing them into content-based address weightings and what this project calls "history"-based address weightings, since the latter are computed from previous read and write weightings. We use this terminology after observing similarities between content vs. history focusing in the DNC equations and content vs. location focusing in the NTM equations [12]. To make this part of the paper clearer, we created a diagram of this project's organization of the DNC memory implementation, shown in Fig. 2. Hexagons mark the interface variables computed from the NN controller. Blue diamonds are variables that must be kept and updated each time step. Dashed circles are intermediate computed values that are merely consumed and forgotten.
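
To make the vectorized content weighting from Sec. 3.1 concrete, the following is a minimal NumPy sketch of C(M, k, β) (math only; the actual project used Tensorflow's built-in normalize and softmax ops, and the function name and ε value here are ours):

```python
import numpy as np

def content_weighting(M, k, beta, eps=1e-8):
    """Content-based addressing C(M, k, beta) over the rows of M.

    M: (N, W) memory matrix, k: (W,) lookup key, beta: scalar key strength >= 0.
    Returns a probability distribution over the N memory slots.
    """
    # l2-normalize the memory rows and the key; eps guards the all-zero case.
    M_hat = M / (np.linalg.norm(M, axis=1, keepdims=True) + eps)
    k_hat = k / (np.linalg.norm(k) + eps)
    z = beta * (M_hat @ k_hat)            # scaled cosine similarity to each row
    z = z - z.max()                       # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Example: a freshly zero-initialized memory gives a uniform weighting.
M = np.zeros((4, 6))
k = np.random.randn(6)
print(content_weighting(M, k, beta=5.0))  # ~[0.25, 0.25, 0.25, 0.25]
```
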
Figure 2: Diagram of our approach to organizing and implementing the DNC memory module.

The variables used in the figures are the same as defined in [1], with the DNC glossary from [1] reproduced in Sec. 6.2 for convenience, as are the memory equations from [1], which can be viewed in Sec. 6.2. Observe from Fig. 2 that the DNC is structurally similar to an LSTM, except that the DNC has multiple vector-storing memory cells (vs. just one in an LSTM) and the DNC has addressing mechanisms to choose from which memory cell(s) to write and read.

3.3.1. Content-based write and read weightings

An advantage of memory cells with vector values is that they allow a content-based addressing, or soft attention, mechanism that can select a memory cell based on the similarity (cosine for the DNC and NTM) of its contents to a specified key k [3] [9]. Recall that the DNC uses the same content-based focusing mechanism used in NTMs [12]; the main changes between the two papers were in notation, which in the DNC is

c_t = C(M, k, β)

as defined in Sec. 3.1 and implemented by this project as a softmax over the matrix product of the normed tensors (M and k) scaled by β ∈ [0, ∞), as shown in Sec. 3.1. As stated in Sec. 3.1, the result c_t can be viewed as a probability distribution over the rows of M based on the similarity of each row (cell) to k, as weighted by β. Thus, in Fig. 2, the box "Content-based weighting for write" is

c^w_t = C(M_{t-1}, k^w_t, β^w_t)

and the box "Content-based weighting for read" is

c^{r,i}_t = C(M_t, k^{r,i}_t, β^{r,i}_t)   ∀ i ∈ [1 : R]

From Fig. 2, observe that c^w_t is computed first in order to compute the final write weighting w^w_t used to update (erase and write to) the memory, so M_{t-1} → M_t, the update shown in the dashed lines. Then each c^{r,i}_t is used to compute each final read weighting w^{r,i}_t used to produce each read vector r^i_t from the updated memory M_t.

3.3.2. History-based write weighting

In Fig. 2, the box "History-based write weighting" is the process dubbed dynamic memory allocation by [1]:

ψ_t = ∏_{i=1}^R (1 - f^i_t w^{r,i}_{t-1})
u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t
φ_t = SortIndicesAscending(u_t)
a_t[φ_t[j]] = (1 - u_t[φ_t[j]]) ∏_{l=1}^{j-1} u_t[φ_t[l]]

As defined in Sec. 6.2, the inputs are the R free gates f^i_t ∈ [0, 1] from the interface vector, the R previous time-step read weightings w^{r,i}_{t-1} ∈ Δ_N, the previous time-step write weighting w^w_{t-1} ∈ Δ_N, and the previous time-step usage vector u_{t-1}. The output is the allocation ("history"-based) write weighting a_t ∈ Δ_N, which will be convexly combined with the content-based write weighting c^w_t to form the final write weighting w^w_t. The intermediate variables are the memory retention vector ψ_t ∈ [0, 1]^N and the indices φ_t ∈ N^N of slots sorted by usage.

The general idea behind these equations is to bias the selection of memory cells (to erase and write to), as determined by a_t, toward those that the DNC has recently read from (ψ_t) and away from cells the DNC has recently written to (w^w_{t-1}). Note that ψ_t will be lower for memory slots that have recently been read, which can be taken as indicating that a memory has been parsed or consumed and, if the corresponding free gate is high, deemed insignificant, so the DNC can forget those memories and write new ones to those slots; conversely, slots with high ψ_t should be retained (not erased). Intuitively, the usage vector u_t tracks memory cells that are still in use, i.e. cells that were used (u_{t-1}), have been written to (w^w_{t-1}), and are deemed significant (by ψ_t), and so are retained. A sketch of these allocation equations follows.
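
The allocation equations above translate almost line-for-line into array code. Below is a minimal NumPy sketch (the project itself implemented this in Tensorflow with top_k and a one-hot permutation matrix, as described next; the function and argument names here are ours):

```python
import numpy as np

def allocation_weighting(u_prev, w_write_prev, w_reads_prev, free_gates):
    """Dynamic memory allocation from the DNC paper, NumPy version.

    u_prev:        (N,)   previous usage vector
    w_write_prev:  (N,)   previous write weighting
    w_reads_prev:  (R, N) previous read weightings
    free_gates:    (R,)   free gates in [0, 1]
    Returns (a, u): allocation weighting and updated usage, both (N,).
    """
    # Retention: how much each slot should be kept, given what was just read.
    psi = np.prod(1.0 - free_gates[:, None] * w_reads_prev, axis=0)
    # Usage: was used, or was just written to, then scaled by retention.
    u = (u_prev + w_write_prev - u_prev * w_write_prev) * psi
    # Allocate to the least-used slots first.
    phi = np.argsort(u)                   # indices in ascending usage order
    a = np.zeros_like(u)
    cumprod = 1.0
    for j in phi:
        a[j] = (1.0 - u[j]) * cumprod
        cumprod *= u[j]
    return a, u

# Example with N = 4 slots and R = 1 read head: slot 1 was just read and freed,
# so nearly all allocation mass lands on it.
a, u = allocation_weighting(np.array([0.9, 0.1, 0.5, 0.0]),
                            np.zeros(4),
                            np.array([[0.0, 1.0, 0.0, 0.0]]),
                            np.array([1.0]))
print(a.round(2), u.round(2))
```
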

The only complication in the implementation was the sorting and rearrangement. We sorted using Tensorflow's top_k, which gives the indices along with a sorted tensor. Let û_t denote the sorted u_t and â_t be a_t in the arrangement given by φ_t. Then

(-û_t, φ_t) = top_k(-u_t, k = N)

and â_t is trivially computed from û_t. The main difficulty is in the rearrangement: since Tensorflow tensors are immutable, one cannot preallocate a tensor and then index into it to change its value. Our solution was to construct a permutation matrix by turning the indices into one-hot vectors, so if we had â_t = [a; b; c] with φ_t = [2; 3; 1], then

a_t = [c; a; b] = P(φ_t) â_t

where P(·) turns its vector input into a permutation matrix (the matrix whose j-th column is the one-hot vector for index φ_t[j]). Note that this method may not be the most space-efficient and may have added to the RAM and computational time of the DNC training process, but we could not think of a better and cleaner way of implementing this.

3.3.3. Final write weighting and memory writes

In Fig. 2, the box "Final write weighting" is

w^w_t = g^w_t [g^a_t a_t + (1 - g^a_t) c^w_t]

Since g^a_t ∈ [0, 1], observe that g^a_t a_t + (1 - g^a_t) c^w_t is a convex combination of the "history"-based write weighting a_t and the content-based weighting c^w_t, so if g^a_t ≈ 1, the DNC ignores the content-based weighting (and vice versa) in choosing which memory slots to erase/write. The write gate g^w_t ∈ [0, 1] dictates whether to update at all, so if g^w_t ≈ 0, then no slots will be erased/written. As shown in Fig. 2, the write weighting w^w_t is used to update the memory matrix and to compute the "history"-based read weightings, and is then kept as the new w^w_{t-1} for the next time step. The box "Write to memory", which is

M_t = M_{t-1} ∘ (E - w^w_t e^T_t) + w^w_t v^T_t

completes the writing portion of the memory module. The erase vector e_t ∈ [0, 1]^W and write vector v_t ∈ R^W were unpacked from the interface vector (see Sec. 6.2), and E = {1}^{N×W}, so the equation can be written without E as

M_t = M_{t-1} - M_{t-1} ∘ (w^w_t e^T_t) + w^w_t v^T_t

where the middle term erases and the last term writes: row j of w^w_t e^T_t is the erase vector scaled by the write weighting of memory cell j, and similarly for w^w_t v^T_t. Note that the Hadamard product with M_{t-1} isolates erasing from writing, since without it, empty memory slots would get written to by the erase term instead of leaving a cleaner slot for the write. These implementations are trivial, so they are not discussed.

3.3.4. History-based read weightings

Recall from Sec. 3.3.2 that the idea behind the history-based write weighting was to bias the selection of memory cells (to be erased/written to) toward those that were a combination of most recently read from, least recently written to, or deemed inconsequential by the free gates. In contrast, the R history-based read weightings select memory cells (to read from) based on the order in which the cells were written to, relative to the writing time of the cells read from in the previous time step: the preference is for cells written to right before (as measured by the backward weightings b^i_t ∈ Δ_N) or right after (as measured by the forward weightings f^i_t ∈ Δ_N) the time at which the cells the DNC just read from were written to. This is evident from the equations for the Fig. 2 box "History-based read weightings":

p_t = (1 - Σ_{i=1}^N w^w_t[i]) p_{t-1} + w^w_t
L_t[i, j] = (1 - w^w_t[i] - w^w_t[j]) L_{t-1}[i, j] + w^w_t[i] p_{t-1}[j]
L_t[i, i] = 0   ∀ i ∈ [1 : N]
f^i_t = L_t w^{r,i}_{t-1}
b^i_t = L^T_t w^{r,i}_{t-1}

where the precedence weighting p_t ∈ Δ_N keeps track of the degree to which each memory slot was most recently written to. Observe that the DNC updates p_t based on how much writing occurred at the current time step, as measured by the current write weighting w^w_t ∈ Δ_N: if w^w_t ≈ 0, then barely any writing happened at this time step, so p_t ≈ p_{t-1}, indicating that the write history is carried over; and if 1^T w^w_t ≈ 1, the previous precedence is nearly replaced, so p_t ≈ w^w_t. Updating based on how much writing happened is also built into the recursive equations for the temporal memory link matrix L_t ∈ [0, 1]^{N×N}, which tracks the order in which memory locations were written to, such that each row and column of L_t is in Δ_N, so L_t 1 ≤ 1 and L^T_t 1 ≤ 1, where ≤ is applied element-wise. Observe that w^w_t[i] p_{t-1}[j] is the amount written to memory location i at this time step times the extent to which location j was written to recently, so L_t[i, j] is the extent to which memory slot i was written to just after memory slot j was written to.

Further observe that the more the DNC writes to either i or j, the more the DNC updates L_t[i, j], so if not much was written to those slots at the current time step, the previous time step's links are mostly carried over. Also observe that the link matrix recursion can be vectorized as

L̂_t = [E - w^w_t 1^T - 1 (w^w_t)^T] ∘ L_{t-1} + w^w_t (p_{t-1})^T
L_t = L̂_t ∘ (E - I)   (removes self-links)

where I is the usual identity matrix. Note that these equations are trivially implemented with Tensorflow broadcasting, so there is no need to actually compute the two outer products of w^w_t with 1. As explained earlier, time goes forward from columns to rows in the link matrix, so the equations for f^i_t (propagating forward once) and b^i_t (propagating backward once) are intuitive, as are the implementations, so neither is discussed. A short sketch of the write and link-matrix updates follows.
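
The erase/write update and the vectorized link-matrix recursion above can be sketched directly in NumPy (a sketch of the math only; the shapes and function names are ours, following the glossary in Sec. 6.2):

```python
import numpy as np

def write_memory(M_prev, w_write, erase_vec, write_vec):
    """M_t = M_{t-1} o (E - w e^T) + w v^T for a single write head."""
    erase = np.outer(w_write, erase_vec)          # (N, W), rows scaled by w
    add = np.outer(w_write, write_vec)            # (N, W)
    return M_prev * (1.0 - erase) + add

def update_links(L_prev, p_prev, w_write):
    """Vectorized temporal link matrix and precedence updates."""
    N = w_write.shape[0]
    # L_hat[i, j] = (1 - w[i] - w[j]) L_prev[i, j] + w[i] p_prev[j]
    scale = 1.0 - w_write[:, None] - w_write[None, :]
    L = scale * L_prev + np.outer(w_write, p_prev)
    L *= 1.0 - np.eye(N)                          # remove self-links
    p = (1.0 - w_write.sum()) * p_prev + w_write  # precedence update
    return L, p

# Forward/backward read weightings from a previous read weighting w_r_prev
# would then be: forward = L @ w_r_prev, backward = L.T @ w_r_prev.
```
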

3.3.5. Final read weighting and memory reads

Similar to the write weighting w^w_t, the i-th read weighting w^{r,i}_t is a convex combination of the corresponding content-based read weighting c^{r,i}_t and the history-based read weightings f^i_t and b^i_t, as is evident from the equation for the Fig. 2 box "Final read weighting":

w^{r,i}_t = π^i_t[1] b^i_t + π^i_t[2] c^{r,i}_t + π^i_t[3] f^i_t

where π^i_t ∈ S_3 is the read mode vector that governs the extent to which the DNC prioritizes reading from memory slots based on the reverse of the order in which they were written (π^i_t[1]), their content similarity (π^i_t[2]), or the order in which they were written (π^i_t[3]). As seen in Fig. 2, the i-th read weighting w^{r,i}_t is then passed to the "Read from memory" box

r^i_t = M^T_t w^{r,i}_t

producing the i-th read vector r^i_t ∈ R^W. These implementations are trivial, so they are not discussed.

3.4. DNC final output

As shown in Fig. 1, these R read vectors are then concatenated to compute the final DNC output y_t ∈ R^Y

y_t = W_r [r^1_t; ...; r^R_t] + ν_t

as explained in Sec. 6.2. The concatenated R read vectors are also concatenated with the next input x_{t+1} to produce χ_{t+1}, which is fed into the NN controller at the next time step, as explained in Sec. 3.2. These implementations are trivial, so they are not discussed.

4. Experiments and Results

As explained in Sec. 3, we implemented the DNC as a Tensorflow RNNCell object, so the DNC can be used as one would use Tensorflow's BasicLSTMCell, which is also an RNNCell, without needing to manually implement the sequence unrolling. This means the code used to run a DNC training session can be used to run a session for any RNNCell object simply by switching the RNNCell object, i.e. from the DNC to an LSTM, which was the baseline model used for all the experiments in this project. Thus, in all the experiments, an LSTM baseline model was run first, both as a reference for the DNC loss curve and as further correctness verification of the data processing and model training pipelines.

4.1. Copy experiments

At the start of the experimentation on the bAbI dataset, the results were rather poor and chaotic, so to debug and further verify that the DNC was correctly implemented, we decided to experiment with achieving high performance on the simpler copy tasks, which are described in both the DNC [1] and NTM [12] papers. The smaller and easier-to-visualize copy tasks left us enough RAM to visualize the memory and temporal link matrices, which helped in fixing bugs in the implementation. Since this project should be focused primarily on NLP, the experiments and results from the copy tasks, which were all highly successful, are in the appendix in Sec. 6.1 due to the page limits.

4.2. bAbI experiments

The original intent of this project was to reproduce the bAbI results of [1]; however, given the large and complex dataset, even when using the GPU, training the DNC on the full joint bAbI tasks was taking more than 15 hours to complete even one epoch, which made the full 20-task joint experiment inconceivable given the time frame and limited compute power. To understand the bottlenecks, we computed statistics over the bAbI dataset, displayed in Fig. 17, and decided that some tasks were simply too big (e.g. Task 3, with a maximum input sequence length of 1920) to be trained in a reasonable time frame. Therefore, we selected the smaller-sized tasks from the bAbI dataset, specifically the 6 tasks 1, 4, 9, 10, 11 and 14, for the joint training instead of the full 20 tasks.

For all of the experiments, unless otherwise specified, the models were trained with a batch size of 16 using RMSProp with learning rate 1e-4 and momentum 0.9; the DNC settings were N = 256, W = 64, R = 4 with an LSTM controller with H = 256. The baseline LSTM model had the same H. The gradients were clipped by the global norm.

4.2.1. Data and metrics

The bAbI task inputs consist of word sequences interspersed with questions within self-contained stories, and while the datasets also include supporting facts that could be used to strongly supervise the learning, we followed the DNC paper's settings and only considered the weakly supervised setting.

For each story k, we constructed input sequences x^(k) of one-hot vectors x^(k)_t ∈ [0, 1]^V, where V is the vocabulary size. We reserved a token '-' to be the signal that an answer is required and '*' to be the padding for sequences shorter than the maximum sequence length, which we call T. The target output consists of '*' for positions that do not require answers and the vectorized word answers for the positions with '-' in the input sequence, so the target sequence also has maximum length T, with y^(k)_t ∈ [0, 1]^V. For each sequence, a mask m^(k) ∈ [0, 1]^T such that m^(k)[t] = 1{answer required at t} was also computed and passed to the model along with the input and target outputs, in order to ignore the irrelevant predicted outputs at time steps where no answers were required. The loss was the average softmax cross-entropy-with-logits loss, so the final output of both the DNC cell and the LSTM cell was passed through the softmax function; if h_t was the cell output, then ŷ_t = softmax(h_t) as defined in Sec. 6.2. Thus, the loss for a single prediction is

L = (1/T) Σ_{t=1}^T m_t L(y_t, ŷ_t)
L(y_t, ŷ_t) = - Σ_{i=1}^V y_t[i] log(ŷ_t[i])
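
A minimal NumPy sketch of this masked, time-averaged cross-entropy (the project itself used Tensorflow's built-in softmax cross-entropy-with-logits op; the function name and array shapes here are illustrative assumptions):

```python
import numpy as np

def masked_cross_entropy(logits, targets, mask):
    """Average masked softmax cross-entropy over one sequence.

    logits:  (T, V) raw cell outputs
    targets: (T, V) one-hot targets
    mask:    (T,)   1 where an answer is required, 0 elsewhere
    """
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    per_step = -np.sum(targets * np.log(probs + 1e-12), axis=1)  # (T,)
    return np.sum(mask * per_step) / len(mask)

# Example: T = 2 steps, V = 3 words; only the second step is an answer.
logits = np.array([[0.1, 0.2, 0.3], [2.0, 0.1, 0.1]])
targets = np.array([[0, 0, 1], [1, 0, 0]])
mask = np.array([0.0, 1.0])
print(masked_cross_entropy(logits, targets, mask))
```
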

The accuracy is the average number of questions answered exactly correctly, so if a question has two words in the answer, the model has to get both words in the right order for the answer to be marked correct.

4.2.2. Single tasks

Before expending the computational power to train the DNC on the joint dataset, DNC models were trained separately on single tasks, both to verify that the hyper-parameters were reasonable and that the training pipeline was correctly implemented. Task 1 and Task 15 were chosen for these experiments. For each task, the dev set was a randomly chosen 10% of the training set reserved for tuning; this split ratio was also used for the joint training.

Figure 3: Plots showing the DNC overfits tiny datasets. (a) Task 1 DNC loss plots. (b) Task 15 DNC loss plots.

The DNC overfit both datasets, as shown in Fig. 19 and Fig. 3, but the LSTM also overfits these datasets (see Fig. 18 in the appendix), so it may be that the models are too complex for these smaller tasks. However, these experiments were still useful, since a common machine learning sanity check is to ensure a model can overfit a tiny subset of the full dataset. Since the full dataset consists of tasks 1, 4, 9, 10, 11 and 14, the size and variety of the joint dataset may prevent over-fitting in the full training step.

4.2.3. Joint task results

We trained two DNC models, each of which took over two days on the GPU. We trained one DNC model using Adam with learning rate 1e-3 instead of RMSProp, and one using the settings from the DNC paper [1], i.e. RMSProp with learning rate 1e-4 and momentum 0.9, clipping the gradients by value to [-10, 10] instead of by global norm. However, we kept the batch size of 16 instead of the batch size of 1 used in [1]. An LSTM baseline model was also trained. Please see the appendix, Fig. 20, for the train and dev comparison plots for all models. As can be seen in Fig. 4 and Fig. 5, the DNC does slightly better than the LSTM in terms of loss and accuracy (batch-averaged).

Figure 4: DNC vs LSTM joint-task trained models, loss plots.

Figure 5: DNC vs LSTM on joint tasks: train and dev loss (L) and accuracy (A) for the DNC and the LSTM, with A* the accuracy on one batch.

The joint-trained DNC and LSTM models were then run on the bAbI test sets for Tasks 1, 4, 9, 10, 11 and 14, and the results, displayed in Fig. 6, show that our DNC model performs better than or equal to our LSTM baseline on all the tasks. The mean accuracy ranges, as defined by the standard deviations from the DNC paper [1], for the DNC and the LSTM are also provided, in addition to the weakly supervised LSTM model results from the bAbI paper [25] for the tasks on which the experiment was run.

Figure 6: DNC vs LSTM joint-task trained models, per-task accuracy comparison on the test set.

Task                      | our DNC | our LSTM | [1] DNC range | [1] LSTM range | [25] LSTM
1: single-supporting-fact |         |          | [1.00, 0.78]  | [0.67, 0.53]   |
4: two-arg-relations      |         |          | [1.00, 0.99]  | [1.00, 0.99]   |
9: simple-negation        |         |          | [1.00, 0.84]  | [0.86, 0.83]   |
10: indefinite-knowledge  |         |          | [1.00, 0.79]  | [0.73, 0.70]   |
11: basic-coreference     |         |          | [1.00, 0.91]  | [0.91, 0.84]   |
14: time-reasoning        |         |          | [0.96, 0.81]  | [0.45, 0.43]   | 0.27

Recall that in all our bAbI experiments, we followed the DNC paper's settings in that all the models were weakly supervised, as the bAbI dataset paper [25] calls it: the models do not use the supporting facts to answer the questions in the bAbI tasks, so the models received no data other than the word sequences. The MemNN models that achieved near-perfect accuracy on the bAbI dataset used what [25] termed strong supervision, and no results on the weakly supervised task were provided, so we could not use their numbers.

Observe that while our LSTM baseline had higher performance than the LSTM baseline from the bAbI paper [25], quite a few of its results were not within the mean-and-standard-deviation range of the DNC paper results [1], and the same held for our DNC model. We believe this is because we only had time to train our models for about 30 epochs, while [1] had the computing resources to train all of their models to completion, in addition to having 20 randomized models per architecture type. The comparisons may also be difficult because our models were only jointly trained on 6 of the 20 bAbI tasks, while the models in the literature were jointly trained on all 20. As stated earlier, we had tried to train the DNC on all 20, but after running for over 15 hours, it had yet to complete even one epoch, let alone the low bar of 30 epochs we were aiming for.

5. Conclusion

Thus, we have implemented DNCs and conducted experiments with them on the copy and bAbI tasks. The copy tasks were highly successful and allowed for the visualization of the DNC throughout the learning process, thus also serving as a certificate of the correctness of the DNC implementation. The bAbI tasks were more complex and therefore required more computational power than we currently have, which is why a scaled-down experiment was conducted, consisting of joint training on 6 instead of the full 20 bAbI tasks, each chosen for being smaller, as shown in Fig. 17, and therefore faster to train.

Given more time and computing power, we would have liked to train DNC models on the full 20 tasks and to iterate over the models to get better hyper-parameters. Given more RAM, we would have liked to produce visualizations of the DNC learning process the way we did for the smaller copy tasks, as shown in Sec. 6.1. We would also have liked to write the code to visualize the temporal link matrix as the DNC passes through one input sequence, to get a better understanding of the mechanics of the DNC. We would also have liked to do a mathematical exercise on the link matrix equations, like the one in Sec. 6.3, to better understand the mechanics of its formulation rather than just the intuition. We would also have liked to do more experiments on the single tasks, such as the dropout experiment in the appendix.

In conclusion, we have been very thorough in documenting our approach to understanding and implementing DNCs, along with the challenges we faced in optimizing them. We hope this project will be useful for others interested in DNCs and/or machine learning architectures with external memory.

References

[1] A. Graves, G. Wayne, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 2016.
[2] D. Andor, C. Alberti, D. Weiss, A. Severyn, A. Presta, K. Ganchev, S. Petrov, and M. Collins. Globally normalized transition-based neural networks. CoRR.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. CoRR.
[4] D. Chen and C. D. Manning. A fast and accurate dependency parser using neural networks. In Empirical Methods in Natural Language Processing (EMNLP).
[5] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR.
[6] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR.
[7] J. S. Chung, A. W. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. CoRR.
[8] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.
[9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[10] A. Graves. Generating sequences with recurrent neural networks. CoRR.

[11] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. CoRR.
[12] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. CoRR.
[13] S. Hochreiter, Y. Bengio, and P. Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, A Field Guide to Dynamical Recurrent Networks. IEEE Press.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), Nov. 1997.
[15] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google's multilingual neural machine translation system: Enabling zero-shot translation. CoRR.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25. Curran Associates, Inc.
[17] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani, and R. Socher. Ask me anything: Dynamic memory networks for natural language processing. CoRR.
[18] R. Pascanu, T. Mikolov, and Y. Bengio. Understanding the exploding gradient problem. CoRR.
[19] M. Richardson, C. J. C. Burges, and E. Renshaw. MCTest: A challenge dataset for the open-domain machine comprehension of text.
[20] J. Schmidhuber. Deep learning in neural networks: An overview. CoRR.
[21] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
[22] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc.
[23] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR.
[24] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. CoRR.
[25] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. CoRR.
[26] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR.
[27] W. Zhang, Y. Yu, and B. Zhou. Structured memory for neural Turing machines. CoRR.

6. Appendix

6.1. Copy task

Since DNCs are the descendants of NTMs, they should have the same capabilities as NTMs. In the NTM paper, the researchers devised a copy task where, given inputs of random bit-vector sequences whose lengths varied between 2 and 20, the NTMs were tasked with outputting a copy of the input after receiving the entire input sequence and a delimiter flag indicating that the input had ended and the copy should start [12]. The trained NTMs were then given inputs of sequence length larger than the sequences on which they were trained, such as length 30, to see if the NTMs could generalize the copy algorithm; this experiment was not present in the DNC paper [1], even though the researchers used the copy task to verify memory allocation and the speed of sparse link matrices. We figured copy generalization would be an interesting experiment for the DNC. In addition, the structural simplicity of the copy task also allowed for verification of the DNC implementation on a tiny copy task where the sequence lengths were fixed to be 3. Throughout these experiments, an LSTM served as the baseline model with hidden layer size H = 100.
The DNC models had a FNN as the NN controller, as opposed to an LSTM, so that the DNC could only store the memories in its memory matrix. All models were trained using RMSProp with learning rate 1e-5 and momentum 0.9.

6.1.1. Data

We used settings from both the DNC paper [1] and the NTM paper [12], since the NTM paper had more extensive copy task experiments. As in the papers, the inputs consist of a sequence of length-6 random binary vectors, so the domain is {0, 1}^6, but since the last bit is reserved for the delimiter flag d, which tells the model to reproduce ("predict") the input sequence, the model actually receives bit vectors in {0, 1}^7. Call T the maximum sequence length; the model is then fed inputs of sequence length 2T+1, which is the same as the target output sequence length, as depicted in Fig. 7.

Figure 7: Example of T = 3 and b = 6 copy data, where x_4 is the delimiter, so x_{1:3}[1:6] is the relevant input and y_{5:7}[1:6] is the relevant output. (a) Input sequence x. (b) Target output sequence y.

Since the experiments include variable as well as fixed sequence lengths, call T_k the sequence length of input sequence k; then the input is a (2T+1)-length sequence

x^(k) = [x_1; ...; x_{T_k}; d; 0; ...; 0]

and the target output is a sequence of the same length

y^(k) = [0; ...; 0; x_1; ...; x_{T_k}; 0; ...; 0]

which is better depicted in Fig. 8. In both the fixed and variable sequence length cases, a mask vector m^(k) ∈ {0, 1}^{2T+1} such that m^(k)_t = 1{y^(k)_t is relevant} is also included, to be used in the loss and accuracy functions.

Figure 8: Example of T = 20 and T_k = 5 copy data, where x_6 is the delimiter flag. (a) Input sequence x. (b) Target output sequence y.
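
A minimal NumPy sketch of this data generation (the batch dimension is omitted and the function name is ours, not from the project code):

```python
import numpy as np

def make_copy_example(T_max, T_k, bits=6, rng=np.random):
    """Build one (input, target, mask) triple for the copy task.

    The input has bits+1 channels; the last channel is the delimiter flag.
    Sequence layout: [pattern (T_k), delimiter (1), silence], total 2*T_max + 1.
    """
    steps = 2 * T_max + 1
    x = np.zeros((steps, bits + 1))
    y = np.zeros((steps, bits))
    m = np.zeros(steps)
    pattern = rng.randint(0, 2, size=(T_k, bits))
    x[:T_k, :bits] = pattern                     # random bit vectors to copy
    x[T_k, bits] = 1.0                           # delimiter: "now reproduce the input"
    y[T_k + 1:T_k + 1 + T_k] = pattern           # target is the delayed copy
    m[T_k + 1:T_k + 1 + T_k] = 1.0               # only the copy window counts
    return x, y, m

x, y, m = make_copy_example(T_max=20, T_k=5)
print(x.shape, y.shape, int(m.sum()))            # (41, 7) (41, 6) 5
```
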

6.1.2. Loss function and metrics

We trained the model on a sigmoid cross-entropy loss where irrelevant outputs are masked, so the loss for prediction [ŷ_1; ...; ŷ_{2T+1}] and target [y_1; ...; y_{2T+1}] with mask m is

L = (1 / (1^T m)) Σ_{t=1}^{2T+1} m_t L(ŷ_t, y_t)
L(ŷ_t, y_t) = -(1/6) Σ_{i=1}^{6} [ y_t[i] log ŷ_t[i] + (1 - y_t[i]) log(1 - ŷ_t[i]) ]

The accuracy of a prediction [ŷ_1; ...; ŷ_{2T+1}] is calculated as the average number of bit matches with the target output [y_1; ...; y_{2T+1}], using the mask m to ignore the irrelevant portions:

A = (1 / (6 · 1^T m)) Σ_{t=1}^{2T+1} m_t Σ_{i=1}^{6} 1{ŷ_t[i] = y_t[i]}
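
A NumPy sketch of this masked sigmoid cross-entropy and bit-match accuracy (math only; the project used Tensorflow's built-in sigmoid cross-entropy op, and the function name here is ours):

```python
import numpy as np

def copy_loss_and_accuracy(logits, targets, mask):
    """Masked sigmoid cross-entropy and bit-match accuracy for the copy task.

    logits: (2T+1, 6) raw outputs, targets: (2T+1, 6) bits, mask: (2T+1,).
    """
    probs = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(probs + 1e-12)
           + (1 - targets) * np.log(1 - probs + 1e-12)).mean(axis=1)   # (2T+1,)
    loss = np.sum(mask * ce) / mask.sum()
    # Bits are thresholded at 0.5 before matching against the targets.
    matches = ((probs > 0.5) == targets.astype(bool)).mean(axis=1)     # (2T+1,)
    acc = np.sum(mask * matches) / mask.sum()
    return loss, acc
```
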

6.1.3. Model checks and optimization on the tiny task

Due to the expensive computational requirements of training DNCs, as a further correctness check we trained the DNC with inputs of a fixed sequence length of T = 3 rather than the full task of inputs with variable sequence lengths between 2 and 20. We used this tinier problem as a sanity check on the DNC, since it should be able to get to zero loss on this problem very quickly; the tiny task was also used for hyper-parameter tuning and for testing out Tensorboard capabilities. For the tiny copy task, the DNC settings were N = 10, W = 12, R = 1.

Figure 9: The DNC with a 2-layer FNN converges much faster than the LSTM or the 1-layer FNN DNC.

As observed from the loss plots in Fig. 9, the DNC with a 1-layer FNN controller reached zero loss by step 6k and its loss curve closely follows the LSTM baseline, while a DNC with a 2-layer FNN controller reached zero loss by step 3k. Since it is good practice to sanity-check the starting loss, observe that, since this is basically binary classification, the starting loss should be -log(0.5) ≈ 0.7, which holds for all models.

Figure 10: Examples of memory matrices throughout the training of the DNC on the fixed sequence length T = 3.

See Fig. 10 for images of the memory matrix produced with Tensorboard. Note that these memory matrices are not in any particular order; they were chosen to show that the DNC is writing to specific memory cells (i.e. rows of M) and seems to be erasing others (see the ghost rows).

Figure 11: Three examples of perfect prediction results for the fully trained DNC with fixed sequence length T = 3, so only ŷ_{5:7}[1:6] is relevant.

See Fig. 11 for examples of sample predictions and targets produced with Tensorboard. Note that only the last three columns of each prediction sequence are relevant, since the DNC was receiving an input of sequence length three followed by a delimiter for the first four time steps. Recall that since training deep learning models is equivalent to solving a non-convex, unconstrained optimization problem, another way to check convergence (granted that we are not experiencing vanishing gradients) is to check that the gradient norms converge to zero, which by the KKT conditions indicates convergence to a local optimum.

Figure 12: The gradient norms vs. iteration for all three models on the copy task with T = 3.

See Fig. 12 for the Tensorboard plot checking that the models have converged. Observe that the DNC models show convergence, but are more unstable, with gradient spikes (over 2000 for the DNC) that need to be clipped to prevent the learning process from getting side-tracked by the plateaus described in [18] [13].

6.1.4. Full copy task

For the full copy task, the DNC settings were N = 20, W = 12, R = 1. Tensorboard was used to display images of the predictions and targets throughout the DNC learning process, which are shown in Fig. 14. For the same number of iterations, the learning process took 8 hours for the DNC and half an hour for the LSTM, training on the CPU (the GPU was already in use) using RMSProp with learning rate 1e-5 and momentum 0.9.

Figure 13: DNC vs LSTM loss plots for sequence lengths in [2, 20].

Figure 14: Sampling of prediction results from the DNC throughout training on varied sequence lengths in [2, 20]. (a) Early training. (b) Mid training. (c) Late training.

Figure 15: The gradient norms vs. iteration for the 2-layer DNC trained on the variable-length copy task.

The loss plots in Fig. 13 show that the DNC learns faster and better than the LSTM, but also that the DNC learning process is very unstable, as there tend to be huge spikes in the loss plots even though the gradients have been clipped by the global norm, as observed in Fig. 15, where the norms do not surpass 5.

Figure 16: DNC vs LSTM results: loss on T_k ∈ [2, 20] and accuracy on T_k = 30 for each model.

To test how well the models generalize the copy task, they were tested on a batch of 100 input sequences where the sequence length was T_k = 30. The results in Fig. 16 show that the DNC generalizes the copy task better than the LSTM, which confirmed our hypothesis.
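
Since clipping by the global norm comes up repeatedly above, here is a short sketch of how such clipping can be wired into a TF 1.x training step (the optimizer choice and the clip threshold shown are illustrative assumptions, not the exact values used in every experiment):

```python
import tensorflow as tf

def clipped_train_op(loss, learning_rate=1e-4, momentum=0.9, clip_norm=10.0):
    """Build a train op whose gradients are clipped by their global norm."""
    optimizer = tf.train.RMSPropOptimizer(learning_rate, momentum=momentum)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm)
    # global_norm can be logged to Tensorboard to produce plots like Fig. 12/15.
    tf.summary.scalar("gradient_global_norm", global_norm)
    return optimizer.apply_gradients(zip(clipped, variables))
```
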

6.2. DNC equations and definitions from [1]

Glossary

Δ_N = {α ∈ R^N : α_i ∈ [0, 1], 1^T α ≤ 1}
S_N = {α ∈ R^N : α_i ∈ [0, 1], 1^T α = 1}

Definitions

σ(x) = 1 / (1 + e^{-x}) ∈ [0, 1]
oneplus(x) = 1 + log(1 + e^x) ∈ [1, ∞)
softmax(x)_i = e^{x_i} / Σ_j e^{x_j} ∈ [0, 1]
D(u, v) = (u · v) / (‖u‖ ‖v‖)   (cosine similarity)
C(M, k, β)[i] = exp{D(k, M[i,·]) β} / Σ_j exp{D(k, M[j,·]) β} ∈ S_N
(A ∘ B)[i, j] = A[i, j] · B[i, j]   (Hadamard product of matrices)
(x ∘ y)[i] = x[i] · y[i]   (Hadamard product of vectors)

Controller update

χ_t = [x_t; r^1_{t-1}; ...; r^R_{t-1}]
(ξ_t, ν_t) = N([χ_1; ...; χ_t]; θ)

where N(·) is the neural-network-based controller, which consists of a cell that in this project was either a FNN or an LSTM; the cell outputs h_t as a function of the input χ_t, and of h_{t-1} if it is an LSTM. If the cell is a FNN, it has the form

h_t = relu(W_χ χ_t + b_χ)

otherwise, the cell is the LSTM of the form

i_t = σ(W_i [χ_t; h_{t-1}] + b_i)
f_t = σ(W_f [χ_t; h_{t-1}] + b_f)
s_t = f_t ∘ s_{t-1} + i_t ∘ tanh(W_s [χ_t; h_{t-1}] + b_s)
o_t = σ(W_o [χ_t; h_{t-1}] + b_o)
h_t = o_t ∘ tanh(s_t)

A linear operation is used to get (ξ_t, ν_t):

[ξ_t; ν_t] = W_h h_t

Interface (ξ_t) unpacking

Split the vector ξ_t ∈ R^{(W·R)+3W+5R+3} into the following components, then use the utility functions above to preprocess some of the components:

ξ_t = [k^{r,1}_t; ...; k^{r,R}_t; β̂^{r,1}_t; ...; β̂^{r,R}_t; k^w_t; β̂^w_t; ê_t; v_t; f̂^1_t; ...; f̂^R_t; ĝ^a_t; ĝ^w_t; π̂^1_t; ...; π̂^R_t]

read keys k^{r,i}_t ∈ R^W
read strengths β^{r,i}_t = oneplus(β̂^{r,i}_t) ∈ R
write key k^w_t ∈ R^W
write strength β^w_t = oneplus(β̂^w_t) ∈ R
erase vector e_t = σ(ê_t) ∈ R^W
write vector v_t ∈ R^W
free gates f^i_t = σ(f̂^i_t) ∈ R
allocation gate g^a_t = σ(ĝ^a_t) ∈ R
write gate g^w_t = σ(ĝ^w_t) ∈ R
read modes π^i_t = softmax(π̂^i_t) ∈ R^3

Initial conditions

u_0 = 0;  p_0 = 0;  L_0 = 0;  L_t[i, i] = 0 ∀ i
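
A NumPy sketch of this unpacking, assuming the component ordering listed above (the helper names are ours):

```python
import numpy as np

def oneplus(x):
    return 1.0 + np.log1p(np.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unpack_interface(xi, W, R):
    """Split the interface vector xi (length W*R + 3W + 5R + 3) into its parts."""
    def take(n):
        nonlocal xi
        out, xi = xi[:n], xi[n:]
        return out
    read_keys = take(W * R).reshape(R, W)
    read_strengths = oneplus(take(R))
    write_key = take(W)
    write_strength = oneplus(take(1))
    erase = sigmoid(take(W))
    write_vec = take(W)
    free_gates = sigmoid(take(R))
    alloc_gate = sigmoid(take(1))
    write_gate = sigmoid(take(1))
    read_modes = take(3 * R).reshape(R, 3)
    read_modes = np.exp(read_modes) / np.exp(read_modes).sum(axis=1, keepdims=True)
    assert xi.size == 0                      # every element was consumed
    return (read_keys, read_strengths, write_key, write_strength, erase,
            write_vec, free_gates, alloc_gate, write_gate, read_modes)

# Example with the project's bAbI settings W = 64, R = 4.
parts = unpack_interface(np.random.randn(64 * 4 + 3 * 64 + 5 * 4 + 3), W=64, R=4)
```
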

Memory updates

The original paper's equations:

ψ_t = ∏_{i=1}^R (1 - f^i_t w^{r,i}_{t-1})
u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t
φ_t = SortIndicesAscending(u_t)
a_t[φ_t[j]] = (1 - u_t[φ_t[j]]) ∏_{l=1}^{j-1} u_t[φ_t[l]]
c^w_t = C(M_{t-1}, k^w_t, β^w_t)
w^w_t = g^w_t [g^a_t a_t + (1 - g^a_t) c^w_t]
M_t = M_{t-1} ∘ (E - w^w_t e^T_t) + w^w_t v^T_t
p_t = (1 - Σ_{i=1}^N w^w_t[i]) p_{t-1} + w^w_t
L_t[i, j] = (1 - w^w_t[i] - w^w_t[j]) L_{t-1}[i, j] + w^w_t[i] p_{t-1}[j]
f^i_t = L_t w^{r,i}_{t-1}
b^i_t = L^T_t w^{r,i}_{t-1}
c^{r,i}_t = C(M_t, k^{r,i}_t, β^{r,i}_t)
w^{r,i}_t = π^i_t[1] b^i_t + π^i_t[2] c^{r,i}_t + π^i_t[3] f^i_t
r^i_t = M^T_t w^{r,i}_t

Output

The final output is a linear combination of the vector ν_t from the controller and the concatenated read vectors r^1_t, ..., r^R_t:

y_t = W_r [r^1_t; ...; r^R_t] + ν_t

6.3. DNC mathematical exercise

For most of the DNC equations, why the authors of the DNC paper [1] formulated the expressions the way they did was intuitive, but some of the equations were not quite so. In building our understanding of the DNC equations, we went through some proofs to ensure we understood the "why" behind the mathematical expressions, particularly for the usage vector equation, which, through this exercise, we think was formulated to ensure u_t ∈ [0, 1]^N.

To see that u_t = (u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1})) ∘ ψ_t ∈ [0, 1]^N, observe that a + b - ab ∈ [0, 1] if a, b ∈ [0, 1]:

a + b - ab = (1 - b)a + b = (1 - b)a - (1 - b) + 1 = (1 - b)(a - 1) + 1

Since a, b ∈ [0, 1], we have -1 ≤ a - 1 ≤ 0 and 0 ≤ 1 - b ≤ 1, so -1 ≤ (1 - b)(a - 1) ≤ 0, which implies 0 ≤ (1 - b)(a - 1) + 1 ≤ 1. Applying this element-wise, u_{t-1} + w^w_{t-1} - (u_{t-1} ∘ w^w_{t-1}) ∈ [0, 1]^N, and since ψ_t ∈ [0, 1]^N, we must have u_t ∈ [0, 1]^N.

6.4. bAbI experiments, supplementary info

Figure 17: bAbI task statistics (vocab_size, maxlenX and max#Q per task) for tasks qa1_single-supporting-fact, qa2_two-supporting-facts, qa3_three-supporting-facts, qa4_two-arg-relations, qa5_three-arg-relations, qa6_yes-no-questions, qa7_counting, qa8_lists-sets, qa9_simple-negation, qa10_indefinite-knowledge, qa11_basic-coreference, qa12_conjunction, qa13_compound-coreference, qa14_time-reasoning, qa15_basic-deduction, qa16_basic-induction, qa17_positional-reasoning, qa18_size-reasoning, qa19_path-finding and qa20_agents-motivations.

Figure 18: Loss plots showing the LSTM overfitting Task 1 (dev accuracy and dev loss shown in the plots).

Figure 19: DNC overfitting in the single-task experiments: train loss, dev loss, train accuracy and dev accuracy for Task 1 and Task 15.

Figure 20: Joint-task models: training vs. dev set loss and accuracy plots. (a) DNC with the settings from [1]. (b) DNC using Adam: notice the over-fitting. (c) LSTM.

Dropout experiment (after the joint models)

While writing the conclusion, we had left-over GPU time, so, due to the over-fitting on the single tasks, we also ran a dropout experiment to understand what we would have done with more time; the dropout experiment was only run on a single task due to computational and temporal limitations. Our hypothesis was that introducing regularization would help the model generalize better to unseen data. The dropout experiment was only executed on Task 1, and the results, shown in Fig. 21, were somewhat promising: as in the experiment without dropout, the accuracy on the dev set starts flat-lining at about 0.5, but the difference between the dev and train loss plots was less drastic. However, the train loss for the DNC with dropout was higher than for the DNC alone at all iterations, and the DNC with dropout took more than twice as long to train, only to reach a higher loss than the DNC without dropout.

Figure 21: DNC trained on Task 1 with dropout.
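
The write-up above does not say exactly where dropout was inserted; purely as an illustration, one common choice is to drop units of the controller cell output before the final interface/output projection, e.g. in TF 1.x (the function and variable names are ours):

```python
import tensorflow as tf

def controller_output_with_dropout(h_t, W_h, keep_prob):
    """Apply dropout to the controller cell output before the final projection.

    h_t: [batch, H] cell output, W_h: [H, xi_size + Y] projection matrix,
    keep_prob: scalar tensor or float (use 1.0 at evaluation time).
    """
    h_dropped = tf.nn.dropout(h_t, keep_prob=keep_prob)
    return tf.matmul(h_dropped, W_h)
```
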


More information

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates

Biol. 356 Lab 8. Mortality, Recruitment, and Migration Rates Biol. 356 Lab 8. Moraliy, Recruimen, and Migraion Raes (modified from Cox, 00, General Ecology Lab Manual, McGraw Hill) Las week we esimaed populaion size hrough several mehods. One assumpion of all hese

More information

Solutions from Chapter 9.1 and 9.2

Solutions from Chapter 9.1 and 9.2 Soluions from Chaper 9 and 92 Secion 9 Problem # This basically boils down o an exercise in he chain rule from calculus We are looking for soluions of he form: u( x) = f( k x c) where k x R 3 and k is

More information

Random Walk with Anti-Correlated Steps

Random Walk with Anti-Correlated Steps Random Walk wih Ani-Correlaed Seps John Noga Dirk Wagner 2 Absrac We conjecure he expeced value of random walks wih ani-correlaed seps o be exacly. We suppor his conjecure wih 2 plausibiliy argumens and

More information

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle

Physics 235 Chapter 2. Chapter 2 Newtonian Mechanics Single Particle Chaper 2 Newonian Mechanics Single Paricle In his Chaper we will review wha Newon s laws of mechanics ell us abou he moion of a single paricle. Newon s laws are only valid in suiable reference frames,

More information

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing

Application of a Stochastic-Fuzzy Approach to Modeling Optimal Discrete Time Dynamical Systems by Using Large Scale Data Processing Applicaion of a Sochasic-Fuzzy Approach o Modeling Opimal Discree Time Dynamical Sysems by Using Large Scale Daa Processing AA WALASZE-BABISZEWSA Deparmen of Compuer Engineering Opole Universiy of Technology

More information

Notes on Kalman Filtering

Notes on Kalman Filtering Noes on Kalman Filering Brian Borchers and Rick Aser November 7, Inroducion Daa Assimilaion is he problem of merging model predicions wih acual measuremens of a sysem o produce an opimal esimae of he curren

More information

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems.

Math Week 14 April 16-20: sections first order systems of linear differential equations; 7.4 mass-spring systems. Mah 2250-004 Week 4 April 6-20 secions 7.-7.3 firs order sysems of linear differenial equaions; 7.4 mass-spring sysems. Mon Apr 6 7.-7.2 Sysems of differenial equaions (7.), and he vecor Calculus we need

More information

Single and Double Pendulum Models

Single and Double Pendulum Models Single and Double Pendulum Models Mah 596 Projec Summary Spring 2016 Jarod Har 1 Overview Differen ypes of pendulums are used o model many phenomena in various disciplines. In paricular, single and double

More information

2. Nonlinear Conservation Law Equations

2. Nonlinear Conservation Law Equations . Nonlinear Conservaion Law Equaions One of he clear lessons learned over recen years in sudying nonlinear parial differenial equaions is ha i is generally no wise o ry o aack a general class of nonlinear

More information

Topic Astable Circuits. Recall that an astable circuit has two unstable states;

Topic Astable Circuits. Recall that an astable circuit has two unstable states; Topic 2.2. Asable Circuis. Learning Objecives: A he end o his opic you will be able o; Recall ha an asable circui has wo unsable saes; Explain he operaion o a circui based on a Schmi inverer, and esimae

More information

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions

Inventory Analysis and Management. Multi-Period Stochastic Models: Optimality of (s, S) Policy for K-Convex Objective Functions Muli-Period Sochasic Models: Opimali of (s, S) Polic for -Convex Objecive Funcions Consider a seing similar o he N-sage newsvendor problem excep ha now here is a fixed re-ordering cos (> 0) for each (re-)order.

More information

Math From Scratch Lesson 34: Isolating Variables

Math From Scratch Lesson 34: Isolating Variables Mah From Scrach Lesson 34: Isolaing Variables W. Blaine Dowler July 25, 2013 Conens 1 Order of Operaions 1 1.1 Muliplicaion and Addiion..................... 1 1.2 Division and Subracion.......................

More information

More Digital Logic. t p output. Low-to-high and high-to-low transitions could have different t p. V in (t)

More Digital Logic. t p output. Low-to-high and high-to-low transitions could have different t p. V in (t) EECS 4 Spring 23 Lecure 2 EECS 4 Spring 23 Lecure 2 More igial Logic Gae delay and signal propagaion Clocked circui elemens (flip-flop) Wriing a word o memory Simplifying digial circuis: Karnaugh maps

More information

3.1 More on model selection

3.1 More on model selection 3. More on Model selecion 3. Comparing models AIC, BIC, Adjused R squared. 3. Over Fiing problem. 3.3 Sample spliing. 3. More on model selecion crieria Ofen afer model fiing you are lef wih a handful of

More information

KINEMATICS IN ONE DIMENSION

KINEMATICS IN ONE DIMENSION KINEMATICS IN ONE DIMENSION PREVIEW Kinemaics is he sudy of how hings move how far (disance and displacemen), how fas (speed and velociy), and how fas ha how fas changes (acceleraion). We say ha an objec

More information

Experiments on logistic regression

Experiments on logistic regression Experimens on logisic regression Ning Bao March, 8 Absrac In his repor, several experimens have been conduced on a spam daa se wih Logisic Regression based on Gradien Descen approach. Firs, he overfiing

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits

EECE251. Circuit Analysis I. Set 4: Capacitors, Inductors, and First-Order Linear Circuits EEE25 ircui Analysis I Se 4: apaciors, Inducors, and Firs-Order inear ircuis Shahriar Mirabbasi Deparmen of Elecrical and ompuer Engineering Universiy of Briish olumbia shahriar@ece.ubc.ca Overview Passive

More information

Object tracking: Using HMMs to estimate the geographical location of fish

Object tracking: Using HMMs to estimate the geographical location of fish Objec racking: Using HMMs o esimae he geographical locaion of fish 02433 - Hidden Markov Models Marin Wæver Pedersen, Henrik Madsen Course week 13 MWP, compiled June 8, 2011 Objecive: Locae fish from agging

More information

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

INTRODUCTION TO MACHINE LEARNING 3RD EDITION ETHEM ALPAYDIN The MIT Press, 2014 Lecure Slides for INTRODUCTION TO MACHINE LEARNING 3RD EDITION alpaydin@boun.edu.r hp://www.cmpe.boun.edu.r/~ehem/i2ml3e CHAPTER 2: SUPERVISED LEARNING Learning a Class

More information

From Complex Fourier Series to Fourier Transforms

From Complex Fourier Series to Fourier Transforms Topic From Complex Fourier Series o Fourier Transforms. Inroducion In he previous lecure you saw ha complex Fourier Series and is coeciens were dened by as f ( = n= C ne in! where C n = T T = T = f (e

More information

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality

Matrix Versions of Some Refinements of the Arithmetic-Geometric Mean Inequality Marix Versions of Some Refinemens of he Arihmeic-Geomeric Mean Inequaliy Bao Qi Feng and Andrew Tonge Absrac. We esablish marix versions of refinemens due o Alzer ], Carwrigh and Field 4], and Mercer 5]

More information

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon

3.1.3 INTRODUCTION TO DYNAMIC OPTIMIZATION: DISCRETE TIME PROBLEMS. A. The Hamiltonian and First-Order Conditions in a Finite Time Horizon 3..3 INRODUCION O DYNAMIC OPIMIZAION: DISCREE IME PROBLEMS A. he Hamilonian and Firs-Order Condiions in a Finie ime Horizon Define a new funcion, he Hamilonian funcion, H. H he change in he oal value of

More information

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course

Overview. COMP14112: Artificial Intelligence Fundamentals. Lecture 0 Very Brief Overview. Structure of this course OMP: Arificial Inelligence Fundamenals Lecure 0 Very Brief Overview Lecurer: Email: Xiao-Jun Zeng x.zeng@mancheser.ac.uk Overview This course will focus mainly on probabilisic mehods in AI We shall presen

More information

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models.

Technical Report Doc ID: TR March-2013 (Last revision: 23-February-2016) On formulating quadratic functions in optimization models. Technical Repor Doc ID: TR--203 06-March-203 (Las revision: 23-Februar-206) On formulaing quadraic funcions in opimizaion models. Auhor: Erling D. Andersen Convex quadraic consrains quie frequenl appear

More information

Simplified Gating in Long Short-term Memory (LSTM) Recurrent Neural Networks

Simplified Gating in Long Short-term Memory (LSTM) Recurrent Neural Networks Simplified Gaing in Long Shor-erm Memory (LSTM) Recurren Neural Neworks Yuzhen Lu and Fahi M. Salem Circuis, Sysems, and Neural Neworks (CSANN) Lab Deparmen of Biosysems and Agriculural Engineering Deparmen

More information

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes

2.7. Some common engineering functions. Introduction. Prerequisites. Learning Outcomes Some common engineering funcions 2.7 Inroducion This secion provides a caalogue of some common funcions ofen used in Science and Engineering. These include polynomials, raional funcions, he modulus funcion

More information

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model

Modal identification of structures from roving input data by means of maximum likelihood estimation of the state space model Modal idenificaion of srucures from roving inpu daa by means of maximum likelihood esimaion of he sae space model J. Cara, J. Juan, E. Alarcón Absrac The usual way o perform a forced vibraion es is o fix

More information

Math 10B: Mock Mid II. April 13, 2016

Math 10B: Mock Mid II. April 13, 2016 Name: Soluions Mah 10B: Mock Mid II April 13, 016 1. ( poins) Sae, wih jusificaion, wheher he following saemens are rue or false. (a) If a 3 3 marix A saisfies A 3 A = 0, hen i canno be inverible. True.

More information

An introduction to the theory of SDDP algorithm

An introduction to the theory of SDDP algorithm An inroducion o he heory of SDDP algorihm V. Leclère (ENPC) Augus 1, 2014 V. Leclère Inroducion o SDDP Augus 1, 2014 1 / 21 Inroducion Large scale sochasic problem are hard o solve. Two ways of aacking

More information

Seminar 4: Hotelling 2

Seminar 4: Hotelling 2 Seminar 4: Hoelling 2 November 3, 211 1 Exercise Par 1 Iso-elasic demand A non renewable resource of a known sock S can be exraced a zero cos. Demand for he resource is of he form: D(p ) = p ε ε > A a

More information

Removing Useless Productions of a Context Free Grammar through Petri Net

Removing Useless Productions of a Context Free Grammar through Petri Net Journal of Compuer Science 3 (7): 494-498, 2007 ISSN 1549-3636 2007 Science Publicaions Removing Useless Producions of a Conex Free Grammar hrough Peri Ne Mansoor Al-A'ali and Ali A Khan Deparmen of Compuer

More information

Distributed Deep Learning Parallel Sparse Autoencoder. 2 Serial Sparse Autoencoder. 1 Introduction. 2.1 Stochastic Gradient Descent

Distributed Deep Learning Parallel Sparse Autoencoder. 2 Serial Sparse Autoencoder. 1 Introduction. 2.1 Stochastic Gradient Descent Disribued Deep Learning Parallel Sparse Auoencoder Inroducion Abhik Lahiri Raghav Pasari Bobby Prochnow December 0, 00 Much of he bleeding edge research in he areas of compuer vision, naural language processing,

More information

Lecture 9: September 25

Lecture 9: September 25 0-725: Opimizaion Fall 202 Lecure 9: Sepember 25 Lecurer: Geoff Gordon/Ryan Tibshirani Scribes: Xuezhi Wang, Subhodeep Moira, Abhimanu Kumar Noe: LaTeX emplae couresy of UC Berkeley EECS dep. Disclaimer:

More information

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9)

CSE/NB 528 Lecture 14: Reinforcement Learning (Chapter 9) CSE/NB 528 Lecure 14: Reinforcemen Learning Chaper 9 Image from hp://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg Lecure figures are from Dayan & Abbo s book hp://people.brandeis.edu/~abbo/book/index.hml

More information

!!"#"$%&#'()!"#&'(*%)+,&',-)./0)1-*23)

!!#$%&#'()!#&'(*%)+,&',-)./0)1-*23) "#"$%&#'()"#&'(*%)+,&',-)./)1-*) #$%&'()*+,&',-.%,/)*+,-&1*#$)()5*6$+$%*,7&*-'-&1*(,-&*6&,7.$%$+*&%'(*8$&',-,%'-&1*(,-&*6&,79*(&,%: ;..,*&1$&$.$%&'()*1$$.,'&',-9*(&,%)?%*,('&5

More information

SOLUTIONS TO ECE 3084

SOLUTIONS TO ECE 3084 SOLUTIONS TO ECE 384 PROBLEM 2.. For each sysem below, specify wheher or no i is: (i) memoryless; (ii) causal; (iii) inverible; (iv) linear; (v) ime invarian; Explain your reasoning. If he propery is no

More information

Linear Time-invariant systems, Convolution, and Cross-correlation

Linear Time-invariant systems, Convolution, and Cross-correlation Linear Time-invarian sysems, Convoluion, and Cross-correlaion (1) Linear Time-invarian (LTI) sysem A sysem akes in an inpu funcion and reurns an oupu funcion. x() T y() Inpu Sysem Oupu y() = T[x()] An

More information

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1

RL Lecture 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 1 RL Lecure 7: Eligibiliy Traces R. S. Suon and A. G. Baro: Reinforcemen Learning: An Inroducion 1 N-sep TD Predicion Idea: Look farher ino he fuure when you do TD backup (1, 2, 3,, n seps) R. S. Suon and

More information

Spring Ammar Abu-Hudrouss Islamic University Gaza

Spring Ammar Abu-Hudrouss Islamic University Gaza Chaper 7 Reed-Solomon Code Spring 9 Ammar Abu-Hudrouss Islamic Universiy Gaza ١ Inroducion A Reed Solomon code is a special case of a BCH code in which he lengh of he code is one less han he size of he

More information

Presentation Overview

Presentation Overview Acion Refinemen in Reinforcemen Learning by Probabiliy Smoohing By Thomas G. Dieerich & Didac Busques Speaer: Kai Xu Presenaion Overview Bacground The Probabiliy Smoohing Mehod Experimenal Sudy of Acion

More information

Lecture 33: November 29

Lecture 33: November 29 36-705: Inermediae Saisics Fall 2017 Lecurer: Siva Balakrishnan Lecure 33: November 29 Today we will coninue discussing he boosrap, and hen ry o undersand why i works in a simple case. In he las lecure

More information

Article from. Predictive Analytics and Futurism. July 2016 Issue 13

Article from. Predictive Analytics and Futurism. July 2016 Issue 13 Aricle from Predicive Analyics and Fuurism July 6 Issue An Inroducion o Incremenal Learning By Qiang Wu and Dave Snell Machine learning provides useful ools for predicive analyics The ypical machine learning

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Noes for EE7C Spring 018: Convex Opimizaion and Approximaion Insrucor: Moriz Hard Email: hard+ee7c@berkeley.edu Graduae Insrucor: Max Simchowiz Email: msimchow+ee7c@berkeley.edu Ocober 15, 018 3

More information

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence

Written HW 9 Sol. CS 188 Fall Introduction to Artificial Intelligence CS 188 Fall 2018 Inroducion o Arificial Inelligence Wrien HW 9 Sol. Self-assessmen due: Tuesday 11/13/2018 a 11:59pm (submi via Gradescope) For he self assessmen, fill in he self assessmen boxes in your

More information

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé

Bias in Conditional and Unconditional Fixed Effects Logit Estimation: a Correction * Tom Coupé Bias in Condiional and Uncondiional Fixed Effecs Logi Esimaion: a Correcion * Tom Coupé Economics Educaion and Research Consorium, Naional Universiy of Kyiv Mohyla Academy Address: Vul Voloska 10, 04070

More information

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important

Non-parametric techniques. Instance Based Learning. NN Decision Boundaries. Nearest Neighbor Algorithm. Distance metric important on-parameric echniques Insance Based Learning AKA: neares neighbor mehods, non-parameric, lazy, memorybased, or case-based learning Copyrigh 2005 by David Helmbold 1 Do no fi a model (as do LDA, logisic

More information

Lab #2: Kinematics in 1-Dimension

Lab #2: Kinematics in 1-Dimension Reading Assignmen: Chaper 2, Secions 2-1 hrough 2-8 Lab #2: Kinemaics in 1-Dimension Inroducion: The sudy of moion is broken ino wo main areas of sudy kinemaics and dynamics. Kinemaics is he descripion

More information

The expectation value of the field operator.

The expectation value of the field operator. The expecaion value of he field operaor. Dan Solomon Universiy of Illinois Chicago, IL dsolom@uic.edu June, 04 Absrac. Much of he mahemaical developmen of quanum field heory has been in suppor of deermining

More information

Self assessment due: Monday 4/29/2019 at 11:59pm (submit via Gradescope)

Self assessment due: Monday 4/29/2019 at 11:59pm (submit via Gradescope) CS 188 Spring 2019 Inroducion o Arificial Inelligence Wrien HW 10 Due: Monday 4/22/2019 a 11:59pm (submi via Gradescope). Leave self assessmen boxes blank for his due dae. Self assessmen due: Monday 4/29/2019

More information

Notes on online convex optimization

Notes on online convex optimization Noes on online convex opimizaion Karl Sraos Online convex opimizaion (OCO) is a principled framework for online learning: OnlineConvexOpimizaion Inpu: convex se S, number of seps T For =, 2,..., T : Selec

More information

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t

R t. C t P t. + u t. C t = αp t + βr t + v t. + β + w t Exercise 7 C P = α + β R P + u C = αp + βr + v (a) (b) C R = α P R + β + w (c) Assumpions abou he disurbances u, v, w : Classical assumions on he disurbance of one of he equaions, eg. on (b): E(v v s P,

More information

5. Stochastic processes (1)

5. Stochastic processes (1) Lec05.pp S-38.45 - Inroducion o Teleraffic Theory Spring 2005 Conens Basic conceps Poisson process 2 Sochasic processes () Consider some quaniy in a eleraffic (or any) sysem I ypically evolves in ime randomly

More information

20. Applications of the Genetic-Drift Model

20. Applications of the Genetic-Drift Model 0. Applicaions of he Geneic-Drif Model 1) Deermining he probabiliy of forming any paricular combinaion of genoypes in he nex generaion: Example: If he parenal allele frequencies are p 0 = 0.35 and q 0

More information

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature

On Measuring Pro-Poor Growth. 1. On Various Ways of Measuring Pro-Poor Growth: A Short Review of the Literature On Measuring Pro-Poor Growh 1. On Various Ways of Measuring Pro-Poor Growh: A Shor eview of he Lieraure During he pas en years or so here have been various suggesions concerning he way one should check

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

Chapter 6. Systems of First Order Linear Differential Equations

Chapter 6. Systems of First Order Linear Differential Equations Chaper 6 Sysems of Firs Order Linear Differenial Equaions We will only discuss firs order sysems However higher order sysems may be made ino firs order sysems by a rick shown below We will have a sligh

More information

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017

Two Popular Bayesian Estimators: Particle and Kalman Filters. McGill COMP 765 Sept 14 th, 2017 Two Popular Bayesian Esimaors: Paricle and Kalman Filers McGill COMP 765 Sep 14 h, 2017 1 1 1, dx x Bel x u x P x z P Recall: Bayes Filers,,,,,,, 1 1 1 1 u z u x P u z u x z P Bayes z = observaion u =

More information

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients

Section 3.5 Nonhomogeneous Equations; Method of Undetermined Coefficients Secion 3.5 Nonhomogeneous Equaions; Mehod of Undeermined Coefficiens Key Terms/Ideas: Linear Differenial operaor Nonlinear operaor Second order homogeneous DE Second order nonhomogeneous DE Soluion o homogeneous

More information

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts)

HW6: MRI Imaging Pulse Sequences (7 Problems for 100 pts) HW6: MRI Imaging Pulse Sequences (7 Problems for 100 ps) GOAL The overall goal of HW6 is o beer undersand pulse sequences for MRI image reconsrucion. OBJECTIVES 1) Design a spin echo pulse sequence o image

More information

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear

The Rosenblatt s LMS algorithm for Perceptron (1958) is built around a linear neuron (a neuron with a linear In The name of God Lecure4: Percepron and AALIE r. Majid MjidGhoshunih Inroducion The Rosenbla s LMS algorihm for Percepron 958 is buil around a linear neuron a neuron ih a linear acivaion funcion. Hoever,

More information

Distributed Language Models Using RNNs

Distributed Language Models Using RNNs Disribued Language Models Using RNNs Ting-Po Lee ingpo@sanford.edu Taman Narayan amann@sanford.edu 1 Inroducion Language models are a fundamenal par of naural language processing. Given he prior words

More information

Energy Storage Benchmark Problems

Energy Storage Benchmark Problems Energy Sorage Benchmark Problems Daniel F. Salas 1,3, Warren B. Powell 2,3 1 Deparmen of Chemical & Biological Engineering 2 Deparmen of Operaions Research & Financial Engineering 3 Princeon Laboraory

More information