CS224n: Natural Language Processing with Deep Learning
Lecture Notes, Part V: Language Models, RNN, GRU and LSTM
Winter 2019


Course Instructors: Christopher Manning, Richard Socher
Authors: Milad Mohammadi, Rohit Mundra, Richard Socher, Lisa Wang, Amita Kamath
Keyphrases: Language Models. RNN. Bi-directional RNN. Deep RNN. GRU. LSTM.

1 Language Models

1.1 Introduction

Language models compute the probability of occurrence of a number of words in a particular sequence. The probability of a sequence of m words {w_1, ..., w_m} is denoted as P(w_1, ..., w_m). Since the number of words coming before a word w_i varies depending on its location in the input document, P(w_1, ..., w_m) is usually conditioned on a window of n previous words rather than all previous words:

P(w_1, ..., w_m) = ∏_{i=1}^{m} P(w_i | w_1, ..., w_{i-1}) ≈ ∏_{i=1}^{m} P(w_i | w_{i-n}, ..., w_{i-1})    (1)

Equation 1 is especially useful for speech and translation systems when determining whether a word sequence is an accurate translation of an input sentence. In existing language translation systems, for each phrase/sentence translation, the software generates a number of alternative word sequences (e.g. {I have, I had, I has, me have, me had}) and scores them to identify the most likely translation sequence.

In machine translation, the model chooses the best word ordering for an input phrase by assigning a goodness score to each output word sequence alternative. To do so, the model may choose between different word ordering or word choice alternatives. It achieves this objective by running all word sequence candidates through a probability function that assigns each a score. The sequence with the highest score is the output of the translation. For example, the machine would give a higher score to "the cat is small" compared to "small the is cat", and a higher score to "walking home after school" compared to "walking house after school".

1.2 n-gram Language Models

To compute the probabilities mentioned above, the count of each n-gram can be compared against the frequency of each word.

This is called an n-gram Language Model. For instance, if the model takes bi-grams, the frequency of each bi-gram, calculated by combining a word with its previous word, would be divided by the frequency of the corresponding uni-gram. Equations 2 and 3 show this relationship for bigram and trigram models.

p(w_2 | w_1) = count(w_1, w_2) / count(w_1)    (2)

p(w_3 | w_1, w_2) = count(w_1, w_2, w_3) / count(w_1, w_2)    (3)

The relationship in Equation 3 focuses on making predictions based on a fixed window of context (i.e. the n previous words) used to predict the next word. But how long should the context be? In some cases, the window of the past n consecutive words may not be sufficient to capture the context. For instance, consider the sentence "As the proctor started the clock, the students opened their ___". If the window only conditions on the previous three words "the students opened their", the probabilities calculated based on the corpus may suggest that the next word be "books"; however, if n had been large enough to include the "proctor" context, the probability might have suggested "exam".

This leads us to two main issues with n-gram Language Models: Sparsity and Storage.

1. Sparsity problems with n-gram Language Models. Sparsity problems with these models arise due to two issues. Firstly, note the numerator of Equation 3. If w_1, w_2 and w_3 never appear together in the corpus, the probability of w_3 is 0. To solve this, a small δ could be added to the count for each word in the vocabulary. This is called smoothing. Secondly, consider the denominator of Equation 3. If w_1 and w_2 never occurred together in the corpus, then no probability can be calculated for w_3. To solve this, we could condition on w_2 alone. This is called backoff. Increasing n makes sparsity problems worse; typically, n ≤ 5.

2. Storage problems with n-gram Language Models. We know that we need to store the counts for all n-grams we saw in the corpus. As n increases (or the corpus size increases), the model size increases as well.
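As a concrete illustration of Equation 2, here is a minimal Python sketch (not from the notes; the toy corpus, function name, and the add-δ smoothing argument are made up for illustration) that estimates bigram probabilities from raw counts:

    from collections import Counter

    def bigram_prob(corpus_tokens, delta=0.0):
        """Estimate p(w2 | w1) = count(w1, w2) / count(w1), with optional add-delta smoothing."""
        unigrams = Counter(corpus_tokens)
        bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
        V = len(unigrams)

        def p(w2, w1):
            # add-delta smoothing avoids zero probabilities for unseen bigrams
            return (bigrams[(w1, w2)] + delta) / (unigrams[w1] + delta * V)
        return p

    tokens = "the cat is small the cat is black".split()
    p = bigram_prob(tokens, delta=0.1)
    print(p("is", "cat"))    # relatively high: "cat is" occurs twice
    print(p("small", "is"))  # lower: "is small" occurs once out of two "is"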

1.3 Window-based Neural Language Model

The "curse of dimensionality" above was first tackled by Bengio et al. in A Neural Probabilistic Language Model, which introduced the first large-scale deep learning model for natural language processing. This model learns a distributed representation of words, along with the probability function for word sequences expressed in terms of these representations. Figure 1 shows the corresponding neural network architecture. The input word vectors are used by both the hidden layer and the output layer. Equation 4 represents Figure 1 and shows the parameters of the softmax() function, consisting of the standard tanh() function (i.e. the hidden layer) as well as the linear function, W^{(3)} x + b^{(3)}, that captures all the previous n input word vectors.

ŷ = softmax(W^{(2)} tanh(W^{(1)} x + b^{(1)}) + W^{(3)} x + b^{(3)})    (4)

Note that the weight matrix W^{(1)} is applied to the word vectors (solid green arrows in Figure 1), W^{(2)} is applied to the hidden layer (also a solid green arrow), and W^{(3)} is applied to the word vectors (dashed green arrows).

Figure 1: The first deep neural network architecture model for NLP presented by Bengio et al.

A simplified version of this model can be seen in Figure 2, where the blue layer signifies the concatenated word embeddings for the input words, e = [e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}], the red layer signifies the hidden layer, h = f(We + b_1), and the green output distribution is a softmax over the vocabulary, ŷ = softmax(Uh + b_2).

Figure 2: A simplified representation of Figure 1.
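The simplified model of Figure 2 amounts to concatenated embeddings, one tanh hidden layer, and a softmax over the vocabulary. The NumPy sketch below (dimensions, random initialization, and the example word indices are assumptions for illustration, not the notes' code) shows a single forward pass, ŷ = softmax(U f(We + b_1) + b_2):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Hypothetical sizes: embedding dim d, window size n, hidden dim D_h, vocab size V.
    d, n, D_h, V = 50, 4, 100, 10000
    rng = np.random.default_rng(0)
    E = rng.normal(scale=0.1, size=(V, d))      # embedding matrix
    W = rng.normal(scale=0.1, size=(D_h, n * d))
    b1 = np.zeros(D_h)
    U = rng.normal(scale=0.1, size=(V, D_h))
    b2 = np.zeros(V)

    window = [12, 7, 403, 8]                    # indices of the previous n words (made up)
    e = np.concatenate([E[i] for i in window])  # e = [e(1); e(2); e(3); e(4)]
    h = np.tanh(W @ e + b1)                     # h = f(We + b1)
    y_hat = softmax(U @ h + b2)                 # distribution over the next word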

2 Recurrent Neural Networks (RNN)

Unlike the conventional translation models, where only a finite window of previous words would be considered for conditioning the language model, Recurrent Neural Networks (RNNs) are capable of conditioning the model on all previous words in the corpus.

Figure 3 introduces the RNN architecture, where each vertical rectangular box is a hidden layer at a time-step t. Each such layer holds a number of neurons, each of which performs a linear matrix operation on its inputs followed by a non-linear operation (e.g. tanh()). At each time-step, there are two inputs to the hidden layer: the output of the previous layer h_{t-1}, and the input at that time-step x_t. The former input is multiplied by a weight matrix W^{(hh)} and the latter by a weight matrix W^{(hx)} to produce output features h_t, which are multiplied with a weight matrix W^{(S)} and run through a softmax over the vocabulary to obtain a prediction output ŷ_t of the next word (Equations 5 and 6). The inputs and outputs of each single neuron are illustrated in Figure 4.

Figure 3: A Recurrent Neural Network (RNN). Three time-steps are shown.

h_t = σ(W^{(hh)} h_{t-1} + W^{(hx)} x_t)    (5)

ŷ_t = softmax(W^{(S)} h_t)    (6)

What is interesting here is that the same weights W^{(hh)} and W^{(hx)} are applied repeatedly at each time-step. Thus, the number of parameters the model has to learn is smaller and, most importantly, is independent of the length of the input sequence, thus defeating the curse of dimensionality!

Below are the details associated with each parameter in the network:

- x_1, ..., x_{t-1}, x_t, x_{t+1}, ..., x_T: the word vectors corresponding to a corpus with T words.
- h_t = σ(W^{(hh)} h_{t-1} + W^{(hx)} x_t): the relationship used to compute the hidden layer output features at each time-step t, where:
  - x_t ∈ R^d: the input word vector at time t.
  - W^{(hx)} ∈ R^{D_h × d}: the weight matrix used to condition the input word vector, x_t.
  - W^{(hh)} ∈ R^{D_h × D_h}: the weight matrix used to condition the output of the previous time-step, h_{t-1}.
  - h_{t-1} ∈ R^{D_h}: the output of the non-linear function at the previous time-step, t-1. h_0 ∈ R^{D_h} is an initialization vector for the hidden layer at time-step t = 0.
  - σ(): the non-linearity function (sigmoid here).
- ŷ_t = softmax(W^{(S)} h_t): the output probability distribution over the vocabulary at each time-step t. Essentially, ŷ_t is the next predicted word given the document context score so far (i.e. h_{t-1}) and the last observed word vector x^{(t)}. Here, W^{(S)} ∈ R^{|V| × D_h} and ŷ ∈ R^{|V|}, where V is the vocabulary.

Figure 4: The inputs and outputs of a neuron of an RNN.

An example of an RNN language model is shown in Figure 5. The notation in that image is slightly different: there, the equivalent of W^{(hh)} is W_h, of W^{(hx)} is W_e, and of W^{(S)} is U. E converts word inputs x^{(t)} to word embeddings e^{(t)}. The final softmax over the vocabulary shows us the probability of the various options for token x^{(5)}, conditioned on all previous tokens. The input could be much longer than 4-5 tokens.

Figure 5: An RNN Language Model.
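Equations 5 and 6 describe one recurrence step; a minimal sketch in NumPy follows (weight shapes and the toy sequence are assumptions for illustration). Note that the same W^{(hh)}, W^{(hx)} and W^{(S)} are reused at every time-step, which is the parameter sharing discussed above.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_step(x_t, h_prev, W_hh, W_hx, W_S):
        """One RNN time-step: Equations 5 and 6."""
        h_t = sigmoid(W_hh @ h_prev + W_hx @ x_t)   # Eq. 5
        y_hat = softmax(W_S @ h_t)                  # Eq. 6: distribution over the vocabulary
        return h_t, y_hat

    # Hypothetical sizes: d-dimensional word vectors, D_h hidden units, vocabulary size V.
    d, D_h, V = 50, 64, 10000
    rng = np.random.default_rng(1)
    W_hx = rng.normal(scale=0.1, size=(D_h, d))
    W_hh = rng.normal(scale=0.1, size=(D_h, D_h))
    W_S = rng.normal(scale=0.1, size=(V, D_h))

    h = np.zeros(D_h)                      # h_0: initialization vector
    for x_t in rng.normal(size=(5, d)):    # a toy sequence of 5 word vectors
        h, y_hat = rnn_step(x_t, h, W_hh, W_hx, W_S)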

2.1 RNN Loss and Perplexity

The loss function used in RNNs is often the cross entropy error introduced in earlier notes. Equation 7 shows this function as the sum over the entire vocabulary at time-step t.

J^{(t)}(θ) = − Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})    (7)

The cross entropy error over a corpus of size T is:

J = (1/T) Σ_{t=1}^{T} J^{(t)}(θ) = − (1/T) Σ_{t=1}^{T} Σ_{j=1}^{|V|} y_{t,j} log(ŷ_{t,j})    (8)

Equation 9 is called the perplexity relationship; it is 2 raised to the power of the cross entropy error (the average negative log probability) from Equation 8. Perplexity is a measure of confusion, where lower values imply more confidence in predicting the next word in the sequence (compared to the ground truth outcome).

Perplexity = 2^J    (9)
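Because each label y_t is one-hot, only the probability assigned to the true next word survives the inner sum in Equation 8, so cross entropy and perplexity can be computed from those probabilities alone. A small sketch follows (assuming base-2 logarithms so that the 2^J convention of Equation 9 gives the usual perplexity; the example probabilities are made up):

    import numpy as np

    def cross_entropy_and_perplexity(probs_of_true_words):
        """Corpus-level cross entropy (Eq. 8, log base 2) and perplexity (Eq. 9).

        probs_of_true_words: model probability of the correct next word at each
        of the T time-steps.
        """
        probs = np.asarray(probs_of_true_words)
        J = -np.mean(np.log2(probs))   # only the true word's term survives the sum over j
        return J, 2.0 ** J

    # A model that puts probability 0.25 on every correct word has perplexity 4.
    J, ppl = cross_entropy_and_perplexity([0.25, 0.25, 0.25, 0.25])
    print(J, ppl)   # 2.0 4.0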

2.2 Advantages, Disadvantages and Applications of RNNs

RNNs have several advantages:

1. They can process input sequences of any length.
2. The model size does not increase for longer input sequence lengths.
3. Computation for step t can (in theory) use information from many steps back.
4. The same weights are applied to every time-step of the input, so there is symmetry in how inputs are processed.

However, they also have some disadvantages:

1. Computation is slow; because it is sequential, it cannot be parallelized.
2. In practice, it is difficult to access information from many steps back due to problems like vanishing and exploding gradients, which we discuss in the following subsection.

The amount of memory required to run a layer of an RNN is proportional to the number of words in the corpus. We can consider a sentence as a minibatch, and a sentence with k words would have k word vectors to be stored in memory. Also, the RNN must maintain two pairs of W, b matrices. As aforementioned, while the size of W could be very large, it does not scale with the size of the corpus (unlike the traditional language models). For an RNN with a recurrent layer of 1000 hidden units, the matrix would be 1000 × 1000 regardless of the corpus size.

RNNs can be used for many tasks, such as tagging (e.g. part-of-speech tagging, named entity recognition), sentence classification (e.g. sentiment classification), and as encoder modules (e.g. for question answering, machine translation, and many other tasks). In the latter two applications, we want a representation for the sentence, which we can obtain by taking the element-wise max or mean of all hidden states of the time-steps in that sentence.

Note: Figure 6 is an alternative representation of RNNs used in some publications. It represents the RNN hidden layer as a loop.

Figure 6: The illustration of an RNN as a loop over time-steps.

2.3 Vanishing Gradient & Gradient Explosion Problems

Recurrent neural networks propagate weight matrices from one time-step to the next. Recall that the goal of an RNN implementation is to enable propagating context information through faraway time-steps. For example, consider the following two sentences:

Sentence 1: "Jane walked into the room. John walked in too. Jane said hi to ___"

Sentence 2: "Jane walked into the room. John walked in too. It was late in the day, and everyone was walking home after a long day at work. Jane said hi to ___"

In both sentences, given their context, one can tell the answer to both blank spots is most likely "John". It is important that the RNN predicts the next word as "John", the second person, who appeared several time-steps back in both contexts. Ideally, this should be possible given what we know about RNNs so far. In practice, however, it turns out RNNs are more likely to correctly predict the blank spot in Sentence 1 than in Sentence 2. This is because during the back-propagation phase, the contribution of gradient values gradually vanishes as they propagate to earlier time-steps, as we will show below. Thus, for long sentences, the probability that "John" would be recognized as the next word reduces with the size of the context.

Below, we discuss the mathematical reasoning behind the vanishing gradient problem. Consider Equations 5 and 6 at a time-step t; to compute the RNN error, ∂E/∂W, we sum the error at each time-step. That is, ∂E_t/∂W for every time-step t is computed and accumulated.

∂E/∂W = Σ_{t=1}^{T} ∂E_t/∂W    (10)

The error for each time-step is computed by applying the chain rule to Equations 6 and 5; Equation 11 shows the corresponding differentiation. Notice that ∂h_t/∂h_k refers to the partial derivative of h_t with respect to all previous k time-steps.

∂E_t/∂W = Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)    (11)

Equation 12 shows the relationship used to compute each ∂h_t/∂h_k; this is simply a chain rule differentiation over all hidden layers within the [k, t] time interval.

∂h_t/∂h_k = ∏_{j=k+1}^{t} ∂h_j/∂h_{j-1} = ∏_{j=k+1}^{t} W^T diag[f'(h_{j-1})]    (12)

Because h ∈ R^{D_n}, each ∂h_j/∂h_{j-1} is the Jacobian matrix for h:

∂h_j/∂h_{j-1} = [ ∂h_j/∂h_{j-1,1}  ⋯  ∂h_j/∂h_{j-1,D_n} ]
              = [ ∂h_{j,1}/∂h_{j-1,1}     ⋯   ∂h_{j,1}/∂h_{j-1,D_n}   ]
                [          ⋮               ⋱             ⋮            ]
                [ ∂h_{j,D_n}/∂h_{j-1,1}   ⋯   ∂h_{j,D_n}/∂h_{j-1,D_n} ]    (13)

Putting Equations 10, 11 and 12 together, we have the following relationship:

∂E/∂W = Σ_{t=1}^{T} Σ_{k=1}^{t} (∂E_t/∂y_t) (∂y_t/∂h_t) ( ∏_{j=k+1}^{t} ∂h_j/∂h_{j-1} ) (∂h_k/∂W)    (14)

Equation 15 shows the norm of the Jacobian matrix relationship in Equation 13. Here, β_W and β_h represent the upper bound values for the norms of the two matrices. The norm of the partial gradient at each time-step t is therefore bounded through the relationship shown in Equation 15.

‖∂h_j/∂h_{j-1}‖ ≤ ‖W^T‖ ‖diag[f'(h_{j-1})]‖ ≤ β_W β_h    (15)

The norm of both matrices is calculated by taking their L2-norm. The norm of f'(h_{j-1}) can only be as large as 1 given the sigmoid non-linearity function.

‖∂h_t/∂h_k‖ = ‖ ∏_{j=k+1}^{t} ∂h_j/∂h_{j-1} ‖ ≤ (β_W β_h)^{t-k}    (16)

The exponential term (β_W β_h)^{t-k} can easily become a very small or very large number when β_W β_h is much smaller or larger than 1 and t − k is sufficiently large. Recall that a large t − k corresponds to the cross entropy error due to faraway words. The contribution of faraway words to predicting the next word at time-step t diminishes when the gradient vanishes early on.
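The bound in Equation 16 can also be observed numerically. The toy sketch below (hidden size, weight scale, and the random sequence are all assumptions, not part of the notes) multiplies successive Jacobians of a sigmoid recurrence and prints the 2-norm of the product, which shrinks roughly geometrically in t − k:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    D_h = 20
    W = rng.normal(scale=0.3, size=(D_h, D_h))   # beta_W stays modest; sigmoid' is at most 0.25
    h = rng.normal(size=D_h)

    J_prod = np.eye(D_h)                         # accumulates prod_j dh_j/dh_{j-1}
    for step in range(1, 21):
        h = sigmoid(W @ h)
        jac = np.diag(h * (1.0 - h)) @ W         # dh_j/dh_{j-1} = diag(f'(z_j)) W for a sigmoid unit
        J_prod = jac @ J_prod
        if step % 5 == 0:
            # the 2-norm decays roughly like (beta_W * beta_h)^(t-k), as in Equation 16
            print(step, np.linalg.norm(J_prod, 2))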

During experimentation, once the gradient value grows extremely large, it causes an overflow (i.e. NaN), which is easily detectable at runtime; this issue is called the Gradient Explosion Problem. When the gradient value goes to zero, however, it can go undetected while drastically reducing the learning quality of the model for far-away words in the corpus; this issue is called the Vanishing Gradient Problem. Due to vanishing gradients, we don't know whether there is no dependency between steps t and t + n in the data, or whether we simply cannot capture the true dependency because of this issue. To gain practical intuition about the vanishing gradient problem, you may visit the example website linked in the original notes.

2.4 Solution to the Exploding & Vanishing Gradients

Now that we have gained intuition about the nature of the vanishing gradients problem and how it manifests itself in deep neural networks, let us focus on a simple and practical heuristic to solve these problems.

To solve the problem of exploding gradients, Thomas Mikolov first introduced a simple heuristic solution that clips gradients to a small number whenever they explode. That is, whenever they reach a certain threshold, they are set back to a small number as shown in Algorithm 1.

Algorithm 1: Pseudo-code for norm clipping of the gradients whenever they explode
    ĝ ← ∂E/∂W
    if ‖ĝ‖ ≥ threshold then
        ĝ ← (threshold / ‖ĝ‖) ĝ
    end if

Figure 7 visualizes the effect of gradient clipping. It shows the decision surface of a small recurrent neural network with respect to its W matrix and its bias terms, b. The model consists of a single recurrent neural network unit running through a small number of time-steps; the solid arrows illustrate the training progress on each gradient descent step. When the gradient descent model hits the high error wall in the objective function, the gradient is pushed off to a far-away location on the decision surface. The clipping model produces the dashed line, where it instead pulls back the error gradient to somewhere close to the original gradient landscape.

Figure 7: Gradient explosion clipping visualization.
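Algorithm 1 translates directly into a few lines of NumPy; the sketch below (threshold and gradient values are made up for illustration) rescales the gradient to the threshold norm while preserving its direction:

    import numpy as np

    def clip_gradient(g, threshold):
        """Norm clipping as in Algorithm 1: rescale g when its norm exceeds the threshold."""
        norm = np.linalg.norm(g)
        if norm >= threshold:
            g = (threshold / norm) * g
        return g

    g = np.array([30.0, 40.0])               # exploded gradient with norm 50
    print(clip_gradient(g, threshold=5.0))   # rescaled to norm 5, direction preserved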

To solve the problem of vanishing gradients, we introduce two techniques. The first technique is, instead of initializing W^{(hh)} randomly, to start off from an identity matrix initialization. The second technique is to use Rectified Linear Units (ReLU) instead of the sigmoid function. The derivative of the ReLU is either 0 or 1. This way, gradients flow through the neurons whose derivative is 1 without getting attenuated while propagating back through time-steps.

2.5 Deep Bidirectional RNNs

So far, we have focused on RNNs that condition on past words to predict the next word in the sequence. It is possible to make predictions based on future words by having the RNN model read through the corpus backwards. Irsoy et al. show a bi-directional deep neural network; at each time-step t, this network maintains two hidden layers, one for the left-to-right propagation and another for the right-to-left propagation. To maintain two hidden layers at any time, this network consumes twice as much memory space for its weight and bias parameters. The final classification result, ŷ_t, is generated by combining the score results produced by both RNN hidden layers. Figure 8 shows the bi-directional network architecture, and Equations 17 and 18 show the mathematical formulation behind setting up the bi-directional RNN hidden layer. The only difference between these two relationships is in the direction of recursing through the corpus. Equation 19 shows the classification relationship used for predicting the next word via summarizing past and future word representations.

h_t^→ = f(W^→ x_t + V^→ h_{t-1}^→ + b^→)    (17)

h_t^← = f(W^← x_t + V^← h_{t+1}^← + b^←)    (18)

ŷ_t = g(U h_t + c) = g(U [h_t^→ ; h_t^←] + c)    (19)

Figure 8: A bi-directional RNN model. The hidden state h_t = [h_t^→ ; h_t^←] now represents (summarizes) the past and the future around a single token.
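A minimal sketch of Equations 17 and 18 follows (function and parameter names, shapes, and the sigmoid choice for f are assumptions for illustration): one left-to-right pass, one right-to-left pass, and a concatenation [h_t^→ ; h_t^←] at each time-step.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def birnn(xs, Wf, Vf, bf, Wb, Vb, bb):
        """Bidirectional RNN layer (Eqs. 17-18): one left-to-right pass, one
        right-to-left pass, then concatenate the two hidden states per time-step."""
        T, D_h = len(xs), bf.shape[0]
        h_fwd = np.zeros((T, D_h))
        h_bwd = np.zeros((T, D_h))
        h = np.zeros(D_h)
        for t in range(T):                        # left-to-right recursion (Eq. 17)
            h = sigmoid(Wf @ xs[t] + Vf @ h + bf)
            h_fwd[t] = h
        h = np.zeros(D_h)
        for t in reversed(range(T)):              # right-to-left recursion (Eq. 18)
            h = sigmoid(Wb @ xs[t] + Vb @ h + bb)
            h_bwd[t] = h
        return np.concatenate([h_fwd, h_bwd], axis=1)   # [h_t -> ; h_t <-] per time-step

    d, D_h, T = 8, 16, 5
    rng = np.random.default_rng(2)
    xs = rng.normal(size=(T, d))
    shapes = [(D_h, d), (D_h, D_h), (D_h,), (D_h, d), (D_h, D_h), (D_h,)]
    Wf, Vf, bf, Wb, Vb, bb = [rng.normal(scale=0.1, size=s) for s in shapes]
    H = birnn(xs, Wf, Vf, bf, Wb, Vb, bb)
    print(H.shape)   # (5, 32): each row summarizes past and future around one token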

RNNs can also be multi-layered. Figure 9 shows a multi-layer bi-directional RNN in which each lower layer feeds the next layer. As shown in this figure, at time-step t each intermediate neuron receives one set of parameters from the previous time-step (in the same RNN layer) and two sets of parameters from the previous RNN hidden layer: one input comes from the left-to-right RNN and the other from the right-to-left RNN.

To construct a Deep RNN with L layers, the above relationships are modified to the relationships in Equations 20 and 21, where the input to each intermediate neuron at level i is the output of the RNN at layer i − 1 at the same time-step t. The output, ŷ_t, at each time-step is the result of propagating the input through all hidden layers (Equation 22).

h_t^{→(i)} = f(W^{→(i)} h_t^{(i-1)} + V^{→(i)} h_{t-1}^{→(i)} + b^{→(i)})    (20)

h_t^{←(i)} = f(W^{←(i)} h_t^{(i-1)} + V^{←(i)} h_{t+1}^{←(i)} + b^{←(i)})    (21)

ŷ_t = g(U h_t + c) = g(U [h_t^{→(L)} ; h_t^{←(L)}] + c)    (22)

Figure 9: A deep bi-directional RNN with three RNN layers. Each layer passes an intermediate sequential representation to the next.

2.6 Application: RNN Translation Model

Traditional translation models are quite complex; they consist of numerous machine learning algorithms applied to different stages of the language translation pipeline. In this section, we discuss the potential for adopting RNNs as a replacement for traditional translation modules. Consider the RNN example model shown in Figure 10; here, the German phrase "Echt dicke Kiste" is translated to "Awesome sauce". The first three hidden-layer time-steps encode the German words into language word features (h_3). The last two time-steps decode h_3 into English word outputs. Equation 23 shows the relationship for the encoder stage, and Equations 24 and 25 show the equations for the decoder stage.

h_t = φ(h_{t-1}, x_t) = f(W^{(hh)} h_{t-1} + W^{(hx)} x_t)    (23)

h_t = φ(h_{t-1}) = f(W^{(hh)} h_{t-1})    (24)

y_t = softmax(W^{(S)} h_t)    (25)

Figure 10: An RNN-based translation model. The first three RNN hidden layers belong to the source language model encoder, and the last two belong to the destination language model decoder.

One may naively assume that this RNN model, along with the cross-entropy function shown in Equation 26, can produce high-accuracy translation results. In practice, however, several extensions need to be added to the model to improve its translation accuracy.

max_θ (1/N) Σ_{n=1}^{N} log p_θ(y^{(n)} | x^{(n)})    (26)

Extension I: train different RNN weights for encoding and decoding. This decouples the two units and allows for more accurate prediction by each of the two RNN modules. This means the φ() functions in Equations 23 and 24 would have different W^{(hh)} matrices.

Extension II: compute every hidden state in the decoder using three different inputs:

- the previous hidden state (standard),
- the last hidden layer of the encoder (c = h_T in Figure 11), and
- the previous predicted output word, ŷ_{t-1}.

Combining the above three inputs transforms the φ function in the decoder of Equation 24 into the one in Equation 27. Figure 11 illustrates this model.

h_t = φ(h_{t-1}, c, y_{t-1})    (27)

Figure 11: Language model with three inputs to each decoder neuron: (h_{t-1}, c, y_{t-1}).
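The basic encoder-decoder of Equations 23-25, before any of the extensions, reads as follows in NumPy (a sketch with made-up dimensions and random weights; a real model would separate encoder and decoder weights as Extension I suggests and feed the decoder the extra inputs of Equation 27):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def encode_decode(src_vectors, n_out, W_hh, W_hx, W_S):
        """Sketch of Eqs. 23-25: encode the source words into a final hidden state,
        then decode output distributions from that state alone."""
        h = np.zeros(W_hh.shape[0])
        for x_t in src_vectors:                  # encoder: h_t = f(W_hh h_{t-1} + W_hx x_t)
            h = sigmoid(W_hh @ h + W_hx @ x_t)
        outputs = []
        for _ in range(n_out):                   # decoder: h_t = f(W_hh h_{t-1}), y_t = softmax(W_S h_t)
            h = sigmoid(W_hh @ h)
            outputs.append(softmax(W_S @ h))
        return outputs

    d, D_h, V = 8, 16, 100
    rng = np.random.default_rng(3)
    src = rng.normal(size=(3, d))                # e.g. three source word vectors ("Echt dicke Kiste")
    W_hh = rng.normal(scale=0.1, size=(D_h, D_h))
    W_hx = rng.normal(scale=0.1, size=(D_h, d))
    W_S = rng.normal(scale=0.1, size=(V, D_h))
    ys = encode_decode(src, n_out=2, W_hh=W_hh, W_hx=W_hx, W_S=W_S)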

Extension III: train deep recurrent neural networks using multiple RNN layers, as discussed earlier in this chapter. Deeper layers often improve prediction accuracy due to their higher learning capacity. Of course, this implies that a large training corpus must be used to train the model.

Extension IV: train bi-directional encoders to improve accuracy, similar to what was discussed earlier in this chapter.

Extension V: given a word sequence A B C in German whose translation is X Y in English, instead of training the RNN using A B C → X Y, train it using C B A → X Y. The intuition behind this technique is that A is more likely to be translated to X. Thus, given the vanishing gradient problem discussed earlier, reversing the order of the input words can help reduce the error rate in generating the output phrase.

3 Gated Recurrent Units

Beyond the extensions discussed so far, RNNs have been found to perform better with the use of more complex units for activation. So far, we have discussed methods that transition from hidden state h_{t-1} to h_t using an affine transformation and a point-wise nonlinearity. Here, we discuss the use of a gated activation function, thereby modifying the RNN architecture. What motivates this? Well, although RNNs can theoretically capture long-term dependencies, they are very hard to actually train to do this. Gated recurrent units are designed in a manner that gives them more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Let us see mathematically how a GRU uses h_{t-1} and x_t to generate the next hidden state h_t. We will then dive into the intuition behind this architecture.

z_t = σ(W^{(z)} x_t + U^{(z)} h_{t-1})    (Update gate)
r_t = σ(W^{(r)} x_t + U^{(r)} h_{t-1})    (Reset gate)
h̃_t = tanh(r_t ∘ U h_{t-1} + W x_t)    (New memory)
h_t = (1 − z_t) ∘ h̃_t + z_t ∘ h_{t-1}    (Hidden state)
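The four GRU equations above map one-to-one onto code. A minimal NumPy step function follows (shapes and the toy loop are assumptions for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
        """One GRU step, following the four stages above."""
        z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
        r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
        h_tilde = np.tanh(r * (U @ h_prev) + W @ x_t)   # new memory
        h_t = (1.0 - z) * h_tilde + z * h_prev          # hidden state
        return h_t

    d, D_h = 8, 16
    rng = np.random.default_rng(4)
    Wz, Wr, W = [rng.normal(scale=0.1, size=(D_h, d)) for _ in range(3)]
    Uz, Ur, U = [rng.normal(scale=0.1, size=(D_h, D_h)) for _ in range(3)]
    h = np.zeros(D_h)
    for x_t in rng.normal(size=(5, d)):   # a toy sequence of 5 word vectors
        h = gru_step(x_t, h, Wz, Uz, Wr, Ur, W, U)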

The above equations can be thought of as a GRU's four fundamental operational stages, and they have intuitive interpretations that make this model much more intellectually satisfying (see Figure 12):

1. New memory generation: A new memory h̃_t is the consolidation of a new input word x_t with the past hidden state h_{t-1}. Anthropomorphically, this stage is the one that knows the recipe for combining a newly observed word with the past hidden state h_{t-1} to summarize this new word in light of the contextual past as the vector h̃_t.

2. Reset gate: The reset signal r_t is responsible for determining how important h_{t-1} is to the summarization h̃_t. The reset gate has the ability to completely diminish the past hidden state if it finds that h_{t-1} is irrelevant to the computation of the new memory.

3. Update gate: The update signal z_t is responsible for determining how much of h_{t-1} should be carried forward to the next state. For instance, if z_t ≈ 1, then h_{t-1} is almost entirely copied out to h_t. Conversely, if z_t ≈ 0, then mostly the new memory h̃_t is forwarded to the next hidden state.

4. Hidden state: The hidden state h_t is finally generated using the past hidden input h_{t-1} and the new memory h̃_t, with the advice of the update gate.

It is important to note that to train a GRU we need to learn all the different parameters: W, U, W^{(r)}, U^{(r)}, W^{(z)}, U^{(z)}. These follow the same backpropagation procedure we have seen in the past.

Figure 12: The detailed internals of a GRU.

4 Long Short-Term Memories

Long Short-Term Memories (LSTMs) are another type of complex activation unit that differs a little from GRUs. The motivation for using these is similar to that for GRUs; however, the architecture of such units does differ. Let us first take a look at the mathematical formulation of LSTM units before diving into the intuition behind this design:

i_t = σ(W^{(i)} x_t + U^{(i)} h_{t-1})    (Input gate)
f_t = σ(W^{(f)} x_t + U^{(f)} h_{t-1})    (Forget gate)
o_t = σ(W^{(o)} x_t + U^{(o)} h_{t-1})    (Output/Exposure gate)
c̃_t = tanh(W^{(c)} x_t + U^{(c)} h_{t-1})    (New memory cell)
c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t    (Final memory cell)
h_t = o_t ∘ tanh(c_t)

Figure 13: The detailed internals of an LSTM.
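Likewise, the LSTM equations above can be transcribed directly; the sketch below (the parameter dictionary, names, and sizes are assumptions for illustration) carries both the memory cell c_t and the exposed hidden state h_t across time-steps:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, p):
        """One LSTM step following the equations above; p holds the weight matrices."""
        i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev)        # input gate
        f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev)        # forget gate
        o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev)        # output/exposure gate
        c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev)  # new memory cell
        c_t = f * c_prev + i * c_tilde                       # final memory cell
        h_t = o * np.tanh(c_t)                               # exposed hidden state
        return h_t, c_t

    d, D_h = 8, 16
    rng = np.random.default_rng(5)
    p = {k: rng.normal(scale=0.1, size=(D_h, d)) for k in ["Wi", "Wf", "Wo", "Wc"]}
    p.update({k: rng.normal(scale=0.1, size=(D_h, D_h)) for k in ["Ui", "Uf", "Uo", "Uc"]})
    h, c = np.zeros(D_h), np.zeros(D_h)
    for x_t in rng.normal(size=(5, d)):   # a toy sequence of 5 word vectors
        h, c = lstm_step(x_t, h, c, p)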

We can gain intuition for the structure of an LSTM by thinking of its architecture as consisting of the following stages:

1. New memory generation: This stage is analogous to the new memory generation stage we saw in GRUs. We essentially use the input word x_t and the past hidden state h_{t-1} to generate a new memory c̃_t which includes aspects of the new word x^{(t)}.

2. Input gate: We see that the new memory generation stage doesn't check whether the new word is even important before generating the new memory; this is exactly the input gate's function. The input gate uses the input word and the past hidden state to determine whether or not the input is worth preserving, and it is thus used to gate the new memory. It produces i_t as an indicator of this information.

3. Forget gate: This gate is similar to the input gate except that it does not make a determination about the usefulness of the input word; instead, it makes an assessment of whether the past memory cell is useful for the computation of the current memory cell. Thus, the forget gate looks at the input word and the past hidden state and produces f_t.

4. Final memory generation: This stage first takes the advice of the forget gate f_t and accordingly forgets the past memory c_{t-1}. Similarly, it takes the advice of the input gate i_t and accordingly gates the new memory c̃_t. It then sums these two results to produce the final memory c_t.

5. Output/Exposure gate: This is a gate that does not explicitly exist in GRUs. Its purpose is to separate the final memory from the hidden state. The final memory c_t contains a lot of information that is not necessarily required to be saved in the hidden state. Hidden states are used in every single gate of an LSTM, and thus this gate makes the assessment regarding which parts of the memory c_t need to be exposed/present in the hidden state h_t. The signal it produces to indicate this is o_t, and it is used to gate the point-wise tanh of the memory.
