Natural Language Understanding. Lecture 12: Recurrent Neural Networks and LSTMs
Natural Language Understanding, Lecture 12: Recurrent Neural Networks and LSTMs. Adam Lopez (credits: Mirella Lapata and Frank Keller), 26 January 2018, School of Informatics, University of Edinburgh.

Outline: Recap: probability, language models, and feedforward networks; Simple Recurrent Networks; Backpropagation Through Time; Long short-term memory.

Reading: Mikolov et al. (2010), Olah (2015).

Recap: probability, language models, and feedforward networks

Most models in NLP are probabilistic models, e.g. a language model decomposed with the chain rule of probability:

$P(w_1 \dots w_k) = \prod_{i=1}^{k} P(w_i \mid w_1, \dots, w_{i-1})$

Modeling decision: the Markov assumption:

$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$

Rules of probability (remember: vocabulary V is finite): $P : V \to \mathbb{R}$ with

$\sum_{w \in V} P(w \mid w_{i-n+1}, \dots, w_{i-1}) = 1$
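To make the chain-rule decomposition and the Markov assumption concrete, here is a minimal sketch (not from the lecture) of a bigram (n = 2) model in Python; the toy counts and the helper names bigram_prob and sentence_prob are illustrative inventions:

```python
# A minimal sketch of the chain-rule decomposition under a bigram
# (n = 2) Markov assumption. The toy counts are made up; a real
# model would estimate them from a corpus.
counts = {
    ("<s>", "the"): 4, ("the", "roses"): 2,
    ("roses", "are"): 2, ("are", "red"): 2,
}
context_totals = {"<s>": 4, "the": 4, "roses": 2, "are": 2}

def bigram_prob(w, prev):
    """P(w | prev), estimated by relative frequency."""
    return counts.get((prev, w), 0) / context_totals[prev]

def sentence_prob(words):
    """P(w_1 ... w_k) = prod_i P(w_i | w_{i-1}), by the chain rule
    plus the Markov assumption."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= bigram_prob(w, prev)
        prev = w
    return p

print(sentence_prob(["the", "roses", "are", "red"]))  # 0.5
```

Note that each factor conditions only on the previous word; increasing n enlarges the context the model can see, which is exactly the limitation the rest of the lecture addresses.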
MLPs (aka deep NNs) are functions from a vector to a vector

Probability distributions are vectors! Example: "Summer is hot, winter is ___", with output distribution: cold 0.6, grey 0.3, winter 0.1, is 0, hot 0, summer 0.

What functions can we use?
Matrix multiplication: converts an m-element vector to an n-element vector. Parameters are usually of this form.
Sigmoid, exp, tanh, ReLU, etc.: elementwise nonlinear transforms from an m-element vector to an m-element vector.
Concatenation: combines an m-element and an n-element vector into an (m+n)-element vector.
Multiple functions can also share input and substructure.

Softmax will convert any vector to a probability distribution.

Elements of discrete vocabularies are vectors! Use one-hot encoding to represent any element of a finite set.

Feedforward LM: a function from n vectors (the one-hot context words) to a vector (a distribution over the next word). A short sketch of one-hot encoding and softmax follows below.
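Here is a minimal numpy sketch of the two encodings on this slide: one-hot vectors for a discrete vocabulary, and softmax to turn an arbitrary vector into a probability distribution. The tiny vocabulary and the random weight matrix W are illustrative assumptions, not lecture code:

```python
import numpy as np

vocab = ["cold", "grey", "winter", "is", "hot", "summer"]

def one_hot(word):
    # a vector of zeros with a single 1 at the word's index
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), len(vocab)))  # untrained weights

x = one_hot("is")             # input vector for the current word
y = softmax(W @ x)            # nonnegative entries that sum to 1
print(y.sum())                # 1.0 (up to floating point)
```

Whatever the matrix multiplication produces, softmax maps it onto the probability simplex, which is why it appears as the output layer of the language models below.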
How much context do we need?

The roses are red.
The roses in the vase are red.
The roses in the vase by the door are red.
The roses in the vase by the door to the kitchen are red.

Captain Ahab nursed his grudge for many years before seeking the White ___
Donald Trump nursed his grudge for many years before seeking the White ___

Simple Recurrent Networks

Modeling Context

Context is important in language modeling: n-gram language models use a limited context (fixed n); feedforward networks can be used for language modeling, but their input is also of fixed size; yet linguistic dependencies can be arbitrarily long. This is where recurrent neural networks come in: the input of an RNN includes a copy of the previous hidden layer of the network; effectively, the RNN buffers all the inputs it has seen before; it can thus model context dependencies of arbitrary length. We will look at simple recurrent networks first.

Architecture

The simple recurrent network only looks back one time step. [Figure: input x(t) and previous state s(t-1) feed the new state s(t) through weight matrices V and U; s(t) produces the output y(t) through W.]
Architecture

We have input layer x, hidden layer s (the state), and output layer y. The input at time t is x(t), the output is y(t), and the hidden layer is s(t).

$s_j(t) = f(\text{net}_j(t))$   (1)
$\text{net}_j(t) = \sum_i x_i(t)\, v_{ji} + \sum_l s_l(t-1)\, u_{jl}$   (2)
$y_k(t) = g(\text{net}_k(t))$   (3)
$\text{net}_k(t) = \sum_j s_j(t)\, w_{kj}$   (4)

where f(z) is the sigmoid and g(z) the softmax function:

$f(z) = \frac{1}{1 + e^{-z}}$   $g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}$

Input and Output

For initialization, set s and x to small random values; for each time step, copy s(t-1) and use it to compute s(t); vector x(t) uses 1-of-N (one-hot) encoding over the words in the vocabulary; output vector y(t) is a probability distribution over the next word given the current word w(t) and context s(t-1); the size of the hidden layer is usually 30-500 units, depending on the size of the training data. (A numpy sketch of one forward step follows below.)

Training

We can use standard backprop with stochastic gradient descent: simply treat the network as a feedforward network with s(t-1) as additional input; backpropagate the error to adjust the weight matrices U and V; present all of the training data in each epoch; test on validation data to see if the log-likelihood improves; adjust the learning rate if necessary.

Error signal for training: $\text{error}(t) = \text{desired}(t) - y(t)$, where desired(t) is the one-hot encoding of the correct next word.

Backpropagation Through Time
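The following is a direct numpy transcription of equations (1)-(4), one forward step of the simple recurrent network. The sizes, random weights, and the helper name srn_step are illustrative assumptions; a real model learns V, U, W by backpropagation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x_t, s_prev, V, U, W):
    """Compute s(t) and y(t) from x(t) and s(t-1)."""
    s_t = sigmoid(V @ x_t + U @ s_prev)   # equations (1)-(2)
    y_t = softmax(W @ s_t)                # equations (3)-(4)
    return s_t, y_t

vocab_size, hidden_size = 10, 5
rng = np.random.default_rng(1)
V = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

s = rng.uniform(-0.1, 0.1, size=hidden_size)  # small random initial state
x = np.eye(vocab_size)[3]                     # one-hot current word
s, y = srn_step(x, s, V, U, W)
print(y.sum())  # a distribution over the next word: sums to 1
```

Running srn_step in a loop over a sentence, feeding each returned s back in, is what lets the state carry context of arbitrary length.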
From Simple to Full RNNs

Let's drop the assumption that only the hidden layer from the previous time step is used; instead use all previous time steps. We can think of this as unfolding over time: the RNN is unfolded into a sequence of feedforward networks; we need a new learning algorithm: backpropagation through time (BPTT).

Architecture

The full RNN looks at all the previous time steps. [Figure: the unfolded network, with inputs x(t-2), x(t-1), x(t) feeding states s(t-2), s(t-1), s(t) through V, each state feeding the next through U back to s(t-3), and s(t) producing y(t) through W.]

Standard Backpropagation

For output units, we update the weights W using:

$\Delta w_{kj} = \eta \sum_p \delta_{pk}\, s_{pj}$ with $\delta_{pk} = (d_{pk} - y_{pk})\, g'(\text{net}_{pk})$

where d_{pk} is the desired output of unit k for training pattern p. For hidden units, we update the weights V using:

$\Delta v_{ji} = \eta \sum_p \delta_{pj}\, x_{pi}$ with $\delta_{pj} = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

This is just standard backprop, with notation adjusted for RNNs!

Going Back in Time

If we only go back one time step, then we can update the weights U using the standard delta rule:

$\Delta u_{jl} = \eta \sum_p \delta_{pj}(t)\, s_{pl}(t-1)$ with $\delta_{pj}(t) = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

However, if we go further back in time, then we need to apply the delta rule to the previous time step as well:

$\delta_{pj}(t-1) = \sum_h \delta_{ph}(t)\, u_{hj}\, f'(s_{pj}(t-1))$

where h is the index for the hidden unit at time step t, and j for the hidden unit at time step t-1. (A numpy sketch of this recursion follows below.)
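Here is a sketch of the delta recursion above for a sigmoid hidden layer, using the identity f'(net) = s(1 - s). The function name bptt_grad_U and the random test values are illustrative assumptions, not lecture code:

```python
import numpy as np

def bptt_grad_U(delta_out, W, U, states, tau):
    """delta_out: output-layer delta at time t.
    states: [s(t), s(t-1), ..., s(t-tau)] hidden activations.
    Returns the gradient for U summed over tau time steps."""
    # hidden delta at time t, backpropagated from the output layer
    delta = (W.T @ delta_out) * states[0] * (1.0 - states[0])
    dU = np.zeros_like(U)
    for k in range(1, tau + 1):
        dU += np.outer(delta, states[k])  # delta(t-k+1) s(t-k)^T
        # one step further back: delta(t-k) = [U^T delta] * f'(s(t-k))
        delta = (U.T @ delta) * states[k] * (1.0 - states[k])
    return dU

rng = np.random.default_rng(2)
H, K, tau = 5, 10, 3
W = rng.normal(size=(K, H))
U = rng.normal(size=(H, H))
states = [rng.uniform(0.1, 0.9, size=H) for _ in range(tau + 1)]
delta_out = rng.normal(size=K)    # e.g. desired(t) - y(t)
print(bptt_grad_U(delta_out, W, U, states, tau).shape)  # (5, 5)
```

Each pass through the loop applies the slide's recursion once, so the contribution from step t-k has been multiplied by U and by f' exactly k times.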
Going Back in Time

We can do this for an arbitrary number of time steps τ, adding up the resulting deltas to compute $\Delta u_{jl}$. The RNN effectively becomes a deep network of depth τ. For language modeling, Mikolov et al. show that increased τ improves performance.

As we backpropagate through time, gradients tend toward 0

We adjust U using backprop through time. For time step t:

$\Delta u_{jl} = \eta \sum_p \delta_{pj}(t)\, s_{pl}(t-1)$ with $\delta_{pj}(t) = \Big[\sum_k \delta_{pk}\, w_{kj}\Big] f'(\text{net}_{pj})$

For time step t-1:

$\delta_{pj}(t-1) = \sum_h \delta_{ph}(t)\, u_{hj}\, f'(s_{pj}(t-1))$

For time step t-2:

$\delta_{pj}(t-2) = \sum_h \delta_{ph}(t-1)\, u_{hj}\, f'(s_{pj}(t-2)) = \sum_h \sum_{h_1} \delta_{ph_1}(t)\, u_{h_1 h}\, f'(s_{ph}(t-1))\, u_{hj}\, f'(s_{pj}(t-2))$

At every time step, we multiply the weights with another gradient. The gradients are < 1, so the deltas become smaller and smaller. So in fact, the RNN is not able to learn long-range dependencies well, as the gradient vanishes: it rapidly forgets previous inputs. (The short numerical illustration below makes this concrete.)

[Source: https://theclevermachine.wordpress.com/]
[Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
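A short numerical illustration of the vanishing gradient (my own, not from the lecture): each step back in time multiplies the delta by U^T and by the sigmoid derivative s(1 - s), which is at most 0.25, so its norm shrinks roughly geometrically. The weight scale here is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)
H = 5
U = rng.normal(scale=0.5, size=(H, H))
s = rng.uniform(0.1, 0.9, size=H)   # fixed hidden activations
delta = rng.normal(size=H)

for t in range(1, 21):
    delta = (U.T @ delta) * s * (1.0 - s)   # one step back in time
    if t % 5 == 0:
        print(t, np.linalg.norm(delta))     # norm heads toward 0
```

After a couple of dozen steps the delta is numerically negligible, which is why distant inputs contribute almost nothing to the weight updates.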
A better RNN: Long Short-term Memory

Solution: the network can sometimes pass on information from previous time steps unchanged, so that it can learn from distant inputs: long short-term memory.

Architecture of the LSTM

To achieve this, we need to make the units of the network more complicated: LSTMs have a hidden layer of memory blocks; each block contains a memory cell and three multiplicative units: the input, output, and forget gates; the gates are trainable: each block can learn whether to keep information across time steps or not. In contrast, the RNN uses simple hidden units, which just sum the input and pass it through an activation function.

The Gates and the Memory Cell

Each memory block consists of four units:
Input gate: controls whether the input to the block is passed on to the memory cell or ignored;
Output gate: controls whether the current activation vector of the memory cell is passed on to the output layer or not;
Forget gate: controls whether the activation vector of the memory cell is reset to zero or maintained;
Memory cell: stores the current activation vector, with a connection to itself controlled by the forget gate.

There are also peephole connections; we won't discuss these. (A one-step sketch without peepholes follows below.)

[Figure legend: O = open gate, -- = closed gate; black = high activation, white = low activation. Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
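The following is a minimal one-step LSTM memory-block sketch, with peephole connections omitted as in the lecture; bias terms are also left out to keep it short, and the weight values and the name lstm_step are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wi, Wo, Wf, Wc):
    """x_t: input; h_prev: previous block output; c_prev: memory cell."""
    z = np.concatenate([x_t, h_prev])  # all four units see the same input
    i = sigmoid(Wi @ z)                # input gate
    o = sigmoid(Wo @ z)                # output gate
    f = sigmoid(Wf @ z)                # forget gate
    g = np.tanh(Wc @ z)                # block input activation
    c = f * c_prev + i * g             # linear memory cell: no squashing
    h = o * np.tanh(c)                 # only what the output gate lets through
    return h, c

n_in, n_hid = 4, 3
rng = np.random.default_rng(4)
Wi, Wo, Wf, Wc = [rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for _ in range(4)]
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, Wi, Wo, Wf, Wc)
print(h.shape, c.shape)  # (3,) (3,)
```

If f stays near 1 and i near 0, the cell value c is copied forward unchanged, which is exactly the mechanism the vanishing-gradient discussion below relies on.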
A Single LSTM Memory Block / RNN Unit compared to LSTM Memory Block

[Figures: a simple SRN unit next to a full LSTM memory block, showing the block input, memory cell with peepholes, input/forget/output gates, and block output. Legend: unweighted connection; weighted connection; connection with time-lag; branching point; multiplication; sum over all inputs; g = gate activation function (always sigmoid); input activation function (usually tanh); output activation function (usually tanh). Sources: Klaus Greff et al.: LSTM: A Search Space Odyssey, 2015; Graves, Supervised Sequence Labelling with RNNs, 2012.]

The Gates and the Memory Cell

Gates are regular hidden units: they sum their input and pass it through a sigmoid activation function; all four inputs to the block are the same: the input layer and the state layer (the hidden layer at the previous time step); all gates have multiplicative connections: if the activation is close to zero, then the gate doesn't let anything through; the memory cell itself is linear: it has no activation function; but the block as a whole has input and output activation functions (can be tanh or sigmoid); all connections within the block are unweighted: they just pass on information (i.e., copy the incoming vector); the only output that the rest of the network sees is what the output gate lets through.

Putting LSTM Memory Blocks Together

Network with four input units, a hidden layer of two memory blocks, and five output units. [Source: Graves, Supervised Sequence Labelling with RNNs, 2012.]
Vanishing Gradients Again

Why does this solve the vanishing gradient problem? The memory cell is linear, so its gradient doesn't vanish; an LSTM block can retain information indefinitely: if the forget gate is open (close to 1) and the input gate is closed (close to 0), then the activation of the cell persists; in addition, the block can decide when to output information by opening the output gate; the block can therefore retain information over an arbitrary number of time steps before it outputs it; the block learns when to accept input, produce output, and forget information: the gates have trainable weights.

Applications

LSTMs are useful for lots of sequence labeling tasks: part-of-speech tagging and parsing; semantic role labeling; opinion mining. With modification, they are also widely used for sequence-to-sequence problems: machine translation; question answering; summarization; sentence compression and simplification. We will see some of these applications in the rest of the course.

Summary

Recurrent networks encode a complete sequence. RNNs can be trained with standard backprop. We can also unfold an RNN over time and train it with backpropagation through time; this turns the RNN into a deep network and gives even better language modeling performance. Backprop through time with RNNs has the problem that gradients vanish with increasing time steps. The LSTM is a way of addressing this problem: it replaces additive hidden units with complex memory blocks.