Recurrent Neural Networks (RNNs) - Lecture 9: Networks for Sequential Data (RNNs & LSTMs)


DD2424, September 6, 2017

RNNs are a family of networks for processing sequential data.
- An RNN applies the same function recursively when traversing the network's graph structure.
- An RNN encodes a sequence x_1, x_2, ..., x_τ into a fixed-length hidden vector h_τ. The size of h_τ is independent of τ.
- Amazingly flexible and powerful high-level architecture.

RNN with no outputs

The graph displays the processing of information at each time step: information from the input x is incorporated into the state h (through a delay of one time step), and the state h is passed forward in time.

RNN: how the hidden states are generated

Most recurrent neural networks have a function f,

    h_t = f(h_{t-1}, x_t; θ)

that defines their hidden state over time, where
- h_t is the hidden state at time t (a vector),
- x_t is the input vector at time t,
- θ is the parameters of f.

θ remains constant as t changes: the same function f with the same parameter values is applied at each iteration. Unrolling the RNN in time gives the chain ... -> h_{t-1} -> h_t -> h_{t+1} -> ..., with f applied (and x_t fed in) at every step.
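To make the recurrence concrete, here is a minimal numpy sketch (illustrative code, not from the lecture; the function and variable names are my own). The particular f used in the example anticipates the vanilla tanh update introduced on the next slides.

```python
import numpy as np

def run_rnn(f, h0, xs, theta):
    """Apply the same transition function f with the same parameters theta
    at every time step: h_t = f(h_{t-1}, x_t, theta)."""
    h = h0
    hidden_states = []
    for x in xs:
        h = f(h, x, theta)
        hidden_states.append(h)
    return hidden_states            # h_1, ..., h_tau; h_tau summarizes the whole sequence

# Example with the vanilla tanh update as f (weights are random placeholders).
m, d, tau = 4, 3, 5
rng = np.random.default_rng(0)
theta = {"W": 0.1 * rng.normal(size=(m, m)),   # hidden-to-hidden
         "U": 0.1 * rng.normal(size=(m, d)),   # input-to-hidden
         "b": np.zeros((m, 1))}
f = lambda h, x, p: np.tanh(p["W"] @ h + p["U"] @ x + p["b"])
hs = run_rnn(f, np.zeros((m, 1)), [rng.normal(size=(d, 1)) for _ in range(tau)], theta)
print(len(hs), hs[-1].shape)        # 5 (4, 1)
```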

RNN with outputs

Usually we also predict an output vector o_t at each time step; in the unrolled visualization each hidden state h_t produces its own output o_t.

Use cases of RNNs: output generation, representation/prediction, encoding/decoding, continuous prediction.

Back to the vanilla RNN

The state consists of a single hidden vector h_t. The initial hidden state h_0 is assumed given. For t = 1, ..., τ the RNN equations are

    a_t = W h_{t-1} + U x_t + b
    h_t = tanh(a_t)
    o_t = V h_t + c
    p_t = softmax(o_t)

Network's input:
- h_0: initial hidden state, of size m x 1
- x_t: input vector at time t, of size d x 1

Vanilla RNN equations

The state consists of a single hidden vector h_t, the initial hidden state h_0 is assumed given, and for t = 1, ..., τ the equations above apply.

Parameters of the network:
- W: weight matrix of size m x m applied to h_{t-1} (hidden-to-hidden connection)
- U: weight matrix of size m x d applied to x_t (input-to-hidden connection)
- V: weight matrix of size C x m applied to h_t (hidden-to-output connection)
- b: bias vector of size m x 1 in the equation for a_t
- c: bias vector of size C x 1 in the equation for o_t

Network's output and hidden vectors:
- a_t: hidden state at time t before the non-linearity, of size m x 1
- h_t: hidden state at time t, of size m x 1
- o_t: output vector (unnormalized log probabilities for each class) at time t, of size C x 1
- p_t: output probability vector at time t, of size C x 1

Character-level language model example
Vocabulary: [h, e, l, o]. Example training sequence: "hello".
(Slides adapted from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)
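A minimal numpy sketch of this forward pass for one sequence (illustrative code, not from the lecture; the matrix names and shapes follow the slide notation, everything else is an assumption of the sketch):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def vanilla_rnn_forward(X, h0, W, U, V, b, c):
    """Forward pass of the vanilla RNN for one sequence.
    X: list of tau input vectors of shape (d, 1); h0: (m, 1).
    W: (m, m), U: (m, d), V: (C, m), b: (m, 1), c: (C, 1)."""
    h = h0
    A, H, O, P = [], [], [], []
    for x in X:
        a = W @ h + U @ x + b          # a_t = W h_{t-1} + U x_t + b
        h = np.tanh(a)                 # h_t = tanh(a_t)
        o = V @ h + c                  # o_t = V h_t + c
        p = softmax(o)                 # p_t = softmax(o_t)
        A.append(a); H.append(h); O.append(o); P.append(p)
    return A, H, O, P

# Tiny smoke test with m=5 hidden units, d=4 input dims, C=4 classes, tau=3 steps.
rng = np.random.default_rng(1)
m, d, C, tau = 5, 4, 4, 3
W, U, V = 0.1 * rng.normal(size=(m, m)), 0.1 * rng.normal(size=(m, d)), 0.1 * rng.normal(size=(C, m))
b, c = np.zeros((m, 1)), np.zeros((C, 1))
X = [rng.normal(size=(d, 1)) for _ in range(tau)]
_, _, _, P = vanilla_rnn_forward(X, np.zeros((m, 1)), W, U, V, b, c)
print(P[-1].ravel(), P[-1].sum())      # a probability vector summing to 1
```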

Character-level language model example (continued)

Vocabulary: [h, e, l, o]. Example training sequence: "hello". At each step the network receives the current character, and its output p_t is a distribution over which character comes next.

Extend this simple approach to the full alphabet and punctuation characters.
(Figures from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)
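Below is a small illustrative numpy sketch of how such a character-level model is sampled from once trained (the weights here are random placeholders, so the output is gibberish; helper names such as sample_chars are my own):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
C = len(vocab)

def one_hot(i, n):
    v = np.zeros((n, 1)); v[i] = 1.0
    return v

def sample_chars(seed_char, n_steps, W, U, V, b, c, rng):
    """Feed the seed character, then repeatedly sample the next character
    from p_t and feed it back in as the next input."""
    h = np.zeros((W.shape[0], 1))
    x = one_hot(char_to_ix[seed_char], C)
    out = [seed_char]
    for _ in range(n_steps):
        h = np.tanh(W @ h + U @ x + b)
        o = V @ h + c
        p = np.exp(o - o.max()); p /= p.sum()
        ix = rng.choice(C, p=p.ravel())
        out.append(vocab[ix])
        x = one_hot(ix, C)          # sampled character becomes the next input
    return ''.join(out)

rng = np.random.default_rng(2)
m = 8                               # placeholder hidden size
W, U, V = 0.1 * rng.normal(size=(m, m)), 0.1 * rng.normal(size=(m, C)), 0.1 * rng.normal(size=(C, m))
print(sample_chars('h', 10, W, U, V, np.zeros((m, 1)), np.zeros((C, 1)), rng))
```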

(Figure: text generated by the character-level model at different stages of training - "at first" the samples are noise, and after training more and more they look increasingly like English.)

How do we train a vanilla RNN?

Supervised learning via a loss function & mini-batch gradient descent.

Loss defined for one training sequence:
- Have a sequence x_1, x_2, ..., x_τ of input vectors.
- For each x_t in the sequence have a target output y_t.
- Define a loss l_t between the target y_t and the prediction p_t for each t.
- Sum the loss over all time steps:

    L(x_1, ..., x_τ, y_1, ..., y_τ, W, U, V, b, c) = Σ_{t=1}^{τ} l_t
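As a tiny illustration (not lecture code), the summed sequence loss for a list of predictions, using the cross-entropy form of l_t introduced on the next slide:

```python
import numpy as np

def sequence_loss(P, Y):
    """Sum of the per-step losses l_t over one labelled sequence.
    P: list of predicted probability vectors p_t of shape (C, 1);
    Y: list of one-hot target vectors y_t of shape (C, 1).
    Uses the cross-entropy loss l_t = -log(y_t^T p_t)."""
    return sum(-np.log((y.T @ p).item()) for p, y in zip(P, Y))

# Example: three time steps, three classes, targets 0, 2, 1.
P = [np.array([[0.7], [0.2], [0.1]]),
     np.array([[0.1], [0.3], [0.6]]),
     np.array([[0.2], [0.5], [0.3]])]
Y = [np.eye(3)[:, [i]] for i in (0, 2, 1)]
print(sequence_loss(P, Y))   # -log(0.7) - log(0.6) - log(0.5) ≈ 1.56
```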

Loss function for a vanilla RNN

It is common to use the cross-entropy loss

    l_t = -log(y_t^T p_t)

and thus

    L(x_{1:τ}, y_{1:τ}, W, U, V, b, c) = Σ_{t=1}^{τ} -log(y_t^T p_t)

where x_{1:τ} = {x_1, ..., x_τ} and y_{1:τ} = {y_1, ..., y_τ}.

To implement mini-batch gradient descent we need to compute the gradients of L(x_{1:τ}, y_{1:τ}, W, U, V, b, c) w.r.t. W, U, V, b and c. You've guessed it: use back-prop.

(Figure: computational graph for the vanilla RNN loss on one labelled training sequence x_1, ..., x_τ. The chain h_0 -> a_1 -> h_1 -> ... -> a_τ -> h_τ takes in x_1, ..., x_τ; each h_t feeds o_t -> p_t, which is compared to y_t, and the per-step losses are summed into L. Bias vectors have been omitted for clarity.)

Back-prop for a vanilla RNN

Gradient of the loss for the cross-entropy & softmax layers: we know from prior dealings with the cross-entropy loss that, for t = 1, ..., τ,

    ∂l_t/∂p_t = -y_t^T / (y_t^T p_t)
    ∂p_t/∂o_t = diag(p_t) - p_t p_t^T

which combine (for a one-hot y_t) to ∂l_t/∂o_t = -(y_t - p_t)^T.
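A quick numerical sanity check of that combined gradient (illustrative code, not from the lecture): compare the analytic ∂l_t/∂o_t = (p_t - y_t)^T against central finite differences.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

# Analytic gradient: dl_t/do_t = -(y_t - p_t)^T, obtained by combining
# dl_t/dp_t = -y_t^T / (y_t^T p_t) with dp_t/do_t = diag(p_t) - p_t p_t^T.
rng = np.random.default_rng(3)
C = 4
o = rng.normal(size=(C, 1))
y = np.eye(C)[:, [2]]                        # one-hot target
p = softmax(o)
analytic = (p - y).T                         # shape (1, C)

# Numerical check by central finite differences.
eps = 1e-6
numeric = np.zeros((1, C))
for i in range(C):
    op, om = o.copy(), o.copy()
    op[i] += eps; om[i] -= eps
    lp = -np.log((y.T @ softmax(op)).item())
    lm = -np.log((y.T @ softmax(om)).item())
    numeric[0, i] = (lp - lm) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))    # ~1e-10: the two gradients agree
```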

Gradient of the loss w.r.t. V

The children of the node V in the computational graph are o_1, o_2, ..., o_τ. Thus

    ∂L/∂vec(V) = Σ_{t=1}^{τ} (∂L/∂o_t) (∂o_t/∂vec(V))

We know o_t = V h_t + c, so ∂o_t/∂vec(V) = I_C ⊗ h_t^T. From prior reshapings we then get

    ∂L/∂V = Σ_{t=1}^{τ} g_t^T h_t^T,  where g_t = ∂L/∂o_t

This is one of the gradients needed for training the network.

Gradient of the loss w.r.t. h_τ

The last hidden state h_τ has only one child, o_τ, thus

    ∂L/∂h_τ = (∂L/∂o_τ) (∂o_τ/∂h_τ)

We know o_τ = V h_τ + c, so ∂o_τ/∂h_τ = V and therefore

    ∂L/∂h_τ = (∂L/∂o_τ) V

Gradient of the loss w.r.t. h_t

If 1 <= t <= τ-1 then h_t has two children, o_t and a_{t+1}:

    ∂L/∂h_t = (∂L/∂o_t) (∂o_t/∂h_t) + (∂L/∂a_{t+1}) (∂a_{t+1}/∂h_t)

We know o_t = V h_t + c, so ∂o_t/∂h_t = V, and a_{t+1} = W h_t + U x_{t+1} + b, so ∂a_{t+1}/∂h_t = W. Thus

    ∂L/∂h_t = (∂L/∂o_t) V + (∂L/∂a_{t+1}) W

Gradient of the loss w.r.t. a_t

The gradient w.r.t. a_t is

    ∂L/∂a_t = (∂L/∂h_t) (∂h_t/∂a_t)

We know h_t = tanh(a_t), so ∂h_t/∂a_t = diag(tanh'(a_t)) = diag(1 - tanh^2(a_t)). Thus

    ∂L/∂a_t = (∂L/∂h_t) diag(1 - tanh^2(a_t))

The expression for ∂L/∂h_t involves two different time steps (it depends on ∂L/∂a_{t+1}), so we must iterate backwards in time to compute all the ∂L/∂h_t.

Recursively compute the gradients for all a_t and h_t

Assume ∂L/∂o_t has been calculated for 1 <= t <= τ.
- Calculate ∂L/∂h_τ = (∂L/∂o_τ) V and ∂L/∂a_τ = (∂L/∂h_τ) diag(1 - tanh^2(a_τ)).
- For t = τ-1, τ-2, ..., 1:
  - Compute ∂L/∂h_t = (∂L/∂o_t) V + (∂L/∂a_{t+1}) W
  - Compute ∂L/∂a_t = (∂L/∂h_t) diag(1 - tanh^2(a_t))

Gradient of the loss w.r.t. W

The children of the node W are a_1, ..., a_τ, thus

    ∂L/∂vec(W) = Σ_{t=1}^{τ} (∂L/∂a_t) (∂a_t/∂vec(W))

We know a_t = W h_{t-1} + U x_t + b, so ∂a_t/∂vec(W) = I_m ⊗ h_{t-1}^T. From prior reshapings we get

    ∂L/∂W = Σ_{t=1}^{τ} g_t^T h_{t-1}^T,  where g_t = ∂L/∂a_t
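Putting the whole recursion together, here is a compact numpy sketch of the forward pass plus back-propagation through time for one sequence (illustrative code following the slide equations, not the lecture's own implementation; it also accumulates the bias gradients, which the slides omit for clarity):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward_backward(X, Y, h0, W, U, V, b, c):
    """Forward pass plus back-propagation through time for one labelled sequence,
    following the recursions derived on these slides."""
    tau = len(X)
    # Forward pass.
    hs, as_, ps = [h0], [], []
    for x in X:
        a = W @ hs[-1] + U @ x + b
        h = np.tanh(a)
        p = softmax(V @ h + c)
        as_.append(a); hs.append(h); ps.append(p)
    L = sum(-np.log((y.T @ p).item()) for p, y in zip(ps, Y))

    # Backward pass (BPTT).
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    db, dc = np.zeros_like(b), np.zeros_like(c)
    dL_da_next = np.zeros((1, W.shape[0]))           # dL/da_{t+1}, zero beyond t = tau
    for t in reversed(range(tau)):
        g_o = (ps[t] - Y[t]).T                        # dL/do_t = (p_t - y_t)^T
        dV += g_o.T @ hs[t + 1].T                     # dL/dV  += g_t^T h_t^T
        dc += g_o.T
        dL_dh = g_o @ V + dL_da_next @ W              # dL/dh_t = dL/do_t V + dL/da_{t+1} W
        dL_da = dL_dh * (1 - np.tanh(as_[t]).T**2)    # dL/da_t = dL/dh_t diag(1 - tanh^2(a_t))
        dW += dL_da.T @ hs[t].T                       # dL/dW  += g_t^T h_{t-1}^T
        dU += dL_da.T @ X[t].T                        # dL/dU  += g_t^T x_t^T
        db += dL_da.T
        dL_da_next = dL_da
    return L, dict(W=dW, U=dU, V=dV, b=db, c=dc)

# Tiny run: m=3 hidden units, d=2 inputs, C=2 classes, tau=4 steps, random data.
rng = np.random.default_rng(4)
m, d, C, tau = 3, 2, 2, 4
W, U, V = 0.2 * rng.normal(size=(m, m)), 0.2 * rng.normal(size=(m, d)), 0.2 * rng.normal(size=(C, m))
b, c = np.zeros((m, 1)), np.zeros((C, 1))
X = [rng.normal(size=(d, 1)) for _ in range(tau)]
Y = [np.eye(C)[:, [rng.integers(C)]] for _ in range(tau)]
L, grads = forward_backward(X, Y, np.zeros((m, 1)), W, U, V, b, c)
print(L, grads["W"].shape, grads["U"].shape, grads["V"].shape)
```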

Gradient of the loss w.r.t. U

The gradient of the loss w.r.t. the node U: the children of U are a_1, ..., a_τ, thus

    ∂L/∂vec(U) = Σ_{t=1}^{τ} (∂L/∂a_t) (∂a_t/∂vec(U))

We know a_t = W h_{t-1} + U x_t + b, so ∂a_t/∂vec(U) = I_m ⊗ x_t^T. From prior reshapings we get

    ∂L/∂U = Σ_{t=1}^{τ} g_t^T x_t^T,  where g_t = ∂L/∂a_t

This is another gradient needed for training the network.

RNNs in translation applications

Language translation

Given a sentence in one language, translate it to another language.

RNN-based sentence representation (encoder): the source sentence, e.g. "le chien sur la plage", is read in word by word and summarized into a context vector.

RNN-based sentence generation (decoder): conditioned on that context, the decoder emits the target sentence one word at a time, starting from a START token and predicting P(next word is "dog"), P(next word is "on"), P(next word is "the"), P(next word is "beach"), ..., while carrying long-term information forward.

Encoder-Decoder architecture [Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014].

Image Captioning

Combine a Convolutional Neural Network (to encode the test image) with a Recurrent Neural Network (to generate the caption).

- Explain Images with Multimodal Recurrent Neural Networks, Mao et al.
- Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei
- Show and Tell: A Neural Image Caption Generator, Vinyals et al.
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.
- Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick

(Figures from Fei-Fei Li, Andrej Karpathy & Justin Johnson, Lecture 10, 8 Feb 2016.)

Conditioning the RNN on the test image: the CNN's feature vector v for the image is injected into the recurrent update, and generation starts from a <START> token.

    before: h = tanh(W_xh x + W_hh h)
    now:    h = tanh(W_xh x + W_hh h + W_ih v)

At the first step the output distribution y_0 is computed and a word is sampled, e.g. "straw".

Caption generation then proceeds step by step on the test image: the sampled word ("straw") is fed back in as the next input, the next word ("hat") is sampled from y_1, and so on. Sampling the <END> token finishes the caption.

Evaluation of the translation results

- Tricky to do automatically! Ideally we want humans to evaluate, but what do you ask?
- Can't use human evaluation for validating models - too slow and expensive.
- Use standard machine translation metrics instead: BLEU, ROUGE, CIDEr, METEOR.

Image-sentence datasets: Microsoft COCO [Tsung-Yi Lin et al. 2014], mscoco.org - currently ~120K images with ~5 sentences each.

The problem of exploding and vanishing gradients in an RNN

Focus on the gradient of the loss w.r.t. W. Take a closer look at

    ∂L/∂W = Σ_{t=1}^{τ} g_t^T h_{t-1}^T,  where g_t = ∂L/∂a_t

so ∂L/∂W depends on ∂L/∂a_t for t = 1, ..., τ. Let's take a closer look at ∂L/∂a_t.

Denote g_{o,t} = ∂L/∂o_t and D(a_t) = diag(1 - tanh^2(a_t)). Remember the recursion

    ∂L/∂h_τ = g_{o,τ} V,  ∂L/∂a_τ = g_{o,τ} V D(a_τ)
    for t = τ-1, τ-2, ..., 1:
        ∂L/∂h_t = g_{o,t} V + (∂L/∂a_{t+1}) W
        ∂L/∂a_t = (∂L/∂h_t) D(a_t)

Then you can show by recursive substitution that

    ∂L/∂h_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=1}^{j-t} D(a_{t+k}) ) W^{j-t}
    ∂L/∂a_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=0}^{j-t} D(a_{t+k}) ) W^{j-t}

Focus on Π_k D(a_{t+k}): this product likely has small values on its diagonal. Why? Each matrix D(a_{t+k}) has tanh'(a_{t+k}) on its diagonal and 0 <= tanh'(a) <= 1, so (tanh'(a))^{j-t+1} is highly likely to be small even for a not-too-large j-t+1. As a result, ∂L/∂h_t effectively only depends on the first few terms of the sum.

Focus on W^n

The factor W^{j-t} potentially has very large or very small values. Why? Remember that W has size m x m. Assume W is diagonalizable and let its eigen-decomposition be W = Q Λ Q^T, where Q is orthogonal and Λ is a diagonal matrix containing the eigenvalues of W. Then

    W^n = Q Λ^n Q^T

Let λ_1, ..., λ_m be the eigenvalues of W. Then
- if |λ_i| > 1, λ_i^n will explode as n increases;
- if |λ_i| < 1, λ_i^n -> 0 as n increases.

Thus for sufficiently large j-t, the entries of W^{j-t} can either explode or vanish.

Summary on ∂L/∂W

With g_{o,t} = ∂L/∂o_t and D(a_t) = diag(1 - tanh^2(a_t)),

    ∂L/∂h_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=1}^{j-t} D(a_{t+k}) ) W^{j-t}
    ∂L/∂a_t = Σ_{j=t}^{τ} g_{o,j} V ( Π_{k=0}^{j-t} D(a_{t+k}) ) W^{j-t}

- If W^{j-t} explodes for j-t > N, then ∂L/∂a_t explodes and therefore ∂L/∂W explodes.
- If W^{j-t} vanishes for j-t > N, then ∂L/∂a_t only has contributions from the nearby g_{o,t'} with t <= t' <= t+N, so ∂L/∂W is based on an aggregation of gradients from subsets of temporally nearby states.
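A small numerical illustration of this effect (my own example, not from the slides): build W = Q Λ Q^T with a chosen largest eigenvalue and watch the norm of W^n explode or vanish.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))        # random orthogonal Q

for eigmax, label in [(1.3, "largest |eigenvalue| > 1"),
                      (0.7, "largest |eigenvalue| < 1")]:
    Lam = np.diag(np.linspace(0.3, eigmax, m))       # eigenvalues up to eigmax
    W = Q @ Lam @ Q.T                                # W = Q Lambda Q^T
    for n in (1, 10, 50):
        Wn = np.linalg.matrix_power(W, n)            # W^n = Q Lambda^n Q^T
        print(f"{label}: ||W^{n}|| = {np.linalg.norm(Wn):.3e}")
# With eigmax = 1.3 the norm grows like 1.3^n (explodes);
# with eigmax = 0.7 it shrinks like 0.7^n (vanishes).
```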

The consequence: the network cannot learn long-range dependencies between states.

Gradient clipping - an easy solution to exploding gradients

Let G = ∂L/∂W. Then clip the gradient by its norm:

    G <- (θ / ||G||) G   if ||G|| >= θ
    G <- G               otherwise

where θ is some sensible threshold. This simple heuristic was first introduced by Thomas Mikolov: rescale the gradient to a fixed size whenever its norm exceeds the threshold. (Pascanu et al.'s gradient-explosion visualization shows why it helps: without clipping, one update that hits the high-error "wall" of the error surface throws the parameters off to a far-away region, while the rescaled, clipped step lands somewhere close.)

Solutions to exploding & vanishing gradients

Easy partial solutions to vanishing gradients:
- Solution 1: Initialize W as the identity matrix, as opposed to a random initialization.
- Solution 2: Use ReLU instead of tanh as the non-linear activation function.
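A one-function numpy sketch of this norm-clipping rule (illustrative, not the lecture's code):

```python
import numpy as np

def clip_gradient(G, theta):
    """Norm clipping: rescale G to have norm theta whenever ||G|| exceeds theta,
    otherwise leave it untouched."""
    norm = np.linalg.norm(G)
    if norm >= theta:
        return (theta / norm) * G
    return G

G = np.array([[3.0, 4.0]])           # ||G|| = 5
print(clip_gradient(G, theta=1.0))   # rescaled to norm 1: [[0.6, 0.8]]
print(clip_gradient(G, theta=10.0))  # below the threshold: returned unchanged
```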

Long Short-Term Memories (LSTMs) - capturing long-range dependencies

Even with these fixes it is still hard for an RNN to capture long-term dependencies.

LSTMs' core idea: introduce a memory cell.

LSTMs are similar to RNNs, but alongside the hidden states h_t they introduce a memory cell state c_t (in the high-level graphic, a second chain ... -> c_{t-1} -> c_t -> c_{t+1} -> ... runs in parallel with the hidden states). LSTMs have the ability to remove or add information to c_t, regulated by structures called gates, based on context. The update of c_t is designed so that gradients flow through these nodes backward in time easily. c_t then controls what information from c_{t-1} should be used to generate h_t.

LSTMs - formal details

LSTMs (Hochreiter & Schmidhuber, 1997) are better at capturing long-term dependencies. They introduce gates to calculate h_t and c_t from c_{t-1}, h_{t-1} and x_t.

Formal description of an LSTM unit:

    i_t = σ(W^(i) x_t + U^(i) h_{t-1})        (input gate)
    f_t = σ(W^(f) x_t + U^(f) h_{t-1})        (forget gate)
    o_t = σ(W^(o) x_t + U^(o) h_{t-1})        (output/exposure gate)
    c̃_t = tanh(W^(c) x_t + U^(c) h_{t-1})     (new temporary memory cell)
    c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t           (final memory cell)
    h_t = o_t ∘ tanh(c_t)

where σ(·) is the sigmoid function and ∘ denotes element-by-element multiplication.

LSTM's basic unit - the role of each part:
- New temporary memory c̃_t: use x_t and h_{t-1} to generate a new memory that includes aspects of the input x_t.
- Input gate i_t: use x_t and h_{t-1} to determine whether the temporary memory c̃_t is worth preserving.
- Forget gate f_t: assess whether the past memory cell c_{t-1} should be included in c_t.
- Updated memory state: use the forget and input gates to combine the new temporary memory and the current memory cell state to get c_t.
- Output gate o_t: decides which part of c_t should be exposed in h_t.
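A minimal numpy sketch of one LSTM step implementing exactly these equations (illustrative code; parameter names such as Wi/Ui are my own shorthand for W^(i), U^(i), and biases are omitted as on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: compute h_t, c_t from x_t, h_{t-1}, c_{t-1}.
    params holds the gate matrices Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc."""
    i = sigmoid(params["Wi"] @ x + params["Ui"] @ h_prev)        # input gate
    f = sigmoid(params["Wf"] @ x + params["Uf"] @ h_prev)        # forget gate
    o = sigmoid(params["Wo"] @ x + params["Uo"] @ h_prev)        # output gate
    c_tilde = np.tanh(params["Wc"] @ x + params["Uc"] @ h_prev)  # new temporary memory
    c = f * c_prev + i * c_tilde                                 # final memory cell
    h = o * np.tanh(c)                                           # exposed hidden state
    return h, c

# Tiny example: hidden/cell size m=3, input size d=2, random placeholder weights.
rng = np.random.default_rng(6)
m, d = 3, 2
params = {k: 0.3 * rng.normal(size=(m, d) if k.startswith("W") else (m, m))
          for k in ("Wi", "Ui", "Wf", "Uf", "Wo", "Uo", "Wc", "Uc")}
h, c = np.zeros((m, 1)), np.zeros((m, 1))
for t in range(4):
    h, c = lstm_step(rng.normal(size=(d, 1)), h, c, params)
print(h.ravel(), c.ravel())
```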

Can go deep with LSTMs.

Deep LSTM network: stack several LSTM layers so that the hidden-state sequence of one layer becomes the input sequence of the next, before producing the outputs y^(1), ..., y^(τ) and losses L^(1), ..., L^(τ).

Bi-directional LSTM network: run one LSTM over the sequence forwards and another backwards, and combine the two hidden states at each time step before producing the output y^(t) and loss L^(t).

Summary

- RNNs allow a lot of flexibility in architecture design.
- The backward flow of gradients in an RNN can explode or vanish.
- Vanilla RNNs are simple but find it hard to learn long-term dependencies.
- Exploding gradients are controlled with gradient clipping.
- Vanishing gradients are controlled with additive interactions (LSTM).
- It is common to use LSTMs: their additive interactions improve gradient flow.
