Recurrent Neural Networks. Deep Learning Lecture 5. Efstratios Gavves.
Sequential Data. So far, all tasks assumed stationary data. Neither all data nor all tasks are stationary, though.
Sequential Data: Text What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?
Memory needed: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \ldots, x_{i-1})$
Sequential data: Video
Quiz: Other sequential data? Time series data: stock exchange, biological measurements, climate measurements, market analysis. Also speech/music, user behavior in websites...
Applications
Machine Translation. The phrase in the source language is one sequence: "Today the weather is good". The phrase in the target language is also a sequence: "Погода сегодня хорошая". [Figure: encoder-decoder architecture; an encoder of LSTM cells reads "Today the weather is good <EOS>", and a decoder of LSTM cells emits "Погода сегодня хорошая <EOS>".]
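A minimal numpy sketch of this encoder-decoder loop, under stated assumptions: plain tanh RNN cells stand in for the LSTMs, the vocabulary sizes and token ids are toy values, and the weights are random and untrained, so it shows the data flow only, not a working translator.

```python
import numpy as np

rng = np.random.default_rng(0)
d_src, d_tgt, d_hid = 6, 5, 8  # assumed toy vocabulary and state sizes
U_e = rng.normal(0, 0.1, (d_hid, d_src)); W_e = rng.normal(0, 0.1, (d_hid, d_hid))
U_d = rng.normal(0, 0.1, (d_hid, d_tgt)); W_d = rng.normal(0, 0.1, (d_hid, d_hid))
V = rng.normal(0, 0.1, (d_tgt, d_hid))

def one_hot(i, d):
    v = np.zeros(d); v[i] = 1.0
    return v

def rnn_step(x, c, U, W):
    # simple RNN cell standing in for the lecture's LSTM cell
    return np.tanh(U @ x + W @ c)

# Encoder: fold the whole source sentence into one state vector.
src = [1, 3, 2, 4, 5]  # assumed token ids for "Today the weather is good"
c = np.zeros(d_hid)
for tok in src:
    c = rnn_step(one_hot(tok, d_src), c, U_e, W_e)

# Decoder: start from the encoder state and feed back its own predictions
# until it emits <EOS> (id 0 here, by assumption).
tok, out = 0, []
for _ in range(10):
    c = rnn_step(one_hot(tok, d_tgt), c, U_d, W_d)
    tok = int(np.argmax(V @ c))  # greedy decoding
    if tok == 0:
        break
    out.append(tok)
print(out)  # target token ids (meaningful only after training)
```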
Image captioning. An image is a thousand words, literally! Pretty much the same as machine translation: replace the encoder part with the output of a ConvNet, e.g. an AlexNet or a VGG16 network, and keep the decoder part to operate as a translator. [Figure: a ConvNet encodes the image; LSTM cells decode it into "Today the weather is good <EOS>".]
Question answering. Bleeding-edge research, no real consensus; very interesting open research problems. Again, pretty much like machine translation: the encoder-decoder paradigm, feeding the question to the encoder part and modelling the answer at the decoder part. Question answering with images, too (again, bleeding-edge research): how/where to add the image? What has worked so far is to add the image only in the beginning. Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room. Q: What are the people playing? A: They play beach football.
Handwriting
Recurrent Networks. [Figure: an NN cell with an input, an output, a cell state, and recurrent connections feeding the state back into the cell.]
Sequences. Next data depend on previous data; roughly equivalent to predicting what comes next: $\Pr(x) = \prod_i \Pr(x_i \mid x_1, \ldots, x_{i-1})$
Why sequences? Parameters are reused: fewer parameters, easier modelling. Generalizes well to arbitrary lengths: not constrained to a specific length. RecurrentModel("I think, therefore, I am!") and RecurrentModel("Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.") use the same model. However, often we pick a frame T instead of an arbitrary length: $\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \ldots, x_{i-1})$
Quiz: What is really a sequence? Data inside a sequence are non-i.i.d. (not identically, independently distributed): the next word depends on the previous words, ideally on all of them. We need context, and we need memory! How to model context and memory? [Example: after "I am Bond, James ..." the next word is "Bond", not "McGuire".]
One-hot vectors $x_i$: a vector with all zeros except for the active dimension. 12 words in a sequence, so 12 one-hot vectors. After the one-hot vectors, apply an embedding (Word2Vec, GloVe). Vocabulary: I, am, Bond, James, tired, ",", McGuire, "!". [Figure: each word of the sequence mapped to its one-hot vector over the vocabulary.]
Quiz: Why not just indices? One-hot representation: distance(x_"am", x_"McGuire") = distance(x_"I", x_"am"); all words are equally far apart. Index representation: q_"I" = 1, q_"am" = 2, q_"James" = 4, q_"McGuire" = 7, so distance(q_"am", q_"McGuire") = 5 while distance(q_"I", q_"am") = 1. No, because then some words get closer together for no good reason (an artificial bias between words).
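A small check of this point, plus the embedding lookup from the previous slide; the vocabulary order, index values, and the random 3-d embedding matrix are toy assumptions.

```python
import numpy as np

vocab = ["I", "am", "Bond", "James", "tired", ",", "McGuire", "!"]
idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab)); v[idx[word]] = 1.0
    return v

# One-hot: every pair of distinct words is equally far apart.
print(np.linalg.norm(one_hot("am") - one_hot("McGuire")))  # 1.414...
print(np.linalg.norm(one_hot("I") - one_hot("am")))        # 1.414... (same)

# Indices: "am" and "McGuire" end up 5x farther apart than "I" and "am",
# an artificial bias the model would have to unlearn.
print(abs(idx["am"] - idx["McGuire"]))  # 5
print(abs(idx["I"] - idx["am"]))        # 1

# Embedding: a learned matrix E maps one-hot vectors to dense vectors;
# E @ one_hot(w) simply selects column idx[w] of E.
E = np.random.default_rng(0).normal(size=(3, len(vocab)))  # assumed 3-d embedding
print(E @ one_hot("Bond"))
```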
Memory: a representation of the past. A memory must project information at timestep t onto a latent space $c_t$ using parameters $\theta$, then re-use the projected information from t at t+1: $c_{t+1} = h(x_{t+1}, c_t; \theta)$. Memory parameters $\theta$ are shared for all timesteps $t = 0, 1, \ldots$: $c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \ldots, h(x_1, c_0; \theta); \theta); \theta); \theta)$
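A minimal sketch of this shared-parameter recursion; the function h, its sizes, and the 2-step toy sequence are illustrative assumptions.

```python
import numpy as np

theta = np.random.default_rng(1).normal(0, 0.5, (4, 4 + 2))  # one shared parameter block

def h(x, c, theta):
    # project [input; previous memory] onto the latent space of c
    return np.tanh(theta @ np.concatenate([x, c]))

c = np.zeros(4)          # c_0
for x in np.eye(2):      # a toy 2-step sequence x_1, x_2
    c = h(x, c, theta)   # the same theta at every timestep
print(c)                 # c_2 = h(x_2, h(x_1, c_0; theta); theta)
```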
Memory as a Graph. Simplest model: input with parameters U, memory embedding with parameters W, output with parameters V. [Figure: the graph unrolled over timesteps t, t+1, t+2, ...; at each step the input x_t enters through U, the memory c_t is carried forward through W, and the output y_t leaves through V.]
Folding the memory. [Figure: the unrolled/unfolded network over timesteps t, t+1, t+2 (inputs x through U, memories c chained through W, outputs y through V) next to the folded network, where a single recurrent edge W feeds c_{t-1} back into c_t.]
Recurrent Neural Network (RNN). Only two equations: $c_t = \tanh(U x_t + W c_{t-1})$, $y_t = \mathrm{softmax}(V c_t)$
RNN Example. A vocabulary of 5 words. A memory of 3 units (a hyperparameter that we choose, like layer size): $c_t$: [3 × 1], W: [3 × 3]. An input projection of 3 dimensions: U: [3 × 5]. An output projection of 10 dimensions: V: [10 × 3]. Since $x_t$ is one-hot, $U x_{t=4}$ simply selects the 4th column of U. $c_t = \tanh(U x_t + W c_{t-1})$, $y_t = \mathrm{softmax}(V c_t)$
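A minimal numpy sketch of the two RNN equations with the slide's shapes; the random weights are placeholders, and the 10-dimensional output is as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (3, 5))   # input projection: 5-word vocab -> 3 memory units
W = rng.normal(0, 0.1, (3, 3))   # memory-to-memory
V = rng.normal(0, 0.1, (10, 3))  # memory -> 10-d output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

c = np.zeros(3)                      # c_0
x = np.array([0., 0., 0., 1., 0.])   # one-hot for the 4th vocabulary word
c = np.tanh(U @ x + W @ c)           # U @ x just selects the 4th column of U
y = softmax(V @ c)
print(c.shape, y.shape, y.sum())     # (3,) (10,) 1.0
```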
Rolled Network vs. MLP? What is really different? Steps instead of layers. Step parameters are shared in a Recurrent Network; in a Multi-Layer Network the parameters are different. Sometimes the intermediate outputs are not even needed; removing them, we almost end up with a standard Neural Network. [Figure: a 3-gram unrolled recurrent network, with the same U, W, V at layers/steps 1-3, next to a 3-layer neural network with distinct W_1, W_2, W_3.]
Training Recurrent Networks. Cross-entropy loss: $P = \prod_{t,k} y_{t,k}^{l_{t,k}} \Rightarrow \mathcal{L} = -\log P = \sum_{t=1}^{T} \mathcal{L}_t$ with $\mathcal{L}_t = -\sum_k l_{t,k} \log y_{t,k}$. Backpropagation Through Time (BPTT): again, the chain rule. The only difference: gradients survive over time steps.
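A minimal sketch of BPTT for the two-equation RNN above, on one short toy sequence with one-hot targets; the dimensions, token ids, and weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V_dim, H = 5, 3
U = rng.normal(0, 0.1, (H, V_dim))
W = rng.normal(0, 0.1, (H, H))
V = rng.normal(0, 0.1, (V_dim, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

inputs, targets = [0, 3, 1], [3, 1, 4]  # toy token ids: predict the next token

# Forward pass: store all states, because the backward pass reuses them.
xs, cs, ys = {}, {-1: np.zeros(H)}, {}
loss = 0.0
for t, (i, k) in enumerate(zip(inputs, targets)):
    xs[t] = np.zeros(V_dim); xs[t][i] = 1.0
    cs[t] = np.tanh(U @ xs[t] + W @ cs[t - 1])
    ys[t] = softmax(V @ cs[t])
    loss += -np.log(ys[t][k])            # cross-entropy, summed over timesteps

# Backward through time: the chain rule, with dc_next carrying the error
# signal from step t+1 back into step t through W.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dc_next = np.zeros(H)
for t in reversed(range(len(inputs))):
    dy = ys[t].copy(); dy[targets[t]] -= 1.0   # d loss / d (V c_t)
    dV += np.outer(dy, cs[t])
    dc = V.T @ dy + dc_next                    # local + future error signal
    dz = (1 - cs[t] ** 2) * dc                 # back through tanh
    dU += np.outer(dz, xs[t])
    dW += np.outer(dz, cs[t - 1])
    dc_next = W.T @ dz                         # survives into step t-1
print(loss, np.linalg.norm(dW))
```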
Training RNNs is hard. Vanishing gradients: after a few time steps the gradients become almost 0. Exploding gradients: after a few time steps the gradients become huge. Either way, the network can't capture long-term dependencies.
Alternative formulation for RNNs. An alternative formulation to derive conclusions and intuitions: $c_t = W \tanh(c_{t-1}) + U x_t + b$, $\mathcal{L} = \sum_t \mathcal{L}_t(c_t)$
Another look at the gradients. $\mathcal{L} = \mathcal{L}(c_T(c_{T-1}(\ldots c_1(x_1, c_0; W); W \ldots); W); W)$, so $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$ with $\frac{\partial c_t}{\partial c_\tau} = \frac{\partial c_t}{\partial c_{t-1}} \frac{\partial c_{t-1}}{\partial c_{t-2}} \cdots \frac{\partial c_{\tau+1}}{\partial c_\tau}$, whose norm behaves like $\eta^{t-\tau}$. Terms with small $t - \tau$ are short-term factors; terms with large $t - \tau$ are long-term factors. The RNN gradient is a recursive product of $\frac{\partial c_t}{\partial c_{t-1}}$.
Exploding/Vanishing gradients. $\frac{\partial \mathcal{L}}{\partial W} \propto \frac{\partial \mathcal{L}}{\partial c_T} \frac{\partial c_T}{\partial c_{T-1}} \frac{\partial c_{T-1}}{\partial c_{T-2}} \cdots \frac{\partial c_{t+1}}{\partial c_t} \cdots$. If every factor $\left\|\frac{\partial c_k}{\partial c_{k-1}}\right\| < 1$, the product drives $\frac{\partial \mathcal{L}}{\partial W} \to 0$: vanishing gradient. If every factor $> 1$, the product blows up: exploding gradient.
Vanishing gradients. The gradient of the error w.r.t. an intermediate cell: $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, with $\frac{\partial c_t}{\partial c_\tau} = \prod_{\tau < k \le t} \frac{\partial c_k}{\partial c_{k-1}}$ and $c_k = W \tanh(c_{k-1})$. Long-term dependencies get exponentially smaller weights.
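A small numeric illustration (not from the lecture) of why the recursive Jacobian product vanishes or explodes: for $c_k = W \tanh(c_{k-1})$ the factor is $\frac{\partial c_k}{\partial c_{k-1}} = W \,\mathrm{diag}(1 - \tanh(c_{k-1})^2)$, so each extra step multiplies the product by roughly the scale of W. The fixed point c = 0 (where tanh' = 1) is an assumption to keep the effect clean.

```python
import numpy as np

H, T = 3, 50
for scale in (0.5, 2.0):             # small vs. large recurrent weights
    W = scale * np.eye(H)
    c = np.zeros(H)                  # linearize around the fixed point c = 0
    J = np.eye(H)                    # accumulates prod_k dc_k/dc_{k-1}
    for _ in range(T):
        J = W @ np.diag(1 - np.tanh(c) ** 2) @ J
        c = W @ np.tanh(c)           # c stays at 0, where tanh' = 1
    print(scale, np.linalg.norm(J))  # ~1.5e-15 (vanishing) vs ~2e+15 (exploding)
```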
Gradient clipping for exploding gradients. Scale the gradients down to a threshold θ. Pseudocode: 1. g ← ∂L/∂W. 2. if ‖g‖ > θ: g ← (θ/‖g‖) g; else: do nothing. Simple, but works!
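A runnable version of the clipping pseudocode; the threshold value 5.0 is an arbitrary choice for illustration.

```python
import numpy as np

def clip_gradient(g, theta=5.0):
    """Rescale g to norm theta if its norm exceeds theta; otherwise leave it."""
    norm = np.linalg.norm(g)
    if norm > theta:
        g = (theta / norm) * g
    return g

g = np.array([30.0, 40.0])                   # norm 50, would explode an update
print(clip_gradient(g))                      # rescaled to norm 5 -> [3. 4.]
print(clip_gradient(np.array([0.3, 0.4])))   # unchanged, norm below threshold
```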
Rescaling vanishing gradients? Not a good solution. Weights are shared between timesteps and the loss is summed over timesteps: $\mathcal{L} = \sum_t \mathcal{L}_t$, $\frac{\partial \mathcal{L}}{\partial W} = \sum_t \frac{\partial \mathcal{L}_t}{\partial W}$, $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$. Rescaling the gradient for one timestep ($\frac{\partial \mathcal{L}_t}{\partial W}$) affects all timesteps, and the rescaling factor for one timestep does not work for another.
More intuitively: $\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}_1}{\partial W} + \frac{\partial \mathcal{L}_2}{\partial W} + \frac{\partial \mathcal{L}_3}{\partial W} + \frac{\partial \mathcal{L}_4}{\partial W} + \frac{\partial \mathcal{L}_5}{\partial W}$
More intuitively. Let's say $\frac{\partial \mathcal{L}_1}{\partial W} \propto 1$, $\frac{\partial \mathcal{L}_2}{\partial W} \propto 1/10$, $\frac{\partial \mathcal{L}_3}{\partial W} \propto 1/100$, $\frac{\partial \mathcal{L}_4}{\partial W} \propto 1/1000$, $\frac{\partial \mathcal{L}_5}{\partial W} \propto 1/10000$. Then $\frac{\partial \mathcal{L}}{\partial W} = \sum_r \frac{\partial \mathcal{L}_r}{\partial W} = 1.1111$. If $\frac{\partial \mathcal{L}}{\partial W}$ is rescaled to 1, $\frac{\partial \mathcal{L}_5}{\partial W} \sim 10^{-5}$: longer-term dependencies remain negligible. Weak recurrent modelling; learning focuses on the short term only.
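Checking the slide's arithmetic: rescaling the summed gradient to 1 leaves the long-term component as negligible as before.

```python
# Per-timestep gradient contributions from the slide.
grads = [1, 1/10, 1/100, 1/1000, 1/10000]
total = sum(grads)                     # 1.1111
rescaled = [g / total for g in grads]  # total rescaled to 1
print(total)                           # 1.1111
print(rescaled[-1])                    # ~9e-05: still negligible
```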
Recurrent networks Chaotic systems
Fixing vanishing gradients. Regularization on the recurrent weights: force the error signal not to vanish, $\Omega = \sum_t \Omega_t = \sum_t \left( \frac{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \frac{\partial c_{t+1}}{\partial c_t} \right\|}{\left\| \frac{\partial \mathcal{L}}{\partial c_{t+1}} \right\|} - 1 \right)^2$. Advanced recurrent modules: the Long Short-Term Memory (LSTM) module and the Gated Recurrent Unit (GRU) module. Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013.
Advanced Recurrent Nets. [Figure: the LSTM cell; the cell state flows from c_{t-1} to c_t along the top, through the forget gate f_t and the addition with i_t ⊙ c̃_t, while the memory flows from m_{t-1} to m_t through the tanh and the output gate o_t; the input x_t enters at the bottom.]
How to fix the vanishing gradients? The error signal over time must have a norm that is neither too large nor too small. Let's have a look at the loss function: $\frac{\partial \mathcal{L}_t}{\partial W} = \sum_{\tau=1}^{t} \frac{\partial \mathcal{L}_t}{\partial y_t} \frac{\partial y_t}{\partial c_t} \frac{\partial c_t}{\partial c_\tau} \frac{\partial c_\tau}{\partial W}$, $\frac{\partial c_t}{\partial c_\tau} = \prod_{\tau < k \le t} \frac{\partial c_k}{\partial c_{k-1}}$. Solution: have an activation function with a gradient equal to 1. The identity function has a gradient equal to 1; by using it, the gradients become neither too small nor too large.
Long Short-Term Memory. LSTM: a beefed-up RNN. Simple RNN: $c_t = W \tanh(c_{t-1}) + U x_t + b$. LSTM: $i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$, $f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$, $o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$, $\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$, $c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$, $m_t = \tanh(c_t) \odot o$. The previous state $c_{t-1}$ is connected to the new $c_t$ with no nonlinearity (identity function); the only other factor is the forget gate f, which rescales the previous LSTM state. More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell state. The cell state carries the essential information over time. [Figure: the cell-state line running along the top of the LSTM cell, from c_{t-1} to c_t.]
LSTM nonlinearities. $\sigma \in (0, 1)$: control gate, something like a switch. $\tanh \in (-1, 1)$: recurrent nonlinearity.
LSTM Step-by-Step: Step (1). E.g. an LSTM on "Yesterday she slapped me. Today she loves me." Decide what to forget and what to remember for the new memory: sigmoid ≈ 1 → remember everything; sigmoid ≈ 0 → forget everything. $f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
LSTM Step-by-Step: Step (2). Decide what new information is relevant from the new input and should be added to the new memory: modulate the input with $i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$ and generate candidate memories $\tilde{c}_t = \tanh(x_t U^{(g)} + m_{t-1} W^{(g)})$.
LSTM Step-by-Step: Step (3). Compute and update the current cell state $c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$: it depends on the previous cell state, what we decide to forget, what inputs we allow, and the candidate memories.
LSTM Step-by-Step: Step (4). Modulate the output: is the new cell state relevant? Sigmoid ≈ 1 → yes; sigmoid ≈ 0 → no. Generate the new memory $m_t = \tanh(c_t) \odot o_t$, with $o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$.
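A minimal numpy sketch of one LSTM step, following steps (1)-(4) above; the sizes and random parameters are toy assumptions, and the x U + m W layout matches the slide's row-vector convention.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3  # input and memory sizes (assumed)
U = {g: rng.normal(0, 0.1, (D, H)) for g in "ifog"}
W = {g: rng.normal(0, 0.1, (H, H)) for g in "ifog"}

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, c_prev, m_prev):
    f = sigmoid(x_t @ U["f"] + m_prev @ W["f"])        # (1) forget gate
    i = sigmoid(x_t @ U["i"] + m_prev @ W["i"])        # (2) input gate
    c_tilde = np.tanh(x_t @ U["g"] + m_prev @ W["g"])  # (2) candidate memory
    c_t = c_prev * f + c_tilde * i                     # (3) new cell state:
                                                       #     no nonlinearity on c_prev
    o = sigmoid(x_t @ U["o"] + m_prev @ W["o"])        # (4) output gate
    m_t = np.tanh(c_t) * o                             # (4) new memory/output
    return c_t, m_t

c, m = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # a 5-step toy sequence
    c, m = lstm_step(x, c, m)
print(c, m)
```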
LSTM Unrolled Network. Macroscopically very similar to standard RNNs; the engine is a bit different (more complicated). Because of their gates, LSTMs capture long- and short-term dependencies. [Figure: three unrolled LSTM cells; the cell-state line runs straight through all of them.]
Beyond RNN & LSTM. LSTM with peephole connections: the gates also have access to the previous cell state $c_{t-1}$ (not only to the memories). Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$. Bi-directional recurrent networks. Gated Recurrent Units (GRU), sketched below. Deep recurrent architectures [Figure: a two-layer stack, LSTM(1) feeding LSTM(2) at every timestep]. Recursive (tree-structured) neural networks. Multiplicative interactions. Generative recurrent architectures.
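A minimal numpy sketch of a GRU step, one of the variants listed above: like the coupled-gate idea, a single update gate z interpolates between keeping the old state and writing the candidate, and there is no separate cell state. Sizes and random parameters are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 3
Uz, Ur, Uh = (rng.normal(0, 0.1, (D, H)) for _ in range(3))
Wz, Wr, Wh = (rng.normal(0, 0.1, (H, H)) for _ in range(3))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gru_step(x_t, m_prev):
    z = sigmoid(x_t @ Uz + m_prev @ Wz)               # update gate (coupled keep/write)
    r = sigmoid(x_t @ Ur + m_prev @ Wr)               # reset gate
    m_tilde = np.tanh(x_t @ Uh + (r * m_prev) @ Wh)   # candidate state
    return (1 - z) * m_prev + z * m_tilde             # single state, no separate cell

m = np.zeros(H)
for x in rng.normal(size=(5, D)):  # a 5-step toy sequence
    m = gru_step(x, m)
print(m)
```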
Take-away message: Recurrent Neural Networks (RNN) for sequences; Backpropagation Through Time; vanishing and exploding gradients and their remedies; RNNs using Long Short-Term Memory (LSTM); applications of Recurrent Neural Networks.
Thank you!