
EE-559 Deep learning, 11.2. LSTM and GRU. François Fleuret, https://fleuret.org/ee559/, Mon Feb 18 13:33:24 UTC 2019. École Polytechnique Fédérale de Lausanne.

The Long Short-Term Memory unit (LSTM) by Hochreiter and Schmidhuber (1997) is a recurrent network with a gating of the form

$$c_t = c_{t-1} + i_t \odot g_t$$

where $c_t$ is a recurrent state, $i_t$ is a gating function and $g_t$ is a full update. Because the state is updated additively, gradients propagate through time along this path without being repeatedly multiplied by Jacobians, which ensures that the derivatives of the loss with respect to $c_t$ do not vanish.

It is noteworthy that this model, implemented 20 years before the resnets of He et al. (2015), uses the exact same strategy to deal with depth.

This original architecture was improved with a forget gate (Gers et al., 2000), resulting in the standard LSTM in use today. In what follows we consider the notation and variant from Jozefowicz et al. (2015).

The recurrent state is composed of a cell state $c_t$ and an output state $h_t$. Gate $f_t$ modulates whether the cell state should be forgotten, $i_t$ whether the new update should be taken into account, and $o_t$ whether the output state should be reset.

$$
\begin{aligned}
f_t &= \operatorname{sigm}\bigl(W^{(x f)} x_t + W^{(h f)} h_{t-1} + b^{(f)}\bigr) &&\text{(forget gate)}\\
i_t &= \operatorname{sigm}\bigl(W^{(x i)} x_t + W^{(h i)} h_{t-1} + b^{(i)}\bigr) &&\text{(input gate)}\\
g_t &= \tanh\bigl(W^{(x c)} x_t + W^{(h c)} h_{t-1} + b^{(c)}\bigr) &&\text{(full cell state update)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t &&\text{(cell state)}\\
o_t &= \operatorname{sigm}\bigl(W^{(x o)} x_t + W^{(h o)} h_{t-1} + b^{(o)}\bigr) &&\text{(output gate)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(output state)}
\end{aligned}
$$

As pointed out by Gers et al. (2000), the forget bias $b^{(f)}$ should be initialized with large values so that initially $f_t \simeq 1$ and the gating has no effect.

This model was extended by Gers et al. (2003) with "peephole connections" that allow the gates to depend on $c_{t-1}$.
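To make these equations concrete, here is a minimal sketch of a single LSTM update written directly from the formulas above. The parameter names (W_xf, b_f, etc.) are illustrative only, not the course's code.

import torch

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    # W_x, W_h and b are tuples holding the parameters of the four affine
    # maps, in the order (forget, input, full update, output)
    W_xf, W_xi, W_xc, W_xo = W_x
    W_hf, W_hi, W_hc, W_ho = W_h
    b_f, b_i, b_c, b_o = b

    f_t = torch.sigmoid(x_t @ W_xf.t() + h_prev @ W_hf.t() + b_f)  # forget gate
    i_t = torch.sigmoid(x_t @ W_xi.t() + h_prev @ W_hi.t() + b_i)  # input gate
    g_t = torch.tanh(x_t @ W_xc.t() + h_prev @ W_hc.t() + b_c)     # full cell state update
    o_t = torch.sigmoid(x_t @ W_xo.t() + h_prev @ W_ho.t() + b_o)  # output gate
    c_t = f_t * c_prev + i_t * g_t                                 # cell state
    h_t = o_t * torch.tanh(c_t)                                    # output state

    return h_t, c_t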

[Figure: a single LSTM cell unrolled over two time steps, with input $x_t$, cell state $c^1_t$, output state $h^1_t$, and a read-out $\Psi$ producing $y_{t-1}$ and $y_t$.]

Prediction is done from the $h_t$ state, hence its name of "output state".

Several such cells can be combined to create a multi-layer LSTM.

[Figure: two-layer LSTM, where the output state $h^1_t$ of the first LSTM cell is the input of the second cell, whose output state $h^2_t$ feeds the read-out $\Psi$ producing $y_t$.]

PyTorch's torch.nn.LSTM implements this model. It processes several sequences and returns two tensors, with $D$ the number of layers and $T$ the sequence length: the outputs of all the layers at the last time step, $h^1_T, \dots, h^D_T$, and the outputs of the last layer at each time step, $h^D_1, \dots, h^D_T$.

The initial recurrent states $h^1_0, \dots, h^D_0$ and $c^1_0, \dots, c^D_0$ can also be specified.
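For instance, with made-up dimensions (input of dimension 4, hidden size 8, two layers), the shapes of the returned values are as follows; the second return value is a pair holding both the output states and the cell states at the last time step.

import torch
from torch import nn

lstm = nn.LSTM(input_size = 4, hidden_size = 8, num_layers = 2)

x = torch.randn(5, 3, 4)          # 3 sequences of length 5, dimension 4

output, (h_n, c_n) = lstm(x)

print(output.size())              # torch.Size([5, 3, 8]): last layer, every time step
print(h_n.size())                 # torch.Size([2, 3, 8]): every layer, last time step
print(c_n.size())                 # torch.Size([2, 3, 8]): cell states, last time step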

PyTorch's RNNs can process batches of sequences of the same length, which can be encoded in a regular tensor, or batches of sequences of various lengths using the type nn.utils.rnn.PackedSequence.

Such an object can be created with nn.utils.rnn.pack_padded_sequence:

>>> from torch.nn.utils.rnn import pack_padded_sequence
>>> pack_padded_sequence(torch.tensor([[[ 1. ], [ 2. ]],
...                                    [[ 3. ], [ 4. ]],
...                                    [[ 5. ], [ 0. ]]]),
...                      [3, 2])
PackedSequence(data=tensor([[ 1.],
        [ 2.],
        [ 3.],
        [ 4.],
        [ 5.]]), batch_sizes=tensor([ 2,  2,  1]))

The sequences must be sorted by decreasing lengths.

nn.utils.rnn.pad_packed_sequence converts back to a padded tensor.
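A packed batch can be fed directly to an RNN, whose output is packed as well. A minimal sketch with made-up dimensions:

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

x = torch.tensor([[[ 1. ], [ 2. ]],
                  [[ 3. ], [ 4. ]],
                  [[ 5. ], [ 0. ]]])

packed = pack_padded_sequence(x, [3, 2])

lstm = nn.LSTM(input_size = 1, hidden_size = 4)
packed_output, _ = lstm(packed)

# Convert back to a padded tensor and the original lengths
output, lengths = pad_packed_sequence(packed_output)
print(output.size())              # torch.Size([3, 2, 4])
print(lengths)                    # tensor([3, 2])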

import torch
from torch import nn
from torch.nn import functional as F

class LSTMNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(LSTMNet, self).__init__()
        self.lstm = nn.LSTM(input_size = dim_input,
                            hidden_size = dim_recurrent,
                            num_layers = num_layers)
        self.fc_o2y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Makes this a batch of size 1
        input = input.unsqueeze(1)
        # Get the activations of the last layer at every time step
        output, _ = self.lstm(input)
        # Drop the batch index
        output = output.squeeze(1)
        # Keep only the last time step
        output = output[output.size(0) - 1:output.size(0)]
        return self.fc_o2y(F.relu(output))
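As a usage sketch, with made-up dimensions and assuming the class and imports above, a sequence is encoded as a T x dim_input tensor and the model returns a single prediction for it:

model = LSTMNet(dim_input = 10, dim_recurrent = 32, num_layers = 2, dim_output = 2)

x = torch.randn(50, 10)           # one sequence of length 50
y = model(x)
print(y.size())                   # torch.Size([1, 2])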

[Plot: error (0 to 0.5) vs. number of sequences seen (0 to 250,000), comparing the baseline with gating and the LSTM.]

[Plot: error (0 to 0.5) vs. sequence length (2 to 20), comparing the baseline with gating and the LSTM.]

The LSTM was simplified into the Gated Recurrent Unit (GRU) by Cho et al. (2014), with a gating for the recurrent state and a reset gate.

$$
\begin{aligned}
r_t &= \operatorname{sigm}\bigl(W^{(x r)} x_t + W^{(h r)} h_{t-1} + b^{(r)}\bigr) &&\text{(reset gate)}\\
z_t &= \operatorname{sigm}\bigl(W^{(x z)} x_t + W^{(h z)} h_{t-1} + b^{(z)}\bigr) &&\text{(forget gate)}\\
\bar h_t &= \tanh\bigl(W^{(x h)} x_t + W^{(h h)} (r_t \odot h_{t-1}) + b^{(h)}\bigr) &&\text{(full update)}\\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \bar h_t &&\text{(hidden update)}
\end{aligned}
$$
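Written directly from these equations, one GRU update step could be sketched as follows (parameter names are illustrative only):

import torch

def gru_step(x_t, h_prev, W_x, W_h, b):
    W_xr, W_xz, W_xh = W_x
    W_hr, W_hz, W_hh = W_h
    b_r, b_z, b_h = b

    r_t = torch.sigmoid(x_t @ W_xr.t() + h_prev @ W_hr.t() + b_r)          # reset gate
    z_t = torch.sigmoid(x_t @ W_xz.t() + h_prev @ W_hz.t() + b_z)          # forget gate
    h_bar = torch.tanh(x_t @ W_xh.t() + (r_t * h_prev) @ W_hh.t() + b_h)   # full update
    h_t = z_t * h_prev + (1 - z_t) * h_bar                                 # hidden update

    return h_t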

class GRUNet(nn.Module):
    def __init__(self, dim_input, dim_recurrent, num_layers, dim_output):
        super(GRUNet, self).__init__()
        self.gru = nn.GRU(input_size = dim_input,
                          hidden_size = dim_recurrent,
                          num_layers = num_layers)
        self.fc_y = nn.Linear(dim_recurrent, dim_output)

    def forward(self, input):
        # Make this a batch of size 1
        input = input.unsqueeze(1)
        # Get the activations of all layers at the last time step
        _, output = self.gru(input)
        # Drop the batch index
        output = output.squeeze(1)
        # Keep only the last layer
        output = output[output.size(0) - 1:output.size(0)]
        return self.fc_y(F.relu(output))

[Plot: error (0 to 0.5) vs. number of sequences seen (0 to 250,000), comparing the baseline with gating, the LSTM, and the GRU.]

[Plot: error (0 to 0.5) vs. sequence length (2 to 20), comparing the baseline with gating, the LSTM, and the GRU.]

The specific form of these units prevents the gradient from vanishing, but it may still be excessively large on certain mini-batches.

The standard strategy to solve this issue is gradient norm clipping (Pascanu et al., 2013), which consists of re-scaling the norm of the gradient to a fixed threshold $\delta$ when it is above:

$$\tilde{\nabla} f = \frac{\nabla f}{\|\nabla f\|}\,\min\bigl(\|\nabla f\|, \delta\bigr).$$
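A direct, manual implementation of this re-scaling over a list of parameters could look like the following sketch; the built-in function shown next does the same job.

import torch

def clip_gradient_norm(parameters, delta):
    # Norm of the full gradient, concatenated over all the parameters
    norm = torch.cat([ p.grad.view(-1) for p in parameters ]).norm()
    # Re-scale only if the norm exceeds the threshold
    if norm > delta:
        for p in parameters:
            p.grad.mul_(delta / norm)
    return norm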

The function torch.nn.utils.clip_grad_norm_ applies this operation to the gradients of a model, as specified by an iterable of its parameters:

>>> x = torch.empty(10)
>>> x.grad = x.new(x.size()).normal_()
>>> y = torch.empty(5)
>>> y.grad = y.new(y.size()).normal_()
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 5.0)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(4.0303)
>>> torch.nn.utils.clip_grad_norm_((x, y), 1.25)
tensor(4.0303)
>>> torch.cat((x.grad, y.grad)).norm()
tensor(1.2500)

Jozefowicz et al. (2015) conducted an extensive exploration of different recurrent architectures through meta-optimization, and even though some units simpler than the LSTM or the GRU perform well, they wrote:

"We have evaluated a variety of recurrent neural network architectures in order to find an architecture that reliably out-performs the LSTM. Though there were architectures that outperformed the LSTM on some problems, we were unable to find an architecture that consistently beat the LSTM and the GRU in all experimental conditions." (Jozefowicz et al., 2015)

The end

References

K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.

F. A. Gers, J. A. Schmidhuber, and F. A. Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.

F. A. Gers, N. N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research (JMLR), 3:115-143, 2003.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network architectures. In International Conference on Machine Learning (ICML), pages 2342-2350, 2015.

R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), 2013.