Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) Yuan YAO HKUST
Summary. We have shown: first-order optimization methods: GD (BP), SGD, Nesterov, Adagrad, Adam, RMSProp, etc.; second-order optimization methods: L-BFGS; regularization methods: penalties (L2/L1/Elastic Net), Dropout, Batch Normalization, data augmentation, etc.; CNN architectures: LeNet-5, AlexNet, VGG, GoogLeNet, ResNet. Now: Recurrent Neural Networks and LSTM. Reference: Fei-Fei Li, Stanford cs231n
AlexNet and VGG have tons of parameters in the fully connected layers. AlexNet: ~62M parameters. FC6: 256x6x6 -> 4096: 38M params; FC7: 4096 -> 4096: 17M params; FC8: 4096 -> 1000: 4M params. That is ~59M params in the FC layers alone! ResNet allows deep networks with a small number of params.
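These FC parameter counts are easy to reproduce; a quick back-of-the-envelope check in Python (weights only, biases ignored):

```python
# Sanity check of the AlexNet fully-connected parameter counts (weights only).
fc6 = 256 * 6 * 6 * 4096   # flattened conv features -> 4096
fc7 = 4096 * 4096          # 4096 -> 4096
fc8 = 4096 * 1000          # 4096 -> 1000 classes
print(f"FC6: {fc6/1e6:.1f}M, FC7: {fc7/1e6:.1f}M, FC8: {fc8/1e6:.1f}M")
print(f"Total FC params: {(fc6 + fc7 + fc8)/1e6:.1f}M")   # ~58.6M of ~62M
```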
Recurrent Neural Networks
Vanilla Neural Networks
Recurrent Neural Networks: Process Sequences e.g. Image Captioning image -> sequence of words
Recurrent Neural Networks: Process Sequences e.g. Sentiment Classification sequence of words -> sentiment
Recurrent Neural Networks: Process Sequences e.g. Machine Translation seq of words -> seq of words
Recurrent Neural Networks: Process Sequences e.g. Video classification on frame level
Sequential Processing of Non-Sequence Data Classify images by taking a series of glimpses Ba, Mnih, and Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015. Gregor et al, DRAW: A Recurrent Neural Network For Image Generation, ICML 2015 Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.
Recurrent Neural Network: an RNN block with internal recurrent state processes an input x; we usually want to predict an output vector y at some time steps.
We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} the old state, x_t the input vector at this time step, and f_W some function with parameters W.
We can process a sequence of vectors x by applying the recurrence h_t = f_W(h_{t-1}, x_t) at every time step. Notice: the same function and the same set of parameters are used at every time step.
Vanilla Recurrent Neural Networks: these are state-space equations of a feedback dynamical system. The state consists of a single hidden vector h: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t. Or, y_t = softmax(W_hy h_t).
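As a concrete illustration, a minimal NumPy sketch of one vanilla RNN step, with weight names Wxh, Whh, Why following the slide's notation (toy sizes and random initialization are illustrative):

```python
import numpy as np

def rnn_step(x, h_prev, Wxh, Whh, Why, bh, by):
    """One vanilla RNN step: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh), y_t = Why h_t + by."""
    h = np.tanh(Wxh @ x + Whh @ h_prev + bh)   # new hidden state
    y = Why @ h + by                           # unnormalized scores (apply softmax for probabilities)
    return h, y

# Toy dimensions: input size D, hidden size H, output size V
D, H, V = 4, 8, 4
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, D)) * 0.01
Whh = rng.standard_normal((H, H)) * 0.01
Why = rng.standard_normal((V, H)) * 0.01
bh, by = np.zeros(H), np.zeros(V)
h, y = rnn_step(rng.standard_normal(D), np.zeros(H), Wxh, Whh, Why, bh, by)
```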
RNN: Computational Graph. [unrolled graph: h_0 -> f_W -> h_1 -> f_W -> h_2 -> f_W -> h_3 -> ... -> h_T, with inputs x_1, x_2, x_3, ...]
RNN: Computational Graph (a time-invariant system): re-use the same weight matrix W at every time step. [same unrolled graph, with W feeding every f_W node]
RNN: Computational Graph: Many to Many. Outputs added: each hidden state h_t produces an output y_t (y_1, y_2, y_3, ..., y_T).
RNN: Computational Graph: Many to Many. Loss modules added: each output y_t has its own loss L_t, and the total loss L is the sum of the per-step losses L_1, ..., L_T.
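Tying the unrolled graph together, a hedged NumPy sketch of the many-to-many forward pass: the same weights are reused at every step, each step emits scores y_t, and the total loss is the sum of per-step cross-entropy losses (function and variable names are illustrative):

```python
import numpy as np

def softmax_cross_entropy(scores, target):
    """Cross-entropy loss of softmax(scores) against an integer target class."""
    scores = scores - scores.max()
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[target]

def rnn_forward(xs, targets, h0, Wxh, Whh, Why, bh, by):
    """Unroll the shared-weight recurrence over time, producing y_t and a loss L_t at every step."""
    h, total_loss = h0, 0.0
    for x, t in zip(xs, targets):
        h = np.tanh(Wxh @ x + Whh @ h + bh)        # same f_W at every time step
        y = Why @ h + by                           # per-step output scores
        total_loss += softmax_cross_entropy(y, t)  # L = sum_t L_t
    return total_loss, h
```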
RNN: Computational Graph: Many to One. A single output y is produced from the final hidden state h_T.
RNN: Computational Graph: One to Many. A single input x produces a sequence of outputs y_1, y_2, y_3, ..., y_T.
Sequence to Sequence: Many-to-one + one-to-many. Many to one: encode the input sequence in a single vector (encoder with weights W_1). One to many: produce the output sequence from that single vector (decoder with weights W_2).
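A skeletal encoder-decoder sketch under the same assumptions; names are illustrative, and a practical decoder would also feed the previously generated output back in as input at each step:

```python
import numpy as np

def encode(xs, h0, We_xh, We_hh, be):
    """Many-to-one: run the encoder RNN and keep only the final hidden vector."""
    h = h0
    for x in xs:
        h = np.tanh(We_xh @ x + We_hh @ h + be)
    return h                                    # single-vector summary of the whole input sequence

def decode(context, steps, Wd_hh, Wd_hy, bd, by):
    """One-to-many: unroll the decoder RNN from the encoder's summary vector."""
    h, ys = context, []
    for _ in range(steps):
        h = np.tanh(Wd_hh @ h + bd)             # (a real decoder would also consume y_{t-1} here)
        ys.append(Wd_hy @ h + by)
    return ys
```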
Example: Character-level Language Model. Vocabulary: [h, e, l, o]. Example training sequence: "hello".
Example: Character-level Language Model, Sampling. Vocabulary: [h, e, l, o]. At test time, sample characters one at a time and feed each sample back into the model as the next input. [figure: softmax output at each step, e.g. (.03, .13, .00, .84) -> sample 'e', then 'l', 'l', 'o', spelling out "ello" after the seed 'h']
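A minimal sampling loop in this spirit (min-char-rnn style); the weight matrices are assumed to come from an already-trained model:

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']

def sample_chars(seed_ix, n, h, Wxh, Whh, Why, bh, by, rng=np.random.default_rng(0)):
    """Sample n characters one at a time, feeding each sample back in as the next input."""
    x = np.zeros(len(vocab)); x[seed_ix] = 1              # one-hot encoding of the seed character
    out = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        scores = Why @ h + by
        p = np.exp(scores - scores.max()); p /= p.sum()   # softmax over the vocabulary
        ix = rng.choice(len(vocab), p=p)                  # sample from the distribution
        out.append(vocab[ix])
        x = np.zeros(len(vocab)); x[ix] = 1               # feed the sampled character back as input
    return ''.join(out)
```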
Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.
Truncated Backpropagation through time: run forward and backward through chunks of the sequence instead of the whole sequence.
Truncated Backpropagation through time: carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
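A hedged sketch of truncated BPTT in PyTorch-style code; `model`, `criterion`, `optimizer`, and the data tensors are assumptions, not part of the slides. The key trick is detaching the hidden state at chunk boundaries so the state is carried forward but gradients stop there:

```python
def train_truncated(model, data, targets, criterion, optimizer, chunk_len, h):
    """One pass of truncated BPTT over a long sequence (PyTorch tensors assumed)."""
    for start in range(0, data.size(0), chunk_len):
        x = data[start:start + chunk_len]
        y = targets[start:start + chunk_len]
        h = h.detach()                     # carry the state forward, cut the gradient here
        out, h = model(x, h)               # forward through one chunk only
        loss = criterion(out.view(-1, out.size(-1)), y.view(-1))
        optimizer.zero_grad()
        loss.backward()                    # backprop only through this chunk
        optimizer.step()
    return h
```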
Example: text -> RNN: train a character-level RNN language model on raw text. Code: https://gist.github.com/karpathy/d4dee566867f8291f086
Samples from the model at several stages of training: at first, then after training more and more, the generated text becomes increasingly coherent.
Image Captioning. Figure from Karpathy et al., Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes. References: Explain Images with Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Show and Tell: A Neural Image Caption Generator, Vinyals et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.; Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick.
Image captioning combines a Convolutional Neural Network (to encode the image) with a Recurrent Neural Network (to generate the caption).
Test image: the CNN's image feature vector v is injected into the RNN at every step. Before: h = tanh(Wxh * x + Whh * h). Now: h = tanh(Wxh * x + Whh * h + Wih * v). The first input x0 is a special <START> token.
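A one-line NumPy sketch of that modified recurrence (Wih and the other names mirror the slide; v is the CNN's image feature vector):

```python
import numpy as np

def caption_rnn_step(x, h_prev, v, Wxh, Whh, Wih, bh):
    """One captioning RNN step: the image feature v enters via an extra term Wih @ v."""
    return np.tanh(Wxh @ x + Whh @ h_prev + Wih @ v + bh)
```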
At test time, sample a word from y0 (e.g. "straw"), feed it back in as the next input, sample the next word from y1 (e.g. "hat"), and so on; sampling the <END> token finishes the caption.
Image Captioning: Example Results. Captions generated using neuraltalk2. All images are CC0 public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle. "A cat sitting on a suitcase on the floor"; "Two people walking on the beach with surfboards"; "A cat is sitting on a tree branch"; "A tennis player in action on the court"; "A dog is running in the grass with a frisbee"; "Two giraffes standing in a grassy field"; "A white teddy bear sitting in the grass"; "A man riding a dirt bike on a dirt track".
Image Captioning: Failure Cases. Captions generated using neuraltalk2. All images are CC0 public domain: fur coat, handstand, spider web, baseball. "A bird is perched on a tree branch"; "A woman is holding a cat in her hand"; "A man in a baseball uniform throwing a ball"; "A person holding a computer mouse on a desk"; "A woman standing on a beach holding a surfboard".
Vanilla RNN Gradient Flow. Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 1994. Pascanu et al., On the difficulty of training recurrent neural networks, ICML 2013. [cell diagram: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to give h_t]
Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T).
Computing the gradient of h_0 therefore involves many factors of W (and repeated tanh).
If the largest singular value of W is > 1: exploding gradients. If the largest singular value is < 1: vanishing gradients.
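A toy NumPy demonstration of this effect (not the actual RNN backward pass): backprop through T steps multiplies the upstream gradient by W^T once per step, so its norm grows or shrinks geometrically depending on the singular values of W. Here W is a scaled random orthogonal matrix so that every singular value equals `scale` exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 50
Q, _ = np.linalg.qr(rng.standard_normal((H, H)))   # random orthogonal matrix
g0 = rng.standard_normal(H)
for scale in [1.1, 0.9]:                           # singular values just above vs just below 1
    W = scale * Q
    g = g0.copy()
    for _ in range(T):
        g = W.T @ g          # one factor of W^T per time step (tanh factors would shrink it further)
    print(scale, np.linalg.norm(g))   # ~1.1**50 ≈ 117x vs ~0.9**50 ≈ 0.005x the original norm
```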
Exploding gradients: use gradient clipping, i.e. scale the gradient if its norm is too big.
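A simple clipping sketch; the threshold 5.0 is an arbitrary illustrative choice:

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale all gradients together if their global norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```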
Vanishing gradients: change the RNN architecture.
Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM). Vanilla RNN: h_t = tanh(W [h_{t-1}; x_t]). LSTM: i, f, o = sigmoid(.), g = tanh(.), all computed from W [h_{t-1}; x_t]; c_t = f * c_{t-1} + i * g; h_t = o * tanh(c_t). Hochreiter and Schmidhuber, Long Short-Term Memory, Neural Computation 1997.
Long Short Term Memory (LSTM) [Hochreiter et al., 1997]. The vector from below (x) and the vector from before (h) are stacked (size 2h) and multiplied by a 4h x 2h matrix W, giving four h-dimensional vectors through three sigmoids and one tanh. f: forget gate, whether to erase the cell; i: input gate, whether to write to the cell; g: "gate gate" (?), how much to write to the cell; o: output gate, how much to reveal the cell.
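Putting the four gates together, a minimal NumPy sketch of one LSTM step; a single matrix W of shape (4H, D+H) produces all four pre-activations from the stacked [x; h_prev] vector, as on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step with a single weight matrix W of shape (4H, D + H)."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev]) + b        # (4H,) pre-activations
    i = sigmoid(a[0:H])                            # input gate: whether to write to the cell
    f = sigmoid(a[H:2*H])                          # forget gate: whether to erase the cell
    o = sigmoid(a[2*H:3*H])                        # output gate: how much cell to reveal
    g = np.tanh(a[3*H:4*H])                        # candidate values: how much to write
    c = f * c_prev + i * g                         # additive cell update
    h = o * np.tanh(c)
    return h, c
```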
[LSTM cell diagram: h_{t-1} and x_t are stacked and multiplied by W to produce f, i, g, o; c_t = f * c_{t-1} + i * g; h_t = o * tanh(c_t)]
Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]. Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f, with no matrix multiplication by W.
Uninterrupted gradient flow along the cell states c_0 -> c_1 -> c_2 -> c_3 -> ... Similar to the identity shortcuts in ResNet! In between: Highway Networks. Srivastava et al., Highway Networks, ICML DL Workshop 2015.
Multilayer RNNs (e.g. stacked LSTMs): stack recurrent layers in depth as well as unrolling them in time; the hidden states of one layer are the inputs to the layer above.
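A small sketch of one time step of a stacked (multilayer) vanilla RNN, with illustrative parameter names:

```python
import numpy as np

def multilayer_rnn_step(x, hs_prev, params):
    """One time step of a stacked RNN: layer l takes the hidden state of layer l-1
    (below it) as input, plus its own previous hidden state from the last time step."""
    inp, hs = x, []
    for (Wxh, Whh, bh), h_prev in zip(params, hs_prev):
        h = np.tanh(Wxh @ inp + Whh @ h_prev + bh)
        hs.append(h)
        inp = h          # this layer's hidden state feeds the next layer up
    return hs
```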
Other RNN Variants: GRU [Learning phrase representations using RNN encoder-decoder for statistical machine translation, Cho et al., 2014]; [An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015]; [LSTM: A Search Space Odyssey, Greff et al., 2015]
Image Captioning with Attention: the RNN focuses its attention at a different spatial location when generating each word. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Image Captioning with Attention: the CNN turns the image (H x W x 3) into a grid of features (L x D); from h0 the model computes a distribution a1 over the L locations. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
The distribution a1 over the L locations is used to form a weighted combination of the features, giving a single weighted feature vector z1 of dimension D. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
The weighted feature vector z1 and the first word y1 are fed into the RNN to compute h1. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
h1 produces two outputs: a new distribution a2 over the L locations and a distribution d1 over the vocabulary (the next word). Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
The process repeats: a2 gives weighted features z2, which together with the word y2 produce h2, which in turn outputs a3 and d2, and so on. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015
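A minimal NumPy sketch of the soft-attention step described above; the scoring function (a single matrix Wa) is a simplifying assumption for illustration, not the exact form used in the paper:

```python
import numpy as np

def soft_attention(features, h, Wa):
    """Soft attention sketch: score each of the L feature vectors (rows of `features`,
    shape (L, D)) against the hidden state h, turn the scores into a distribution a
    over locations, and return the weighted combination z = sum_l a_l * features_l."""
    scores = features @ (Wa @ h)                     # (L,) one score per location
    a = np.exp(scores - scores.max()); a /= a.sum()  # distribution over L locations
    z = a @ features                                 # weighted features: (D,)
    return z, a
```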
Image Captioning with Attention: soft attention vs. hard attention. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Image Captioning with Attention. Xu et al, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.
Summary. RNNs allow a lot of flexibility in architecture design. Vanilla RNNs are simple but don't work very well. It is common to use LSTM or GRU: their additive interactions improve gradient flow. The backward flow of gradients in an RNN can explode or vanish: exploding is controlled with gradient clipping, vanishing is controlled with additive interactions (LSTM). Better/simpler architectures are a hot topic of current research. Better understanding (both theoretical and empirical) is needed.
Thank you!