Recurrent Neural Networks (RNN) and Long-Short-Term-Memory (LSTM) Yuan YAO HKUST


Summary of what we have covered so far: first-order optimization methods: GD (BP), SGD, Nesterov, Adagrad, Adam, RMSProp, etc.; second-order optimization methods: L-BFGS; regularization methods: penalties (L2/L1/Elastic Net), Dropout, Batch Normalization, data augmentation, etc.; CNN architectures: LeNet-5, AlexNet, VGG, GoogLeNet, ResNet. Now: Recurrent Neural Networks and LSTM. Reference: Fei-Fei Li, Stanford cs231n.

AlexNet and VGG have tons of parameters in the fully connected layers. AlexNet: ~62M parameters in total; FC6: 256x6x6 -> 4096: 38M params; FC7: 4096 -> 4096: 17M params; FC8: 4096 -> 1000: 4M params; ~59M params in the FC layers alone! ResNet allows deep networks with a small number of parameters.
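
As a quick check on those numbers, the weight counts can be reproduced directly (biases ignored; a rough back-of-the-envelope sketch, not an exact accounting of any particular AlexNet variant):

```python
# Approximate weight counts for AlexNet's fully connected layers.
fc6 = 256 * 6 * 6 * 4096   # ~37.7M (flattened conv features -> 4096)
fc7 = 4096 * 4096          # ~16.8M
fc8 = 4096 * 1000          # ~4.1M
print(f"FC6={fc6:,}  FC7={fc7:,}  FC8={fc8:,}  FC total={fc6 + fc7 + fc8:,}")  # ~58.7M
```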

Recurrent Neural Networks

Vanilla Neural Networks: a single fixed-size input mapped to a single fixed-size output (one to one).

Recurrent Neural Networks: Process Sequences e.g. Image Captioning image -> sequence of words

Recurrent Neural Networks: Process Sequences e.g. Sentiment Classification sequence of words -> sentiment

Recurrent Neural Networks: Process Sequences e.g. Machine Translation seq of words -> seq of words

Recurrent Neural Networks: Process Sequences e.g. Video classification on frame level

Sequential Processing of Non-Sequence Data Classify images by taking a series of glimpses Ba, Mnih, and Kavukcuoglu, Multiple Object Recognition with Visual Attention, ICLR 2015. Gregor et al, DRAW: A Recurrent Neural Network For Image Generation, ICML 2015 Figure copyright Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, 2015. Reproduced with permission.

Recurrent Neural Network: an input x is fed into an RNN block whose internal state is updated over time, and we usually want to predict an output vector y at some (or all) time steps.

We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at time step t, and f_W is some function with parameters W.

Notice: the same function f_W and the same set of parameters W are used at every time step.

Vanilla Recurrent Neural Networks: state-space equations in feedback dynamical systems. The state consists of a single hidden vector h: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t. Or, y_t = softmax(W_hy h_t).
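
A minimal NumPy sketch of this vanilla recurrence; the dimensions, random weights, and function names here are illustrative assumptions, not the lecture's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy dimensions: input x_t in R^10, hidden h_t in R^20, output y_t over 5 classes.
D_in, D_h, D_out = 10, 20, 5
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((D_h, D_in)) * 0.01
W_hh = rng.standard_normal((D_h, D_h)) * 0.01
W_hy = rng.standard_normal((D_out, D_h)) * 0.01

def rnn_step(x_t, h_prev):
    """One step of the vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = softmax(W_hy @ h_t)                # y_t = softmax(W_hy h_t)
    return h_t, y_t

h = np.zeros(D_h)
for x_t in rng.standard_normal((7, D_in)):   # a length-7 input sequence
    h, y = rnn_step(x_t, h)                  # same f_W and same W at every step
```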

RNN: Computational Graph. Unroll the recurrence over time: starting from h_0, apply f_W with input x_1 to get h_1, then with x_2 to get h_2, and so on up to h_T.

Time-invariant systems: the same weight matrix W is re-used at every time step.

RNN: Computational Graph: Many to Many. Outputs are added: each hidden state h_t produces an output y_t.

Loss modules: each output y_t gets its own loss L_t, and the total loss L is their sum.

RNN: Computational Graph: Many to One. A single output y is read off the final hidden state h_T.

RNN: Computational Graph: One to Many. A single input x produces a whole output sequence y_1, ..., y_T.

Sequence to Sequence: Many-to-one + One-to-many. Many to one: an encoder (weights W_1) encodes the input sequence x_1, x_2, x_3, ... into a single vector. One to many: a decoder (weights W_2) produces the output sequence y_1, y_2, ... from that single input vector.
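
A minimal sketch of this encode-then-decode pattern (toy dimensions; the stacked [h; x] weight layout and the names W1, W2 are my own simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_h, D_out, T_in, T_out = 8, 16, 8, 5, 4
W1 = rng.standard_normal((D_h, D_h + D_in)) * 0.1   # encoder weights (many-to-one)
W2 = rng.standard_normal((D_h, D_h)) * 0.1          # decoder weights (one-to-many)
W_hy = rng.standard_normal((D_out, D_h)) * 0.1

# Encoder: fold the whole input sequence into a single vector.
h = np.zeros(D_h)
for x_t in rng.standard_normal((T_in, D_in)):
    h = np.tanh(W1 @ np.concatenate([h, x_t]))

# Decoder: unroll an output sequence from that single vector alone.
outputs = []
for _ in range(T_out):
    h = np.tanh(W2 @ h)
    outputs.append(W_hy @ h)
```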

Example: Character-level Language Model Vocabulary: [h,e,l,o] Example training sequence: hello

Example: Character-level Language Model Sampling. Vocabulary: [h, e, l, o]. At test time, sample characters one at a time and feed each sample back into the model: for the input 'h' the softmax output might be [.03, .13, .00, .84], from which 'e' is sampled; continuing this way yields 'l' (from [.25, .20, .05, .50]), then 'l' (from [.11, .17, .68, .03]), then 'o' (from [.11, .02, .08, .79]).
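
A hedged sketch of that test-time sampling loop over the [h, e, l, o] vocabulary (the weights here are random, so the output is meaningless; with trained weights this is the loop the slide describes):

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
V, D_h = len(vocab), 16
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((D_h, V)) * 0.1
W_hh = rng.standard_normal((D_h, D_h)) * 0.1
W_hy = rng.standard_normal((V, D_h)) * 0.1

def one_hot(i):
    x = np.zeros(V); x[i] = 1.0
    return x

h, idx, out = np.zeros(D_h), vocab.index('h'), []
for _ in range(10):
    h = np.tanh(W_xh @ one_hot(idx) + W_hh @ h)
    p = np.exp(W_hy @ h); p /= p.sum()        # softmax over the 4 characters
    idx = rng.choice(V, p=p)                  # sample one character...
    out.append(vocab[idx])                    # ...and feed it back at the next step
print(''.join(out))
```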

Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.

Truncated backpropagation through time: run forward and backward through chunks of the sequence instead of the whole sequence.

Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
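
One standard way to implement this is to detach the carried hidden state between chunks so the computational graph is cut there; a minimal PyTorch-flavoured sketch with toy dimensions and dummy data (my assumptions, not the lecture's code):

```python
import torch
import torch.nn as nn

seq_len, chunk, batch, in_dim, hid = 1000, 25, 4, 8, 32
x = torch.randn(seq_len, batch, in_dim)      # the whole input sequence
y = torch.randn(seq_len, batch, hid)         # dummy regression targets

rnn = nn.RNN(in_dim, hid)                    # vanilla RNN layer
opt = torch.optim.Adam(rnn.parameters(), lr=1e-3)

h = torch.zeros(1, batch, hid)               # hidden state carried forward forever
for t in range(0, seq_len, chunk):
    out, h = rnn(x[t:t+chunk], h)            # forward through one chunk
    loss = ((out - y[t:t+chunk]) ** 2).mean()
    opt.zero_grad()
    loss.backward()                          # backpropagate only within the chunk
    opt.step()
    h = h.detach()                           # keep the value, drop its gradient history
```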

Example: Text -> RNN. Feed a text corpus into the RNN character by character and train it to predict the next character. Code: https://gist.github.com/karpathy/d4dee566867f8291f086

At first the samples are gibberish; as we train more, and more, and more, the generated text becomes increasingly coherent.

Image Captioning. Figure from Karpathy et al., Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015; figure copyright IEEE, 2015. Reproduced for educational purposes. Related work: Explain Images with Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Show and Tell: A Neural Image Caption Generator, Vinyals et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.; Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick.

The captioning model combines a Convolutional Neural Network, which encodes the test image, with a Recurrent Neural Network, which generates the caption.

For a test image, the CNN feature vector v is injected into the recurrent update. Before: h = tanh(W_xh * x + W_hh * h). Now: h = tanh(W_xh * x + W_hh * h + W_ih * v). Generation starts from a special <START> token x_0.

From y_0 a word is sampled (e.g. "straw") and fed back in as the next input; from y_1 the next word is sampled (e.g. "hat"), and so on. Sampling the special <END> token finishes the caption.
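
A minimal sketch of that image-conditioned update and a greedy decoding loop (vocabulary size, dimensions, and the greedy argmax choice are illustrative assumptions; the model described above samples and stops at <END>):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D_h, D_img = 1000, 64, 512            # vocab size, hidden size, CNN feature size
W_xh = rng.standard_normal((D_h, V)) * 0.01
W_hh = rng.standard_normal((D_h, D_h)) * 0.01
W_ih = rng.standard_normal((D_h, D_img)) * 0.01
W_hy = rng.standard_normal((V, D_h)) * 0.01

v = rng.standard_normal(D_img)           # stand-in for the CNN image feature vector

def step(x, h, v):
    # now: h = tanh(W_xh x + W_hh h + W_ih v) -- image feature added to the update
    return np.tanh(W_xh @ x + W_hh @ h + W_ih @ v)

h = np.zeros(D_h)
x = np.zeros(V); x[0] = 1.0              # index 0 plays the role of <START>
for _ in range(5):                       # decode a few words greedily
    h = step(x, h, v)
    idx = int(np.argmax(W_hy @ h))       # greedy choice; stop on <END> in practice
    x = np.zeros(V); x[idx] = 1.0        # feed the chosen word back in
```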

Image Captioning: Example Results. Captions generated using neuraltalk2. All images are CC0 Public domain: cat suitcase, cat tree, dog, bear, surfers, tennis, giraffe, motorcycle. Generated captions: "A cat sitting on a suitcase on the floor"; "Two people walking on the beach with surfboards"; "A cat is sitting on a tree branch"; "A tennis player in action on the court"; "A dog is running in the grass with a frisbee"; "Two giraffes standing in a grassy field"; "A white teddy bear sitting in the grass"; "A man riding a dirt bike on a dirt track".

Image Captioning: Failure Cases. Captions generated using neuraltalk2. All images are CC0 Public domain: fur coat, handstand, spider web, baseball. Generated captions: "A bird is perched on a tree branch"; "A woman is holding a cat in her hand"; "A man in a baseball uniform throwing a ball"; "A person holding a computer mouse on a desk"; "A woman standing on a beach holding a surfboard".

Vanilla RNN Gradient Flow (Bengio et al., Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 1994; Pascanu et al., On the difficulty of training recurrent neural networks, ICML 2013). The vanilla cell stacks h_{t-1} and x_t, multiplies by W, and applies tanh to produce h_t.

Backpropagation from h_t to h_{t-1} multiplies by W (more precisely, by W_hh^T).

Computing the gradient of h_0 therefore involves many repeated factors of W (and repeated tanh derivatives). If the largest singular value of W is greater than 1, gradients explode; if it is less than 1, gradients vanish.

Exploding gradients are handled by gradient clipping: scale the gradient if its norm is too big. Vanishing gradients call for a change of RNN architecture.
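
Gradient clipping is easy to state in code; a minimal sketch that rescales by the global L2 norm (the max_norm value and the global-norm convention are assumptions, not something fixed by the slides):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

# Usage with hypothetical gradient arrays:
# dW_xh, dW_hh, dW_hy = clip_gradients([dW_xh, dW_hh, dW_hy])
```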

Long Short Term Memory (LSTM)

Long Short Term Memory (LSTM): the vanilla RNN update is compared with the gated LSTM update, detailed below. Hochreiter and Schmidhuber, Long Short-Term Memory, Neural Computation 1997.

Long Short Term Memory (LSTM) [Hochreiter et al., 1997]. The vector from below (x) and the vector from before (h) are stacked and multiplied by a single 4h x 2h weight matrix W, giving four h-dimensional gates: i, the input gate (sigmoid): whether to write to the cell; f, the forget gate (sigmoid): whether to erase the cell; o, the output gate (sigmoid): how much to reveal the cell; g, the "gate gate" (tanh): how much to write to the cell.

Long Short Term Memory (LSTM) [Hochreiter et al., 1997]. Cell update: c_t = f ⊙ c_{t-1} + i ⊙ g; hidden update: h_t = o ⊙ tanh(c_t).
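
A NumPy sketch of one LSTM step following these equations; the single 4h x 2h matrix W acts on the stacked [h_{t-1}; x_t] (toy dimensions and the gate ordering i, f, o, g are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_x = D_h = 8                             # x and h share dimension h for simplicity
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * D_h, D_h + D_x)) * 0.1   # one 4h x 2h weight matrix

def lstm_step(x_t, h_prev, c_prev):
    z = W @ np.concatenate([h_prev, x_t])             # 4h pre-activations
    i = sigmoid(z[0*D_h:1*D_h])    # input gate: whether to write to the cell
    f = sigmoid(z[1*D_h:2*D_h])    # forget gate: whether to erase the cell
    o = sigmoid(z[2*D_h:3*D_h])    # output gate: how much to reveal the cell
    g = np.tanh(z[3*D_h:4*D_h])    # "gate gate": how much to write to the cell
    c_t = f * c_prev + i * g       # c_t = f ⊙ c_{t-1} + i ⊙ g
    h_t = o * np.tanh(c_t)         # h_t = o ⊙ tanh(c_t)
    return h_t, c_t

h = c = np.zeros(D_h)
for x_t in rng.standard_normal((5, D_x)):
    h, c = lstm_step(x_t, h, c)
```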

Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]. Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by the forget gate f, with no matrix multiplication by W.

Long Short Term Memory (LSTM): Gradient Flow [Hochreiter et al., 1997]. The cell-state path c_0 -> c_1 -> c_2 -> c_3 -> ... gives uninterrupted gradient flow, similar to the identity shortcut connections in ResNet. In between the two sit Highway Networks (Srivastava et al., Highway Networks, ICML DL Workshop 2015).

Multilayer RNNs: LSTM layers can be stacked along the depth dimension, with the recurrence running along the time dimension.

Other RNN Variants: GRU [Learning phrase representations using RNN encoder-decoder for statistical machine translation, Cho et al., 2014]; see also [An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al., 2015] and [LSTM: A Search Space Odyssey, Greff et al., 2015].
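
For comparison, a NumPy sketch of the GRU update from Cho et al. (2014), again with toy dimensions of my own choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_x, D_h = 8, 16
rng = np.random.default_rng(0)
W_z = rng.standard_normal((D_h, D_h + D_x)) * 0.1   # update gate weights
W_r = rng.standard_normal((D_h, D_h + D_x)) * 0.1   # reset gate weights
W_h = rng.standard_normal((D_h, D_h + D_x)) * 0.1   # candidate state weights

def gru_step(x_t, h_prev):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                            # update gate
    r = sigmoid(W_r @ hx)                            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde            # interpolate old and candidate

h = np.zeros(D_h)
for x_t in rng.standard_normal((5, D_x)):
    h = gru_step(x_t, h)
```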

Image Captioning with Attention: the RNN focuses its attention on a different spatial location when generating each word. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention (Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015). The CNN maps the image (H x W x 3) to a grid of features (L x D). From the hidden state h_0 the model produces a distribution a_1 over the L locations; a weighted combination of the features gives a weighted feature vector z_1 (dimension D). Feeding z_1 and the first word y_1 into the RNN yields h_1, which produces both a distribution d_1 over the vocabulary and the next attention distribution a_2; the process repeats with z_2, y_2, h_2, and so on.
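
The heart of soft attention is a softmax over the L locations followed by a weighted sum of the features; a minimal sketch with assumed shapes and a simple linear scoring function standing in for whatever scorer the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, D_h = 49, 512, 256                      # e.g. a 7x7 grid of 512-d CNN features
feats = rng.standard_normal((L, D))           # Features: L x D
h_t = rng.standard_normal(D_h)                # current RNN hidden state
W_att = rng.standard_normal((D, D_h)) * 0.01  # simple linear scorer (an assumption)

scores = feats @ (W_att @ h_t)                # one attention score per location
a_t = np.exp(scores - scores.max())
a_t /= a_t.sum()                              # distribution over the L locations
z_t = a_t @ feats                             # weighted combination of features (D-dim)
```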

Image Captioning with Attention: soft attention (a differentiable weighted average over all locations) vs. hard attention (selecting a single location). Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Image Captioning with Attention: example attention maps and generated captions. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015. Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Summary: RNNs allow a lot of flexibility in architecture design. Vanilla RNNs are simple but don't work very well; it is common to use LSTM or GRU instead, since their additive interactions improve gradient flow. The backward flow of gradients in an RNN can explode or vanish: exploding gradients are controlled with gradient clipping, vanishing gradients with additive interactions. Better and simpler architectures are a hot topic of current research, and better understanding (both theoretical and empirical) is still needed.

Thank you!