Introduction to RNNs!

Introduction to RNNs. Arun Mallya. Best viewed with Computer Modern fonts installed.

Outline:
- Why Recurrent Neural Networks (RNNs)?
- The Vanilla RNN unit
- The RNN forward pass
- Backpropagation refresher
- The RNN backward pass
- Issues with the Vanilla RNN
- The Long Short-Term Memory (LSTM) unit
- The LSTM forward and backward pass
- LSTM variants and tips: Peephole LSTM, GRU

Motivation: Not all problems can be converted into one with fixed-length inputs and outputs. Problems such as speech recognition or time-series prediction require a system to store and use context information. Simple case: output YES if the number of 1s is even, else NO. 1000010101 YES, 100011 NO. It is hard or impossible to choose a fixed context window, since there can always be a new sample longer than anything seen before.

Recurrent Neural Networks (RNNs): Recurrent Neural Networks take the previous output or hidden state as an additional input, so the composite input at time t carries some historical information about what happened at times T < t. RNNs are useful because their intermediate values (state) can store information about past inputs for a duration that is not fixed a priori.

Sample Feed-forward Network (diagram): a single column x_1 → h_1 → y_1 at t = 1.

Sample RNN (diagram): columns x_t → h_t → y_t for t = 1, 2, 3.

Sample RNN (diagram): the hidden states are connected across time, h_0 → h_1 → h_2 → h_3, with x_t feeding h_t and h_t producing y_t, for t = 1, 2, 3.

The Vanilla RNN Cell: the inputs x_t and h_{t-1} are combined through the weight matrix W to produce the new hidden state, h_t = tanh(W [x_t; h_{t-1}]).
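
To make the cell concrete, here is a minimal NumPy sketch of a single vanilla RNN step; it is an illustration only, and the dimensions, the random initialization, and the names rnn_step and rng are assumptions rather than anything from the original slides.

    import numpy as np

    def rnn_step(x_t, h_prev, W):
        """One vanilla RNN step: h_t = tanh(W [x_t; h_{t-1}])."""
        concat = np.concatenate([x_t, h_prev])  # [x_t; h_{t-1}]
        return np.tanh(W @ concat)              # h_t

    # Illustrative sizes: input dimension 3, hidden dimension 4.
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(4, 3 + 4))  # maps the concatenation to the hidden state
    h_1 = rnn_step(rng.normal(size=3), np.zeros(4), W)
    print(h_1.shape)  # (4,)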

The Vanilla RNN Forward: the cell is unrolled over t = 1, 2, 3, starting from h_0. At each step, h_t = tanh(W [x_t; h_{t-1}]), y_t = F(h_t), and C_t = Loss(y_t, GT_t). The same W is used at every time step (shared weights).

Recurrent Neural Networks (RNNs): Note that the weights are shared over time. Essentially, copies of the RNN cell are made over time (unrolling/unfolding), with different inputs at different time steps.
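
A minimal sketch of this unrolling, assuming a linear readout for F and a squared-error loss (both are stand-in choices for illustration): the same W is applied at every step, and the per-step losses are accumulated.

    import numpy as np

    def rnn_forward(xs, h0, W, W_out, targets):
        """Unrolled forward pass: shared weights W at every step, per-step loss."""
        h, hs, total_loss = h0, [], 0.0
        for x_t, gt_t in zip(xs, targets):
            h = np.tanh(W @ np.concatenate([x_t, h]))    # h_t = tanh(W [x_t; h_{t-1}])
            y_t = W_out @ h                              # y_t = F(h_t), here a linear readout
            total_loss += 0.5 * np.sum((y_t - gt_t)**2)  # C_t = Loss(y_t, GT_t)
            hs.append(h)
        return hs, total_loss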

Sentiment Classification: classify a restaurant review from Yelp or a movie review from IMDB as positive or negative. Inputs: multiple words, one or more sentences. Outputs: positive/negative classification. Examples: "The food was really good." "The chicken crossed the road because it was uncooked."

Sentiment Classification (diagram build-up): the words of the review ("The", "food", ..., "good") are fed into the RNN one per time step, producing hidden states h_1, h_2, ..., h_n. One option: pass the final state h_n to a linear classifier and ignore the intermediate states h_1, ..., h_{n-1}. Another option: summarize the whole sequence as h = Sum(h_1, h_2, ..., h_n) and pass h to the linear classifier. (See http://deeplearning.net/tutorial/lstm.html.) A sketch of both readouts follows.
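
A rough sketch of the two readouts, assuming made-up (untrained) RNN weights W and classifier weights w_clf; a real system would learn these and use word embeddings as the x_t inputs.

    import numpy as np

    def sentiment_score(word_vecs, W, w_clf, pool="last"):
        """Run the RNN over word vectors, then classify a pooled summary."""
        h, hs = np.zeros(W.shape[0]), []
        for x_t in word_vecs:
            h = np.tanh(W @ np.concatenate([x_t, h]))
            hs.append(h)
        summary = hs[-1] if pool == "last" else np.sum(hs, axis=0)  # h_n or Sum(h_1, ..., h_n)
        return 1.0 / (1.0 + np.exp(-w_clf @ summary))               # probability of "positive"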

Image Captioning: given an image, produce a sentence describing its contents. Inputs: an image feature (from a CNN). Outputs: multiple words (let's consider one sentence), e.g.: "The dog is hiding".

Image Captioning (diagram build-up): the CNN image feature initializes the RNN hidden state h_1. At each subsequent step the RNN produces h_2, h_3, ..., and a linear classifier on each hidden state predicts the next word: "The", "dog", ... A generation sketch follows.
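
A generation loop might look like the sketch below. It is only a sketch: the greedy argmax decoding, the start/end token handling, and all the weight names (W_img, W_vocab, embed) are assumptions for illustration, and it feeds the predicted word back in, the variant described a couple of slides further on.

    import numpy as np

    def caption_greedy(cnn_feat, W_img, W, W_vocab, embed, max_len=10, end_id=0):
        """Greedy caption generation from a CNN image feature (illustrative only)."""
        h = np.tanh(W_img @ cnn_feat)                # h_1 from the image feature
        x = embed[end_id]                            # assumed start-token embedding
        words = []
        for _ in range(max_len):
            h = np.tanh(W @ np.concatenate([x, h]))  # next hidden state
            word_id = int(np.argmax(W_vocab @ h))    # linear classifier over the vocabulary
            if word_id == end_id:                    # stop at the end token
                break
            words.append(word_id)
            x = embed[word_id]                       # feed the predicted word back in
        return words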

RNN Outputs: Image Captions Show and Tell: A Neural Image Caption Generator, CVPR 15

RNN Outputs: Language Modeling VIOLA: Why, Salisbury must find his flesh and thought That which I am not aps, not a man and in fire, To show the reining of the raven and the wars To grace my hand reproach within, and not a fair are hand, That Caesar and my goodly father's world; When I was heaven of presence and our fleets, We spare with hours, but cut thy council I am great, Murdered and by thy master's ready there My power to give thee but so much as hell: Some service in the noble bondman here, Would show him to her wine. KING LEAR: O, if you were a feeble sight, the courtesy of your law, Your sight and several breath, will wear the gods With his heads, and my hands are wonder'd at the deeds, So drop upon your lordship's head, and your opinion Shall be against your honour. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Input-Output Scenarios:
- Single to Single: feed-forward network
- Single to Multiple: image captioning
- Multiple to Single: sentiment classification
- Multiple to Multiple: translation, image captioning

Input-Output Scenarios (note): we might deliberately choose to frame our problem as a particular input-output scenario for ease of training or better performance. For example, at each time step, provide the previous word as input for image captioning (Single-Multiple becomes Multiple-Multiple).

The Vanilla RNN Forward: h_t = tanh(W [x_t; h_{t-1}]), y_t = F(h_t), C_t = Loss(y_t, GT_t). Unfold the network through time by making copies of the cell at each time step.

BackPropagation Refresher: for a single layer y = f(x; W) with cost C = Loss(y, y_GT), the SGD update is W ← W − η ∂C/∂W, where ∂C/∂W = (∂C/∂y)(∂y/∂W).
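
As a concrete instance of this update, a sketch of one SGD step for a single linear layer with a squared-error loss (both choices are assumptions made for illustration):

    import numpy as np

    # Single layer y = f(x; W) = W x, cost C = 0.5 * ||y - y_GT||^2.
    def sgd_step(W, x, y_gt, lr=0.1):
        y = W @ x                    # forward pass
        dC_dy = y - y_gt             # dC/dy for the squared-error loss
        dC_dW = np.outer(dC_dy, x)   # dC/dW = (dC/dy)(dy/dW)
        return W - lr * dC_dW        # W <- W - eta * dC/dW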

Multiple Layers: y_1 = f_1(x; W_1), y_2 = f_2(y_1; W_2), C = Loss(y_2, y_GT). SGD updates: W_2 ← W_2 − η ∂C/∂W_2 and W_1 ← W_1 − η ∂C/∂W_1.

Chain Rule for Gradient Computation: with y_1 = f_1(x; W_1), y_2 = f_2(y_1; W_2), and C = Loss(y_2, y_GT), we want ∂C/∂W_1 and ∂C/∂W_2. Applying the chain rule: ∂C/∂W_2 = (∂C/∂y_2)(∂y_2/∂W_2), and ∂C/∂W_1 = (∂C/∂y_1)(∂y_1/∂W_1) = (∂C/∂y_2)(∂y_2/∂y_1)(∂y_1/∂W_1).

Chain Rule for Gradient Computation (layer view): for a layer y = f(x; W), given ∂C/∂y from above, we are interested in computing ∂C/∂W and ∂C/∂x. Intrinsic to the layer are ∂y/∂W (how the output changes with the parameters) and ∂y/∂x (how the output changes with the inputs). Then ∂C/∂W = (∂C/∂y)(∂y/∂W) and ∂C/∂x = (∂C/∂y)(∂y/∂x), the latter being passed down to the layer below. Equations for common layers: http://arunmallya.github.io/writeups/nn/backprop.html
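
One way to sketch this per-layer contract: each layer exposes a forward pass and a backward pass that, given dC/dy from above, returns dC/dW (for the update) and dC/dx (to pass below). A linear layer is used as the illustrative example here.

    import numpy as np

    class Linear:
        """y = W x; the backward pass maps dC/dy to (dC/dW, dC/dx)."""
        def __init__(self, W):
            self.W = W

        def forward(self, x):
            self.x = x                       # cache the input for the backward pass
            return self.W @ x

        def backward(self, dC_dy):
            dC_dW = np.outer(dC_dy, self.x)  # dC/dW = (dC/dy)(dy/dW)
            dC_dx = self.W.T @ dC_dy         # dC/dx = (dC/dy)(dy/dx)
            return dC_dW, dC_dx              # dC/dx is handed to the layer below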

Extension to Computational Graphs: when the output y of f(x; W) feeds two downstream branches f_1(y; W_1) and f_2(y; W_2), the gradients flowing back from the branches are summed at y, ∂C/∂y = ∂C_1/∂y + ∂C_2/∂y (gradient accumulation), before being propagated down to ∂C/∂x as before.
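
A minimal sketch of gradient accumulation with a linear node and two downstream branches; the random values stand in for whatever gradients the branches would actually send back.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))
    x = rng.normal(size=3)
    y = W @ x                       # y feeds both f1 and f2

    dC1_dy = rng.normal(size=4)     # gradient arriving from branch f1
    dC2_dy = rng.normal(size=4)     # gradient arriving from branch f2
    dC_dy = dC1_dy + dC2_dy         # gradient accumulation at y

    dC_dW = np.outer(dC_dy, x)      # then backpropagate through f as usual
    dC_dx = W.T @ dC_dy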

BackPropagation Through Time (BPTT): one of the methods used to train RNNs. The unfolded network (used during the forward pass) is treated as one big feed-forward network that accepts the whole time series as input. The weight updates are computed for each copy in the unfolded network, then summed (or averaged), and applied to the shared RNN weights.

The Unfolded Vanilla RNN: treat the unfolded network as one big feed-forward network. This big network takes the entire sequence as input; compute gradients through the usual backpropagation, then update the shared weights.

The Unfolded Vanilla RNN Forward (diagram: the forward pass through the unrolled network, from x_t and h_{t-1} to h_t, y_t, and C_t).

The Unfolded Vanilla RNN Backward (diagram: the backward pass through the unrolled network, propagating gradients from each C_t back through h_t, ..., h_1).

The Vanilla RNN Backward: with h_t = tanh(W [x_t; h_{t-1}]), y_t = F(h_t), and C_t = Loss(y_t, GT_t), the gradient of the loss at step t with respect to the earliest hidden state is ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1) = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t-1}) ··· (∂h_2/∂h_1), a long product of Jacobian factors.
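
To see the product of Jacobians explicitly, the sketch below multiplies the per-step factors ∂h_t/∂h_{t-1} = diag(1 - h_t^2) W_hh of a tanh RNN; the sizes, sequence length, and random weights are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    H, X, T = 4, 3, 20
    W = rng.normal(scale=0.5, size=(H, X + H))
    W_hh = W[:, X:]                               # recurrent part of W

    h = np.zeros(H)
    J = np.eye(H)                                 # accumulates dh_t/dh_0
    for t in range(T):
        h = np.tanh(W @ np.concatenate([rng.normal(size=X), h]))
        J = (np.diag(1.0 - h**2) @ W_hh) @ J      # chain one more Jacobian factor

    print(np.linalg.norm(J))  # tends to shrink toward 0 or blow up as T grows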

Issues with the Vanilla RNN: in the same way that a product of k real numbers can shrink to zero or explode to infinity, so can a product of matrices. It is sufficient for λ_1 < 1/γ, where λ_1 is the largest singular value of W, for the vanishing gradient problem to occur, and it is necessary for exploding gradients that λ_1 > 1/γ, where γ = 1 for the tanh non-linearity and γ = 1/4 for the sigmoid non-linearity (On the difficulty of training recurrent neural networks, Pascanu et al., 2013). Exploding gradients are often controlled with element-wise or norm gradient clipping.
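
Both forms of clipping can be sketched in a few lines; the thresholds are arbitrary illustrative choices.

    import numpy as np

    def clip_by_norm(grad, max_norm=5.0):
        """Rescale the gradient if its norm exceeds max_norm (norm clipping)."""
        norm = np.linalg.norm(grad)
        return grad * (max_norm / norm) if norm > max_norm else grad

    def clip_elementwise(grad, limit=1.0):
        """Clip each gradient component to [-limit, limit] (element-wise clipping)."""
        return np.clip(grad, -limit, limit)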

The Identity Relationship: recall ∂C_t/∂h_1 = (∂C_t/∂y_t)(∂y_t/∂h_1) = (∂C_t/∂y_t)(∂y_t/∂h_t)(∂h_t/∂h_{t-1}) ··· (∂h_2/∂h_1), with h_t = tanh(W [x_t; h_{t-1}]), y_t = F(h_t), C_t = Loss(y_t, GT_t). Suppose that instead of a matrix multiplication we had an identity relationship between the hidden states: h_t = h_{t-1} + F(x_t), so that ∂h_t/∂h_{t-1} = 1. Then the gradient does not decay as the error is propagated all the way back, aka Constant Error Flow. (Remember ResNets?)

Disclaimer: the explanations in the previous few slides are hand-wavy. For rigorous proofs and derivations, please refer to On the difficulty of training recurrent neural networks (Pascanu et al., 2013), Long Short-Term Memory (Hochreiter & Schmidhuber, 1997), and other sources.

Long Short-Term Memory (LSTM): the LSTM uses this idea of Constant Error Flow for RNNs to create a Constant Error Carousel (CEC), which ensures that gradients don't decay. The key component is a memory cell that acts like an accumulator (it contains the identity relationship) over time. Instead of computing the new state as a matrix product with the old state, it computes the difference between them. Expressivity is the same, but gradients are better behaved. (Long Short-Term Memory, Hochreiter & Schmidhuber, 1997)

The LSTM Idea: a memory cell c_t sits alongside the hidden state. c_t = c_{t-1} + tanh(W [x_t; h_{t-1}]) and h_t = tanh(c_t). (In the original diagram, the dashed line indicates a time-lag.)

The Original LSTM Cell: an input gate i_t and an output gate o_t modulate what enters and leaves the cell. c_t = c_{t-1} + i_t ⊙ tanh(W [x_t; h_{t-1}]), h_t = o_t ⊙ tanh(c_t), with i_t = σ(W_i [x_t; h_{t-1}] + b_i), and similarly for o_t.

The Popular LSTM Cell: adds a forget gate f_t. c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W [x_t; h_{t-1}]), with f_t = σ(W_f [x_t; h_{t-1}] + b_f), and h_t = o_t ⊙ tanh(c_t) as before.
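
Putting the gate equations together, one step of the popular LSTM cell might be sketched as below; the per-gate weight matrices and biases follow the slide's notation, while the shapes and the function names are assumptions for illustration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, W_i, W_f, W_o, b_i, b_f, b_o):
        """One step of the LSTM with input, forget, and output gates."""
        concat = np.concatenate([x_t, h_prev])          # [x_t; h_{t-1}]
        i_t = sigmoid(W_i @ concat + b_i)               # input gate
        f_t = sigmoid(W_f @ concat + b_f)               # forget gate
        o_t = sigmoid(W_o @ concat + b_o)               # output gate
        c_t = f_t * c_prev + i_t * np.tanh(W @ concat)  # cell update
        h_t = o_t * np.tanh(c_t)                        # hidden state
        return h_t, c_t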

LSTM Forward/Backward: go to the Illustrated LSTM Forward and Backward Pass.

Summary: RNNs allow processing of variable-length inputs and outputs by maintaining state information across time steps. Various input-output scenarios are possible (single/multiple). Vanilla RNNs are improved upon by LSTMs, which address the vanishing gradient problem through the CEC. Exploding gradients are handled by gradient clipping. More complex architectures are listed in the course materials for you to read, understand, and present.

Other Useful Resources / References:
http://cs231n.stanford.edu/slides/winter1516_lecture10.pdf
http://www.cs.toronto.edu/~rgrosse/csc321/lec10.pdf
R. Pascanu, T. Mikolov, and Y. Bengio, On the difficulty of training recurrent neural networks, ICML 2013
S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 1997, 9(8), pp. 1735-1780
F.A. Gers and J. Schmidhuber, Recurrent nets that time and count, IJCNN 2000
K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber, LSTM: A search space odyssey, IEEE Transactions on Neural Networks and Learning Systems, 2016
K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, EMNLP 2014
R. Jozefowicz, W. Zaremba, and I. Sutskever, An empirical exploration of recurrent network architectures, ICML 2015