Recurrent Neural Networks. Deep Learning Lecture 5. Efstratios Gavves

Sequential Data. So far, all tasks assumed stationary data. However, neither all data nor all tasks are stationary.

Sequential Data: Text. What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Memory needed. Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i−1}). What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Sequential data: Video

Quiz: Other sequential data?

Quiz: Other sequential data? Time series data: stock exchange, biological measurements, climate measurements, market analysis, speech/music, user behavior in websites, ...

Applications. [Slide links to a YouTube video and a website demo.]

Machine Translation. The phrase in the source language is one sequence, e.g. "Today the weather is good". The phrase in the target language is also a sequence, e.g. "Погода сегодня хорошая" (Russian for "The weather is good today"). [Figure: an encoder-decoder chain of LSTM cells; the encoder reads "Today the weather is good <EOS>", the decoder then emits "Погода сегодня хорошая <EOS>".]
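To make the encoder-decoder picture concrete, here is a minimal sketch in PyTorch (not the lecture's code); the vocabulary sizes, embedding/hidden dimensions, and the Seq2Seq class name are illustrative assumptions.

```python
# Minimal encoder-decoder sketch: an LSTM reads the source sentence, its final
# state initializes a second LSTM that predicts the target sentence word by word.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder summarizes the source sentence in its final (h, c) state.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder starts from that state and predicts the next target word at each step.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                 # (batch, tgt_len, tgt_vocab) logits

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 6))             # e.g. "Today the weather is good <EOS>"
tgt = torch.randint(0, 1000, (2, 5))
logits = model(src, tgt)                         # train with cross-entropy on these logits
```

During training the decoder is fed the shifted target sentence (teacher forcing); at test time it would feed back its own predictions until it emits <EOS>.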

Image captioning. An image is a thousand words, literally! Pretty much the same as machine translation: replace the encoder part with the output of a ConvNet, e.g. an AlexNet or VGG16 network, and keep the decoder part to operate as a translator. [Figure: a ConvNet feeds the initial state of an LSTM decoder, which emits "Today the weather is good <EOS>".]

Demo. [Slide links to a YouTube video.]

Question answering. Bleeding-edge research, no real consensus; very interesting open research problems. Again, pretty much like machine translation: the encoder-decoder paradigm. Feed the question to the encoder part and model the answer at the decoder part. Question answering with images is also possible; again, bleeding-edge research. How/where to add the image? What has worked so far is to add the image only at the beginning. Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter? A: John entered the living room. Q: What are the people playing? A: They play beach football.

Demo. [Slide links to a website.]

Handwriting. [Slide links to a website demo.]

Recurrent Networks. [Figure: an NN cell with an input, an output, a cell state, and recurrent connections feeding the state back into the cell.]

Sequences. Next data depend on previous data; this is roughly equivalent to predicting what comes next: Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i−1}). Example: "What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?"

Why sequences? Parameters are reused: fewer parameters, easier modelling. Generalizes well to arbitrary lengths: not constrained to a specific length. E.g. RecurrentModel("I think, therefore, I am!") or RecurrentModel("Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter."). However, we often pick a frame T instead of an arbitrary length: Pr(x) = ∏_i Pr(x_i | x_{i−T}, …, x_{i−1}).
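As a small illustration of the truncated factorization Pr(x) = ∏_i Pr(x_i | x_{i−T}, …, x_{i−1}), the sketch below sums log-probabilities over a context window of T words; the uniform model_prob stand-in and the vocabulary size are assumptions, any learned model could take its place.

```python
# Log-probability of a word sequence under a model that only conditions on the last T words.
import numpy as np

VOCAB = 1000

def model_prob(word_id, context_ids):
    # Stand-in model: uniform over the vocabulary (a real model would use the context).
    return 1.0 / VOCAB

def sequence_log_prob(word_ids, T=5):
    log_p = 0.0
    for i, w in enumerate(word_ids):
        ctx = word_ids[max(0, i - T):i]      # only the last T words condition x_i
        log_p += np.log(model_prob(w, ctx))
    return log_p

print(sequence_log_prob([1, 7, 42, 7, 3]))   # 5 * log(1/1000)
```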

Quiz: What is really a sequence? Data inside a sequence are ...? [Figure: predicting the next word of "I am Bond, James ___" among the candidates Bond, McGuire, tired, am, "!".]

Quiz: What is really a sequence? Data inside a sequence are non-i.i.d., i.e. not independently and identically distributed. The next word depends on the previous words, ideally on all of them. We need context, and we need memory! How to model context and memory? [Figure: "I am Bond, James ___" with the correct continuation "Bond" among the candidates McGuire, tired, am, "!".]

One-hot vectors. A one-hot vector x_i has all zeros except for the active dimension. 12 words in a sequence give 12 one-hot vectors. After the one-hot vectors, apply an embedding, e.g. Word2Vec or GloVe. [Figure: the vocabulary "I am Bond James tired , McGuire !" and the corresponding one-hot vectors, each with a single 1 at the position of its word.]
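A minimal numpy sketch of the idea, assuming a toy vocabulary and a random 3-dimensional embedding matrix in place of learned Word2Vec/GloVe vectors.

```python
# One-hot encoding followed by an embedding lookup.
import numpy as np

vocab = ["I", "am", "Bond", "James", "tired", ",", "McGuire", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    x = np.zeros(len(vocab))
    x[word_to_idx[word]] = 1.0          # all zeros except the active dimension
    return x

E = np.random.randn(len(vocab), 3)      # embedding matrix (learned, Word2Vec, GloVe, ...)
x = one_hot("Bond")
print(x)                                # [0. 0. 1. 0. 0. 0. 0. 0.]
print(x @ E)                            # multiplying by a one-hot vector = looking up row E[2]
```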

Quiz: Why not just indices? One-hot representation OR index representation? [Figure: the words I, am, James, McGuire as one-hot vectors versus as indices x("I") = 1, x("am") = 2, x("James") = 4, x("McGuire") = 7.]

Quiz: Why not just indices? With one-hot vectors every pair of distinct words is equally far apart, e.g. distance(x_"am", x_"McGuire") = distance(x_"I", x_"am"). With indices q("I") = 1, q("am") = 2, q("James") = 4, q("McGuire") = 7, we get distance(q_"am", q_"McGuire") = 5 but distance(q_"I", q_"am") = 1. So no, indices are not a good idea, because some words get closer together for no good reason (an artificial bias between words).
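A tiny numeric check of the quiz answer, using the indices from the slide; the 8-dimensional one-hot space is an assumption.

```python
# One-hot vectors keep all distinct words equally far apart; raw indices impose an arbitrary ordering.
import numpy as np

idx = {"I": 1, "am": 2, "James": 4, "McGuire": 7}
onehot = {w: np.eye(8)[i] for w, i in idx.items()}

print(np.linalg.norm(onehot["am"] - onehot["McGuire"]))  # sqrt(2), the same for any pair
print(np.linalg.norm(onehot["I"] - onehot["am"]))        # sqrt(2)
print(abs(idx["am"] - idx["McGuire"]))                   # 5  -> artificial bias between words
print(abs(idx["I"] - idx["am"]))                         # 1
```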

Memory. A representation of the past. A memory must project information at timestep t onto a latent space c_t using parameters θ, then re-use the projected information from t at t + 1: c_{t+1} = h(x_{t+1}, c_t; θ). The memory parameters θ are shared for all timesteps t = 0, 1, …: c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t−1}, …, h(x_1, c_0; θ); θ); θ); θ).

Memory as a Graph. Simplest model: input with parameters U, memory embedding with parameters W, output with parameters V. [Figure: the graph unrolled over timesteps t, t+1, t+2, t+3; at each step the input x_t enters through U, the memory c_t is updated from c_{t−1} through W, and the output y_t is produced through V.]

Folding the memory. [Figure: the unrolled/unfolded network with states c_t, c_{t+1}, c_{t+2} and shared parameters U, V, W, next to the equivalent folded network, a single cell with a recurrent W connection from c_{t−1}.]

Recurrent Neural Network (RNN). Only two equations: c_t = tanh(U x_t + W c_{t−1}) and y_t = softmax(V c_t). [Figure: the folded RNN cell with input x_t (through U), recurrent state c_t (through W from c_{t−1}) and output y_t (through V).]

RNN Example. Vocabulary of 5 words. A memory of 3 units (a hyperparameter that we choose, like layer size): c_t is 3 × 1 and W is 3 × 3. An input projection of 3 dimensions: U is 3 × 5, and x_{t=4} is the one-hot vector of the fourth word. An output projection of 10 dimensions: V is 10 × 3. c_t = tanh(U x_t + W c_{t−1}), y_t = softmax(V c_t).
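A minimal numpy sketch of these two equations with the example sizes above; the exact output dimensionality (10) is a guess where the slide text is garbled.

```python
# One forward step of the simple RNN: c_t = tanh(U x_t + W c_{t-1}), y_t = softmax(V c_t).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V_SIZE, HIDDEN, OUT = 5, 3, 10
U = np.random.randn(HIDDEN, V_SIZE) * 0.1   # input projection,  U: 3 x 5
W = np.random.randn(HIDDEN, HIDDEN) * 0.1   # memory parameters, W: 3 x 3
V = np.random.randn(OUT, HIDDEN) * 0.1      # output projection, V: 10 x 3

x_t = np.zeros(V_SIZE); x_t[3] = 1.0        # one-hot input, e.g. the 4th word
c_prev = np.zeros(HIDDEN)                   # c_0

c_t = np.tanh(U @ x_t + W @ c_prev)
y_t = softmax(V @ c_t)
print(c_t.shape, y_t.shape, y_t.sum())      # (3,) (10,) 1.0
```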

Rolled Network vs. MLP? What is really different? Steps instead of layers. Step parameters are shared in a Recurrent Network, whereas in a Multi-Layer Network the parameters of each layer are different. Sometimes the intermediate outputs are not even needed; removing them, we almost end up with a standard Neural Network. [Figure: a 3-gram unrolled recurrent network with shared U, W, V at every step, next to a 3-layer neural network with distinct W_1, W_2, W_3.]

Training Recurrent Networks. Cross-entropy loss: P = ∏_{t,k} y_{tk}^{l_{tk}}, so L = −log P = ∑_t L_t = −(1/T) ∑_t l_t log y_t. Backpropagation Through Time (BPTT): again the chain rule; the only difference is that gradients survive over time steps.
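A short numpy sketch of the summed per-timestep cross-entropy, with made-up predictions and labels; the sizes are assumptions.

```python
# L = -(1/T) * sum_t sum_k l_{t,k} * log y_{t,k}: one cross-entropy term per timestep, averaged.
import numpy as np

T, K = 4, 5                                   # timesteps, classes
y = np.random.dirichlet(np.ones(K), size=T)   # softmax outputs y_{t,k}, rows sum to 1
labels = np.array([1, 3, 0, 2])
l = np.eye(K)[labels]                         # one-hot targets l_{t,k}

L_t = -(l * np.log(y)).sum(axis=1)            # loss per timestep
L = L_t.mean()                                # = (1/T) * sum_t L_t
print(L_t, L)
```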

Training RNNs is hard. Vanishing gradients: after a few time steps the gradients become almost 0. Exploding gradients: after a few time steps the gradients become huge. Either way, the network can't capture long-term dependencies.

Alternative formulation for RNNs. An alternative formulation to derive conclusions and intuitions: c_t = W tanh(c_{t−1}) + U x_t + b, with loss L = ∑_t L_t(c_t).

Another look at the gradients. L = L(c_T(c_{T−1}(… c_1(x_1, c_0; W); W …); W); W). By the chain rule, ∂L_t/∂W = ∑_{τ=1}^{t} (∂L_t/∂c_t)(∂c_t/∂c_τ)(∂c_τ/∂W), where ∂c_t/∂c_τ = (∂c_t/∂c_{t−1})(∂c_{t−1}/∂c_{t−2}) … (∂c_{τ+1}/∂c_τ), whose norm behaves like η^{t−τ} for some factor η. The terms with t ≫ τ are the long-term factors; the rest are the short-term factors. The RNN gradient is a recursive product of ∂c_t/∂c_{t−1}.

Exploding/Vanishing gradients. ∂L/∂c_t = (∂L/∂c_T)(∂c_T/∂c_{T−1})(∂c_{T−1}/∂c_{T−2}) … (∂c_{t+1}/∂c_t). If each factor has norm < 1, then ∂L/∂W → 0: vanishing gradient. If each factor has norm > 1, then ∂L/∂W → ∞: exploding gradient.
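A two-line numeric illustration of why this recursive product matters: 50 factors slightly below 1 collapse to almost nothing, while 50 factors slightly above 1 blow up.

```python
# Product of T per-step factors dc_k/dc_{k-1} with norm just below or just above 1.
import numpy as np

T = 50
for factor in (0.9, 1.1):
    print(factor, np.prod(np.full(T, factor)))   # 0.9**50 ~ 5e-3, 1.1**50 ~ 117
```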

Vanishing gradients. The gradient of the error w.r.t. an intermediate cell: ∂L_t/∂W = ∑_{τ=1}^{t} (∂L_t/∂y_t)(∂y_t/∂c_t)(∂c_t/∂c_τ)(∂c_τ/∂W), with ∂c_t/∂c_τ = ∏_{τ<k≤t} ∂c_k/∂c_{k−1} = ∏_{τ<k≤t} W ∂tanh(c_{k−1}). Long-term dependencies get exponentially smaller weights.

Gradient clipping for exploding gradients. Scale the gradients to a threshold. Pseudocode: 1. g ← ∂L/∂W. 2. If ‖g‖ > θ_0: g ← (θ_0/‖g‖) g; else: do nothing. Simple, but works!
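A direct numpy rendering of the pseudocode; the threshold value is an arbitrary choice.

```python
# Rescale the gradient vector so its norm never exceeds the threshold theta0.
import numpy as np

def clip_gradient(g, theta0=5.0):
    norm = np.linalg.norm(g)
    if norm > theta0:
        g = (theta0 / norm) * g        # after rescaling, ||g|| == theta0
    return g

g = np.array([30.0, -40.0])            # ||g|| = 50
print(clip_gradient(g))                # [ 3. -4.]  -> norm 5
```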

Rescaling vanishing gradients? Not a good solution. Weights are shared between timesteps and the loss is summed over timesteps: L = ∑_t L_t, ∂L/∂W = ∑_t ∂L_t/∂W, with ∂L_t/∂W = ∑_{τ=1}^{t} (∂L_t/∂c_τ)(∂c_τ/∂W) = ∑_{τ=1}^{t} (∂L_t/∂c_t)(∂c_t/∂c_τ)(∂c_τ/∂W). Rescaling the gradient of one timestep (∂L_t/∂W) affects all timesteps, and the rescaling factor for one timestep does not work for another.

More intuitively: ∂L/∂W = ∂L_1/∂W + ∂L_2/∂W + ∂L_3/∂W + ∂L_4/∂W + ∂L_5/∂W.

More intuitively: let's say ∂L_1/∂W ≈ 1, ∂L_2/∂W ≈ 1/10, ∂L_3/∂W ≈ 1/100, ∂L_4/∂W ≈ 1/1000, ∂L_5/∂W ≈ 1/10000, so ∂L/∂W = ∑_r ∂L_r/∂W = 1.1111. If ∂L/∂W is rescaled to 1, then ∂L_5/∂W ends up around 10^{−4}: the longer-term dependencies become negligible, the recurrent modelling is weak, and learning focuses on the short term only.

Recurrent networks Chaotic systems

Fixing vanishing gradients. Regularization on the recurrent weights: force the error signal not to vanish, Ω = ∑_t Ω_t = ∑_t ( ‖(∂L/∂c_{t+1})(∂c_{t+1}/∂c_t)‖ / ‖∂L/∂c_{t+1}‖ − 1 )². Advanced recurrent modules: the Long Short-Term Memory module and the Gated Recurrent Unit module. Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013.

Advanced Recurrent Nets. [Figure: the LSTM cell, with previous cell state c_{t−1}, gates f_t, i_t, o_t, candidate memory c̃_t (tanh), new cell state c_t, previous output m_{t−1}, new output m_t, and input x_t.]

How to fix the vanishing gradients? The error signal over time must have a norm that is neither too large nor too small. Look again at the loss gradient: ∂L_t/∂W = ∑_{τ=1}^{t} (∂L_t/∂y_t)(∂y_t/∂c_t)(∂c_t/∂c_τ)(∂c_τ/∂W), with ∂c_t/∂c_τ = ∏_{τ<k≤t} ∂c_k/∂c_{k−1}. Solution: have an activation function with gradient equal to 1. The identity function has a gradient equal to 1; using it on the cell-state path keeps the gradients from becoming too small or too large.

Long Short-Term Memory. LSTM: a beefed-up RNN. Simple RNN: c_t = W tanh(c_{t−1}) + U x_t + b. LSTM: i = σ(x_t U^(i) + m_{t−1} W^(i)), f = σ(x_t U^(f) + m_{t−1} W^(f)), o = σ(x_t U^(o) + m_{t−1} W^(o)), c̃_t = tanh(x_t U^(g) + m_{t−1} W^(g)), c_t = c_{t−1} ⊙ f + c̃_t ⊙ i, m_t = tanh(c_t) ⊙ o. The previous state c_{t−1} is connected to the new c_t with no nonlinearity (identity function); the only other factor is the forget gate f, which rescales the previous LSTM state. More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
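A compact numpy sketch of one LSTM step following these equations; the weight shapes, the initialization, and the sigmoid helper are my own glue code, not the lecture's.

```python
# One LSTM step: gates i, f, o, candidate memory c_tilde, new cell state c_t, new output m_t.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, m_prev, c_prev, params):
    Ui, Wi, Uf, Wf, Uo, Wo, Ug, Wg = params
    i = sigmoid(x_t @ Ui + m_prev @ Wi)          # input gate
    f = sigmoid(x_t @ Uf + m_prev @ Wf)          # forget gate
    o = sigmoid(x_t @ Uo + m_prev @ Wo)          # output gate
    c_tilde = np.tanh(x_t @ Ug + m_prev @ Wg)    # candidate memory
    c_t = c_prev * f + c_tilde * i               # previous state passes through with no nonlinearity
    m_t = np.tanh(c_t) * o                       # new output / memory
    return m_t, c_t

D, H = 4, 3                                      # input and hidden sizes (assumed)
params = [np.random.randn(D, H) * 0.1 if k % 2 == 0 else np.random.randn(H, H) * 0.1
          for k in range(8)]                     # Ui, Wi, Uf, Wf, Uo, Wo, Ug, Wg
m, c = np.zeros(H), np.zeros(H)
m, c = lstm_cell(np.random.randn(D), m, c, params)
print(m.shape, c.shape)                          # (3,) (3,)
```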

Cell state. The cell state carries the essential information over time. [Figure: the cell-state line running horizontally through the LSTM cell, with the LSTM equations from the previous slide alongside.]

LSTM nonlinearities. σ ∈ (0, 1): control gate, something like a switch. tanh ∈ [−1, 1]: recurrent nonlinearity. [Figure: the LSTM cell with the gate and tanh nonlinearities highlighted.]

LSTM Step-by-Step: Step (1). E.g. an LSTM on "Yesterday she slapped me. Today she loves me." Decide what to forget and what to remember for the new memory: sigmoid ≈ 1 means remember everything, sigmoid ≈ 0 means forget everything. Forget gate: f_t = σ(x_t U^(f) + m_{t−1} W^(f)).

LSTM Step-by-Step: Step (2). Decide what new information is relevant from the new input and should be added to the new memory: modulate the input with i_t = σ(x_t U^(i) + m_{t−1} W^(i)) and generate candidate memories c̃_t = tanh(x_t U^(g) + m_{t−1} W^(g)).

LSTM Step-by-Step: Step (3). Compute and update the current cell state c_t = c_{t−1} ⊙ f_t + c̃_t ⊙ i_t. It depends on the previous cell state, what we decide to forget, what inputs we allow, and the candidate memories.

LSTM Step-by-Step: Step (4). Modulate the output: is the new cell state relevant? Sigmoid ≈ 1 if yes, sigmoid ≈ 0 if not. Generate the new memory m_t = tanh(c_t) ⊙ o_t, with o_t = σ(x_t U^(o) + m_{t−1} W^(o)).

LSTM Unrolled Network. Macroscopically very similar to standard RNNs; the engine is a bit different (more complicated). Because of their gates, LSTMs capture both long-term and short-term dependencies. [Figure: several LSTM cells chained over time.]
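For completeness, an off-the-shelf unrolled LSTM in PyTorch over a short sequence (sizes are illustrative); the library applies the same gated cell at every step and returns the per-step outputs plus the final states.

```python
# Unrolling an LSTM over a sequence of 6 steps with shared parameters at every step.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
x = torch.randn(1, 6, 4)                    # one sequence of 6 timesteps
outputs, (h_n, c_n) = lstm(x)               # outputs: m_t at every step; c_n: final cell state
print(outputs.shape, h_n.shape, c_n.shape)  # (1, 6, 3) (1, 1, 3) (1, 1, 3)
```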

Beyond RNN & LSTM. LSTM with peephole connections: the gates also have access to the previous cell state c_{t−1} (not only the memories). Coupled forget and input gates: c_t = f_t ⊙ c_{t−1} + (1 − f_t) ⊙ c̃_t. Bi-directional recurrent networks. Gated Recurrent Units (GRU). Deep recurrent architectures. Recursive neural networks (tree-structured). Multiplicative interactions. Generative recurrent architectures. [Figure: a two-layer stacked recurrent network, LSTM(1) feeding LSTM(2) at every timestep.]
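As one example from this list, here is a minimal numpy sketch of a Gated Recurrent Unit step; the weight names and sizes are illustrative, and note that which term gets z versus (1 − z) varies between formulations.

```python
# One GRU step: update gate z, reset gate r, candidate state m_tilde, new state as a convex combination.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, m_prev, params):
    Uz, Wz, Ur, Wr, Uh, Wh = params
    z = sigmoid(x_t @ Uz + m_prev @ Wz)            # update gate (plays the coupled forget/input role)
    r = sigmoid(x_t @ Ur + m_prev @ Wr)            # reset gate
    m_tilde = np.tanh(x_t @ Uh + (r * m_prev) @ Wh)
    return (1.0 - z) * m_prev + z * m_tilde        # convex combination of old state and candidate

D, H = 4, 3
params = [np.random.randn(D, H) * 0.1 if k % 2 == 0 else np.random.randn(H, H) * 0.1
          for k in range(6)]                       # Uz, Wz, Ur, Wr, Uh, Wh
print(gru_cell(np.random.randn(D), np.zeros(H), params).shape)   # (3,)
```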

Take-away message Recurrent Neural Networks (RNN) for sequences Backpropagation Through Time Vanishing and Exploding Gradients and Remedies RNNs using Long Short-Term Memory (LSTM) Applications of Recurrent Neural Networks

Thank you!