Gated Recurrent Networks

Gated Recurrent Networks Davide Bacciu Dipartimento di Informatica Università di Pisa Intelligent Systems for Pattern Recognition (ISPR)

Outline Recurrent Neural Networks Gradient Issues Dealing with Sequences in NN Variable-size data describing sequentially dependent information (time steps t = 0, 1, 2, ..., N). Neural models need to capture a dynamic context c_t to perform predictions. Recurrent Neural Networks: fully adaptive (Elman, SRN, ...) or randomized approaches (Reservoir Computing). This lecture introduces (deep) gated recurrent networks.

Outline Recurrent Neural Networks Gradient Issues Lecture Outline RNN recap and motivations: learning long-term dependencies is difficult, gradient issues. Gated RNNs: Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU). Advanced topics: understanding and exploiting memory encoding.

Outline Recurrent Neural Networks (RNN) Gradient Issues Unfolding RNN (Forward Pass) By now you should be familiar with the concept of model unfolding/unrolling on the data: the recurrent model is replicated over the input sequence x_0, x_1, x_2, ..., x_t (Graphics credit @ colah.github.io). Unfolding maps an arbitrary-length sequence x_0 .. x_t to a fixed-length memory encoding h_t.

Outline Recurrent Neural Networks (RNN) Gradient Issues Supervised Recurrent Tasks Depending on how input, hidden and output layers are arranged, recurrent models cover element-to-element, sequence-to-item, item-to-sequence and sequence-to-sequence tasks (Graphics credit @ karpathy.github.io).

Outline Recurrent Neural Networks (RNN) Gradient Issues A Non-Gated RNN (a.k.a. Vanilla)
y_t = f(W_out h_t + b_out)
h_t = tanh(g_t(h_{t-1}, x_t))
g_t(h_{t-1}, x_t) = W_h h_{t-1} + W_in x_t + b_h
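The forward pass above is only a few lines of code. Below is a minimal NumPy sketch of one vanilla recurrent step, with weight names chosen to match the equations (sizes and initialization are arbitrary for illustration):

import numpy as np

def rnn_step(x_t, h_prev, W_h, W_in, W_out, b_h, b_out):
    """One step of a vanilla (non-gated) RNN.

    g_t = W_h h_{t-1} + W_in x_t + b_h
    h_t = tanh(g_t)
    y_t = W_out h_t + b_out   (here f is the identity; use softmax etc. as needed)
    """
    g_t = W_h @ h_prev + W_in @ x_t + b_h
    h_t = np.tanh(g_t)
    y_t = W_out @ h_t + b_out
    return h_t, y_t

# Unfold over a sequence: the same weights are reused at every time step.
hidden, inputs, outputs = 16, 8, 4
rng = np.random.default_rng(0)
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_in = rng.normal(scale=0.1, size=(hidden, inputs))
W_out = rng.normal(scale=0.1, size=(outputs, hidden))
b_h, b_out = np.zeros(hidden), np.zeros(outputs)

h = np.zeros(hidden)
for x in rng.normal(size=(10, inputs)):   # a length-10 input sequence
    h, y = rnn_step(x, h, W_h, W_in, W_out, b_h, b_out)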

Outline Recurrent Neural Networks Gradient Issues Learning to Encode Input History Hidden state h t summarizes information on the history of the input signal up to time t

Outline Recurrent Neural Networks Gradient Issues Learning Long-Term Dependencies is Difficult When the time gap between an observation and the current state grows, little residual information about that input remains in the memory. What is the cause? J. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen, TUM, 1991

Outline Recurrent Neural Networks Gradient Issues Exploding/Vanishing Gradient Short story: gradients propagated over many stages tend to vanish (often), so no learning happens, or explode (rarely), causing instability and oscillations. Weights are shared between time steps, so the gradient sums contributions from every time step. Bengio, Simard and Frasconi, Learning long-term dependencies with gradient descent is difficult. TNN, 1994

Outline Recurrent Neural Networks Gradient Issues A Closer Look at the Gradient
∂L_t/∂W = Σ_{k=1..t} (∂L_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W)
W is a parameter matrix, so each factor is a Jacobian; inside ∂h_t/∂h_k you have the chain rule:
∂h_t/∂h_k = (∂h_t/∂h_{t-1}) (∂h_{t-1}/∂h_{t-2}) ... (∂h_{k+1}/∂h_k) = Π_{l=k..t-1} ∂h_{l+1}/∂h_l
∂L_t/∂W = Σ_{k=1..t} (∂L_t/∂h_t) (Π_{l=k..t-1} ∂h_{l+1}/∂h_l) (∂h_k/∂W)
The gradient is a recursive product of hidden-activation gradients (Jacobians).

Outline Recurrent Neural Networks Gradient Issues Bounding the Gradient (I)
Given h_l = tanh(W_h h_{l-1} + W_in x_l), then ∂h_{l+1}/∂h_l = D_{l+1} W_h^T, where the activation Jacobian is D_{l+1} = diag(1 - tanh^2(W_h h_l + W_in x_{l+1})).
∂L_t/∂h_k = (∂L_t/∂h_t) Π_{l=k..t-1} ∂h_{l+1}/∂h_l = (∂L_t/∂h_t) Π_{l=k..t-1} D_{l+1} W_h^T
We are interested in the magnitude of this gradient.

Outline Recurrent Neural Networks Gradient Issues Bounding the Gradient (II)
||∂L_t/∂h_k|| = ||(∂L_t/∂h_t) Π_{l=k..t-1} D_{l+1} W_h^T|| ≤ ||∂L_t/∂h_t|| Π_{l=k..t-1} σ(D_{l+1}) σ(W_h^T)
Bounded by the spectral radius σ: the product can shrink to zero or increase exponentially depending on the spectral properties.
σ < 1: vanishing. σ > 1: exploding.
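A quick numerical illustration of this bound (not from the slides): if every factor in the product has singular values equal to a gain g, the gradient norm scales as g^(t-k). The sketch below drops the activation Jacobians D and uses a random orthogonal recurrent matrix, so all singular values equal the chosen gain; matrix size and step count are arbitrary choices.

import numpy as np

def repeated_jacobian_norm(gain, steps=50, n=32, seed=0):
    """Norm of v (gain * Q)^steps for a random vector v and a random orthogonal Q.

    Every singular value of the Jacobian-like factor equals `gain`, so the
    norm scales exactly as gain**steps: vanishing for gain < 1, exploding for gain > 1.
    """
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix
    v = rng.normal(size=n)
    for _ in range(steps):
        v = v @ (gain * Q)
    return np.linalg.norm(v)

print(repeated_jacobian_norm(0.9))   # shrinks by a factor 0.9**50 ~ 5e-3
print(repeated_jacobian_norm(1.1))   # grows by a factor 1.1**50 ~ 117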

Outline Recurrent Neural Networks Gradient Issues Gradient Clipping for Exploding Gradients
Take g = ∂L_t/∂W. If ||g|| > θ_0, then rescale g ← (θ_0 / ||g||) g.
Rescaling does not work for gradient vanishing: the total gradient ∂L_t/∂W = Σ_{k=1..t} (∂L_t/∂h_t) (∂h_t/∂h_k) (∂h_k/∂W) is a sum of time-dependent contributions, and rescaling preserves the relative contribution of each term, which still decays exponentially with the time gap.
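In most frameworks gradient-norm clipping is a single call before the parameter update. A minimal PyTorch sketch (the model, the dummy data and the threshold θ_0 = 1.0 are placeholder choices):

import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(4, 20, 8)          # (batch, time, features), dummy data
target = torch.randn(4, 20, 16)

output, _ = model(x)
loss = loss_fn(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale the global gradient norm to at most theta_0 = 1.0 before the update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()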

Gated Networks Long-Short Term Memory Gated Recurrent Units Tackling Gradient Issues The solution seems to be having a Jacobian with σ = 1 (which activation function?). A linear recurrence is dominated by the eigenvalues of W_h raised to the power of t. A linear recurrence with weight 1 (state identity), h_t = h_{t-1} + c(x_t), has the desired spectral properties but does not work in practice, as it quickly saturates the memory (e.g. with replicated/non-useful inputs and states).

Gated Networks Long-Short Term Memory Gated Recurrent Units Gating Units Mixture of experts: the origin of gating. A softmax gating network decides how much each local expert (Local Expert 1, ..., Local Expert K) contributes to the output for a given input x. Jacobs et al. (1991), Adaptive Mixtures of Local Experts

Gated Networks Long-Short Term Memory Gated Recurrent Units Long-Short Term Memory (LSTM) Cell Let's start from the vanilla RNN unit and its input activation g_t. S. Hochreiter, J. Schmidhuber, "Long short-term memory", Neural Computation, 1997

Gated Networks Long-Short Term Memory Gated Recurrent Units LSTM Design Step 1 Introduce a linear/identity memory: the internal state c_t combines the past internal state c_{t-1} with the current input x_t.

Gated Networks Long-Short Term Memory Gated Recurrent Units LSTM Design Step 2 (Gates) Input gate I_t(x_t, h_{t-1}): a logistic sigmoid that controls how inputs contribute to the internal state.

Gated Networks Long-Short Term Memory Gated Recurrent Units LSTM Design Step 2 (Gates) Forget gate F_t(x_t, h_{t-1}): a logistic sigmoid that controls how the past internal state c_{t-1} contributes to c_t.

Gated Networks Long-Short Term Memory Gated Recurrent Units LSTM Design Step 2 (Gates) Output gate O_t(x_t, h_{t-1}): a logistic sigmoid that controls what part of the internal state is propagated out of the cell.

Gated Networks Long-Short Term Memory Gated Recurrent Units LSTM in Equations
1) Compute activation of input and forget gates: I_t = σ(W_Ih h_{t-1} + W_Iin x_t + b_I), F_t = σ(W_Fh h_{t-1} + W_Fin x_t + b_F)
2) Compute input potential and internal state: g_t = tanh(W_h h_{t-1} + W_in x_t + b_h), c_t = F_t ⊙ c_{t-1} + I_t ⊙ g_t
3) Compute output gate and output state: O_t = σ(W_Oh h_{t-1} + W_Oin x_t + b_O), h_t = O_t ⊙ tanh(c_t)
where ⊙ denotes element-wise multiplication.
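The three steps translate directly into code. A minimal NumPy sketch of one LSTM step, assuming a dictionary p of pre-initialized weight matrices with hypothetical key names that mirror the equations:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps weight names (e.g. 'W_Ih' with shape (hidden, hidden)) to arrays."""
    # 1) input and forget gates
    I_t = sigmoid(p['W_Ih'] @ h_prev + p['W_Iin'] @ x_t + p['b_I'])
    F_t = sigmoid(p['W_Fh'] @ h_prev + p['W_Fin'] @ x_t + p['b_F'])
    # 2) input potential and internal state
    g_t = np.tanh(p['W_h'] @ h_prev + p['W_in'] @ x_t + p['b_h'])
    c_t = F_t * c_prev + I_t * g_t          # element-wise products
    # 3) output gate and output state
    O_t = sigmoid(p['W_Oh'] @ h_prev + p['W_Oin'] @ x_t + p['b_O'])
    h_t = O_t * np.tanh(c_t)
    return h_t, c_t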

Gated Networks Long-Short Term Memory Gated Recurrent Units Deep LSTM Stack LSTM cells so that the hidden state h_t^1 of the first layer is the input of the second layer, and so on up to the output y_t. LSTM layers extract information at increasing levels of abstraction (enlarging context).
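Stacking is built into most framework APIs. A sketch using PyTorch's nn.LSTM with three layers (all sizes are arbitrary for illustration):

import torch
import torch.nn as nn

# Three stacked LSTM layers: the hidden sequence of each layer feeds the next one.
deep_lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=3, batch_first=True)
readout = nn.Linear(64, 10)          # maps the top-layer state to the output y_t

x = torch.randn(8, 50, 32)           # (batch, time, features), dummy sequences
h_all, (h_n, c_n) = deep_lstm(x)     # h_all: top-layer states for every time step
y = readout(h_all)                   # element-to-element prediction, shape (8, 50, 10)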

Gated Networks Long-Short Term Memory Gated Recurrent Units Training LSTM The original LSTM training algorithm was a mixture of RTRL and BPTT: BPTT on the internal-state gradient and an RTRL-like truncation on the other recurrent connections, hence no exact gradient calculation. All current LSTM implementations use full BPTT training, introduced by Graves and Schmidhuber in 2005, typically with the Adam or RMSProp optimizer.

Gated Networks Long-Short Term Memory Gated Recurrent Units Regularizing LSTM - Dropout Randomly disconnect units from the network during training. Regulated by a unit-dropping probability hyperparameter. Prevents unit co-adaptation (committee machine effect). The prediction phase needs to be adapted accordingly. For recurrent networks, drop units for the whole sequence! N. Srivastava et al, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR 2014

Gated Networks Long-Short Term Memory Gated Recurrent Units Regularizing LSTM - Dropout You can also drop single connections instead of whole units (DropConnect). As with unit dropout, it is regulated by a dropping probability hyperparameter, prevents co-adaptation (committee machine effect), requires adapting the prediction phase, and the dropped elements stay dropped for the whole sequence (see the sketch below).
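One way to read "drop units for the whole sequence" is to sample the dropout mask once per sequence and reuse it at every time step (in the spirit of variational dropout). A hedged NumPy sketch, reusing the lstm_step helper defined earlier; the mask placement shown here is an illustrative choice, not the only possible one:

import numpy as np

def run_with_sequence_dropout(xs, h0, c0, params, p_drop=0.3, rng=None):
    """Unfold an LSTM over a sequence, dropping the same hidden units at every step.

    xs: array of shape (T, input_dim); h0, c0: initial states; params: LSTM weights.
    The mask is sampled once per sequence, not once per time step.
    """
    rng = rng or np.random.default_rng()
    keep = (rng.random(h0.shape) >= p_drop) / (1.0 - p_drop)   # inverted dropout scaling
    h, c = h0, c0
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h * keep, c, params)   # same mask reused at each step
        states.append(h)
    return np.stack(states)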

Gated Networks Long-Short Term Memory Gated Recurrent Units Practicalities: Minibatch and Truncated BPTT Minibatch (MB) training: instead of averaging the gradient on the full training data or performing stochastic gradient descent sequence-by-sequence, split the data into minibatches MB1, MB2, ..., MBn and update with the gradient ∂L_MBi/∂W of each minibatch. Truncated gradient propagation: backpropagate through time only over a limited window of past steps rather than the full sequence h_1 .. h_T.
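A common way to truncate BPTT in practice is to process long sequences in chunks and detach the recurrent state between chunks, so gradients never flow past the chunk boundary. A minimal PyTorch sketch (chunk length, sizes and loss are placeholder choices):

import torch
import torch.nn as nn

model = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 8)
optimizer = torch.optim.RMSprop(list(model.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.MSELoss()

long_x = torch.randn(4, 1000, 8)       # one batch of long sequences
long_y = torch.randn(4, 1000, 8)
chunk = 50                             # truncation window

state = None
for start in range(0, long_x.size(1), chunk):
    x = long_x[:, start:start + chunk]
    y = long_y[:, start:start + chunk]
    out, state = model(x, state)
    loss = loss_fn(head(out), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Keep the state values but cut the graph: gradients stop at the chunk boundary.
    state = tuple(s.detach() for s in state)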

Gated Networks Long-Short Term Memory Gated Recurrent Units Gated Recurrent Unit (GRU) The reset gate acts directly on the output state (no internal state and no output gate); reset and update gates, when coupled, act as input and forget gates.
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
h̃_t = tanh(W_hh (r_t ⊙ h_{t-1}) + W_hin x_t + b_h)
z_t = σ(W_zh h_{t-1} + W_zin x_t + b_z)
r_t = σ(W_rh h_{t-1} + W_rin x_t + b_r)
K. Cho et al, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014
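The same equations as a NumPy sketch, in the same style as the LSTM step above (the weight-dictionary keys are hypothetical names mirroring the formulas):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps weight names (e.g. 'W_zh') to arrays."""
    z_t = sigmoid(p['W_zh'] @ h_prev + p['W_zin'] @ x_t + p['b_z'])     # update gate
    r_t = sigmoid(p['W_rh'] @ h_prev + p['W_rin'] @ x_t + p['b_r'])     # reset gate
    h_cand = np.tanh(p['W_hh'] @ (r_t * h_prev) + p['W_hin'] @ x_t + p['b_h'])
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand                           # interpolation
    return h_t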

Advanced Models & Software Conclusions Bidirectional LSTM Character Recognition From bottom to top: original input, preprocessed input, bidirectional LSTM layers (one LSTM scanning left-to-right and one right-to-left), and a character distribution with one output for each character plus a no-output symbol. A. Graves, A novel connectionist system for unconstrained handwriting recognition, TPAMI 2009

Advanced Models & Software Conclusions Generative Use of LSTM/GRU A stack of LSTM layers (LSTM1-LSTM3, with bypass connections) predicts the next character: teacher forcing at training time (feed the ground-truth character), refeeding the generated output at prediction time. A. Graves, Generating Sequences With Recurrent Neural Networks, 2013
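At prediction time the refeeding loop simply samples a character from the output distribution and feeds it back as the next input. A hedged PyTorch sketch, assuming a trained character-level model split into an embedding, a batch-first LSTM and a linear head (all names below are placeholders):

import torch

def generate(char_lstm, embed, head, char2idx, idx2char, seed="H", length=100, temperature=1.0):
    """Sample a character at each step and refeed it as the next input."""
    state = None
    chars = list(seed)
    x = torch.tensor([[char2idx[c] for c in seed]])          # (1, len(seed))
    with torch.no_grad():
        for _ in range(length):
            out, state = char_lstm(embed(x), state)          # out: (1, T, hidden)
            logits = head(out[:, -1]) / temperature          # last time step only
            probs = torch.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, num_samples=1)    # sample next character id
            chars.append(idx2char[nxt.item()])
            x = nxt                                          # refeed the sampled output
    return "".join(chars)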

Advanced Models & Software Conclusions Character Generation Fun Shakespeare PANDARUS: Alas, I think he shall be come approached and the day When little srain would be attain'd into being never fed, And who is but a chain and subjects of his death, I should not sleep. Second Senator: They are away this miseries, produced upon my soul, Breaking and strongly should be buried, when I perish The earth and thoughts of many states. http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Advanced Models & Software Conclusions Character Generation Fun Linux Kernel Code /* * If this error is set, we will need anything right after that BSD. */ static void action_new_function(struct s_stat_info *wb) { unsigned long flags; int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT); buf[0] = 0xFFFFFFFF & (bit << 4); min(inc, slist->bytes); printk(kern_warning "Memory allocated %02x/%02x, " "original MLL instead\n"), min(min(multi_run - s->len, max) * num_data_in), frame_pos, sz + first_seg); div_u64_w(val, inb_p); spin_unlock(&disk->queue_lock); mutex_unlock(&s->sock->mutex); mutex_unlock(&func->mutex); return disassemble(info->pending_bh); } http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Advanced Models & Software Conclusions Generate Sad Jokes A 3-layer LSTM network trained to generate English jokes character by character. Why did the boy stop his homework? Because they re bunny boo! Q: Why did the death penis learn string? A: Because he wanted to have some roasts case! What do you get if you cross a famous California little boy with an elephant for players? Market holes.

Advanced Models & Software Conclusions Understanding Memory Representation At layer 3, neurons show some form of context-induced representation of subsequences (words).

Advanced Models & Software Conclusions Understanding Memory Representation Neurons in early recurrent layers tend to organize according to sequence suffix

Advanced Models & Software Conclusions Recursive Gated Networks Recursive LSTM cell for binary trees Unfolding on parse trees for sentiment analysis

Advanced Models & Software Conclusions Encoder-Decoder Architectures The sequence-to-sequence framework: a deep LSTM model whose activations are used as an encoding of the full input sequence to seed output-sentence generation. I. Sutskever et al, Sequence to Sequence Learning with Neural Networks, NIPS 2014
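A compact sketch of the idea in PyTorch, with attention, beam search and real vocabulary handling omitted; all sizes and names are illustrative:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM summarizes the source; its final state seeds the decoder LSTM."""
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))            # state encodes the whole input
        dec_out, _ = self.decoder(self.tgt_emb(tgt), state)   # teacher forcing on the target
        return self.out(dec_out)                              # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (4, 12)), torch.randint(0, 1200, (4, 15)))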

Advanced Models & Software Conclusions More Differentiable Compositions CNN-LSTM Composition for image-to-sequence (NeuralTalk) A cat is sitting on a toilet seat A. Karpathy and L. Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015 https://github.com/karpathy/neuraltalk2 A woman holding a teddy bear in front of a mirror

Advanced Models & Software Conclusions RNN: A Modern View Are RNNs only for sequential/structured data? RNNs can be viewed more generally as stateful systems.

Advanced Models & Software Conclusions Software Standard LSTM and GRU are available in all deep learning frameworks (Python and others) as well as in Matlab's Neural Network Toolbox. If you want to play with one-element-ahead sequence generation, try out the char-rnn implementations: https://github.com/karpathy/char-rnn (ORIGINAL) https://github.com/sherjilozair/char-rnn-tensorflow https://github.com/crazydonkey200/tensorflow-char-rnn http://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html

Advanced Models & Software Conclusions Take Home Messages Learning long-term dependencies can be difficult due to gradient vanishing/explosion. The gated RNN solution: gates are neurons whose output is used to scale another neuron's output; use gates to determine what information can enter (or exit) the internal state. Training gated RNNs is not always straightforward. Deep RNNs can be used in generative mode and can be seeded with neural embeddings. Deep RNNs act as stateful and differentiable machines.

Deep Autoencoder Advanced Models & Software Conclusions Next Lecture Next week we will have two practical lectures on deep learning frameworks: Antonio Carta (PyTorch) and Francesco Crecchi (TensorFlow). Coming up next: advanced DL topics (attention models, variational deep learning), Midterm 2 discussions. Thursday 3rd May 2018, h. 11-13, Room N