Deep Learning: Recurrent Neural Networks (RNNs). Ali Ghodsi. October 23, 2015. Slides are partially based on the book in preparation, Deep Learning.


Recurrent Neural Networks (RNNs). University of Waterloo, October 23, 2015. Slides are partially based on the book in preparation, Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2015.

Sequential data. Recurrent neural networks (RNNs) are often used for handling sequential data. They were first introduced in 1986 (Rumelhart et al., 1986). Sequential data usually involves variable-length inputs.

Parameter sharing. Parameter sharing makes it possible to extend and apply the model to examples of different lengths and to generalize across them. (Figure: networks with inputs x_1, x_2 and output y_1.)

Recurrent neural network Figure: Richard Socher

Dynamic systems. The classical form of a dynamical system: s_t = f_θ(s_{t-1})

Dynamic systems. Now consider a dynamical system with an external signal x: s_t = f_θ(s_{t-1}, x_t). The state contains information about the whole past sequence: s_t = g_t(x_t, x_{t-1}, x_{t-2}, ..., x_2, x_1).

Parameter sharing. We can think of s_t as a summary of the past sequence of inputs up to time t. If we defined a different function g_t for each possible sequence length, we would not get any generalization. Using the same parameters for every sequence length allows much better generalization.

Recurrent Neural Networks.
a_t = b + W s_{t-1} + U x_t
s_t = tanh(a_t)
o_t = c + V s_t
p_t = softmax(o_t)
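To make the recurrence concrete, here is a minimal NumPy sketch of this forward pass; the dimensions, the random initialization, and the variable names are assumptions made only for illustration.

import numpy as np

def rnn_forward(x_seq, s0, U, W, V, b, c):
    """Run the vanilla RNN recurrence over a sequence of input vectors."""
    s, states, probs = s0, [], []
    for x_t in x_seq:
        a_t = b + W @ s + U @ x_t            # a_t = b + W s_{t-1} + U x_t
        s = np.tanh(a_t)                     # hidden state s_t
        o_t = c + V @ s                      # output scores o_t
        p_t = np.exp(o_t - o_t.max())        # numerically stable softmax
        p_t /= p_t.sum()
        states.append(s)
        probs.append(p_t)
    return states, probs

# Toy sizes (assumed): 4-dim inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
U, W, V = rng.normal(0, 0.1, (8, 4)), rng.normal(0, 0.1, (8, 8)), rng.normal(0, 0.1, (3, 8))
b, c, s0 = np.zeros(8), np.zeros(3), np.zeros(8)
states, probs = rnn_forward([rng.normal(size=4) for _ in range(5)], s0, U, W, V, b, c)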

Computing the Gradient in a Recurrent Neural Network. Using the generalized back-propagation algorithm, one can obtain the so-called Back-Propagation Through Time (BPTT) algorithm.
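As a rough illustration of BPTT (a simplified sketch, not the exact derivation from the slides), the code below places a squared loss on the final output of the tanh RNN above and accumulates the gradient of the shared recurrent matrix W by stepping backwards through time; the final-step-only loss and the squared error are assumptions made to keep the example short.

import numpy as np

def bptt_grad_W(x_seq, target, s0, U, W, V, b, c):
    """BPTT for a tanh RNN with a squared loss on the final output only."""
    # Forward pass, storing every state.
    states = [s0]
    for x_t in x_seq:
        states.append(np.tanh(b + W @ states[-1] + U @ x_t))
    o_T = c + V @ states[-1]
    loss = 0.5 * np.sum((o_T - target) ** 2)

    # Backward pass: propagate dL/ds_t from t = T back to t = 1.
    dW = np.zeros_like(W)
    ds = V.T @ (o_T - target)                # dL/ds_T
    for t in range(len(x_seq), 0, -1):
        da = (1.0 - states[t] ** 2) * ds     # through tanh: ds_t/da_t = 1 - s_t^2
        dW += np.outer(da, states[t - 1])    # a_t = b + W s_{t-1} + U x_t, so dL/dW gets da s_{t-1}^T
        ds = W.T @ da                        # pass the gradient back to s_{t-1}
    return loss, dW

# Reusing the toy sizes from the forward-pass sketch.
rng = np.random.default_rng(0)
U, W, V = rng.normal(0, 0.1, (8, 4)), rng.normal(0, 0.1, (8, 8)), rng.normal(0, 0.1, (3, 8))
loss, dW = bptt_grad_W([rng.normal(size=4) for _ in range(5)], np.ones(3),
                       np.zeros(8), U, W, V, np.zeros(8), np.zeros(3))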

Exploding or Vanishing Product of Jacobians. In recurrent nets (and also in very deep nets), the final output is the composition of a large number of non-linear transformations. Even though each of these non-linear transformations is smooth, their composition might not be: the derivatives through the whole composition tend to be either very small or very large.

Exploding or Vanishing Product of Jacobians. The Jacobian (matrix of derivatives) of a composition is the product of the Jacobians of each stage. If
f = f_T ∘ f_{T-1} ∘ ... ∘ f_2 ∘ f_1, where (f ∘ g)(x) = f(g(x)),
then
(f ∘ g)'(x) = (f' ∘ g)(x) g'(x) = f'(g(x)) g'(x).

Exploding or Vanishing Product of Jacobians. The Jacobian matrix of derivatives of f(x) with respect to its input vector x is
f' = f'_T f'_{T-1} ... f'_2 f'_1,
where
f' = ∂f(x)/∂x and f'_t = ∂f_t(a_t)/∂a_t, with a_t = f_{t-1}(f_{t-2}(... f_2(f_1(x)))).

Exploding or Vanishing Product of Jacobians. Simple example: suppose all the numbers in the product are scalars with the same value α. Multiplying many such numbers together gives something either very large or very small. As T goes to infinity: α^T goes to infinity if α > 1, and α^T goes to 0 if α < 1.
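A short numeric check of this behaviour, with α = 1.1, α = 0.9 and T = 100 chosen arbitrarily:

# Repeated multiplication by a single scalar alpha (values assumed for illustration).
T = 100
for alpha in (1.1, 0.9):
    print(f"alpha = {alpha}: alpha**T = {alpha ** T:.3e}")
# alpha = 1.1 gives about 1.4e+04 (explodes); alpha = 0.9 gives about 2.7e-05 (vanishes).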

Difficulty of Learning Long-Term Dependencies. Consider a general dynamical system
s_t = f_θ(s_{t-1}, x_t), o_t = g_w(s_t).
A loss L_t is computed at time step t as a function of o_t and some target y_t. At time T:
∂L_T/∂θ = Σ_{t ≤ T} (∂L_T/∂s_t)(∂s_t/∂θ) = Σ_{t ≤ T} (∂L_T/∂s_T)(∂s_T/∂s_t)(∂f_θ(s_{t-1}, x_t)/∂θ),
where
∂s_T/∂s_t = (∂s_T/∂s_{T-1})(∂s_{T-1}/∂s_{T-2}) ... (∂s_{t+1}/∂s_t).

Facing the challenge Gradients propagated over many stages tend to either vanish (most of the time) or explode.

Echo State Networks. Set the recurrent and input weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and learn only the output weights.
s_t = σ(W s_{t-1} + U x_t)

Echo State Networks. If a change Δs in the state at time t is aligned with an eigenvector v of the Jacobian J with eigenvalue λ > 1, then the small change Δs becomes λ Δs after one time step, and λ^t Δs after t time steps. If the largest eigenvalue satisfies λ < 1, the map from t to t + 1 is contractive, and the network forgets information about the long-term past. So: set the weights to make the Jacobians slightly contractive.
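One simple way to realize this in practice, sketched below with assumed sizes and an assumed target spectral radius of 0.9, is to rescale a random recurrent weight matrix so that the magnitude of its largest eigenvalue sits slightly below 1; only the output weights would then be trained.

import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_in = 100, 10

# Random recurrent and input weights; in an echo state network these are never trained.
W = rng.normal(size=(n_hidden, n_hidden))
U = rng.normal(size=(n_hidden, n_in))

# Rescale W so its spectral radius (largest |eigenvalue|) is slightly below 1,
# making the state-to-state map roughly contractive.
target_radius = 0.9
W *= target_radius / np.max(np.abs(np.linalg.eigvals(W)))

def esn_step(s, x):
    """One step of the echo state recurrence s_t = sigma(W s_{t-1} + U x_t)."""
    return np.tanh(W @ s + U @ x)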

Long delays. Use recurrent connections with long delays. (Figure from Bengio et al., 2015.)

Leaky Units. Recall that s_t = σ(W s_{t-1} + U x_t). Consider instead
s_{t,i} = (1 - 1/τ_i) s_{t-1,i} + (1/τ_i) σ(W s_{t-1} + U x_t)_i
τ_i = 1: ordinary RNN.
τ_i > 1: gradients propagate more easily.
τ_i >> 1: the state changes very slowly, integrating the past values associated with the input sequence.
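A minimal sketch of the leaky update, assuming tanh for σ and arbitrary per-unit time constants τ_i:

import numpy as np

def leaky_update(s_prev, x, W, U, tau):
    """Leaky units: unit i keeps a fraction (1 - 1/tau_i) of its previous value
    and mixes in the new activation with weight 1/tau_i."""
    new = np.tanh(W @ s_prev + U @ x)         # sigma(W s_{t-1} + U x_t), tanh assumed
    return (1.0 - 1.0 / tau) * s_prev + (1.0 / tau) * new

# Units with larger tau change more slowly and so integrate over longer horizons.
rng = np.random.default_rng(0)
W, U = rng.normal(0, 0.1, (4, 4)), rng.normal(0, 0.1, (4, 3))
tau = np.array([1.0, 2.0, 10.0, 100.0])       # assumed per-unit time constants
s = leaky_update(np.zeros(4), rng.normal(size=3), W, U, tau)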

Gated RNNs. It might be useful for the neural network to forget the old state in some cases. Example: a a b b b a a a a b a b. In other cases it might be useful to keep the memory of the past. Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it.

Gated RNNs: the Long Short-Term Memory. The Long Short-Term Memory (LSTM) algorithm was proposed in 1997 (Hochreiter and Schmidhuber, 1997). Several variants of the LSTM are found in the literature: Hochreiter and Schmidhuber, 1997; Graves, 2012; Graves et al., 2013; Sutskever et al., 2014. The principle is always to have a linear self-loop through which gradients can flow for a long duration.

Gated Recurrent Units (GRU). A more recent gated RNN, the Gated Recurrent Unit (GRU), was proposed in 2014: Cho et al., 2014; Chung et al., 2014, 2015; Jozefowicz et al., 2015; Chrupala et al., 2015.

Gated RNNs

Gated Recurrent Units (GRU). A standard RNN computes the hidden layer at the next time step directly:
h_t = f(W h_{t-1} + U x_t)
A GRU first computes an update gate (another layer) based on the current input vector and the hidden state:
z_t = σ(W^(z) x_t + U^(z) h_{t-1})
and computes a reset gate similarly, but with different weights:
r_t = σ(W^(r) x_t + U^(r) h_{t-1})
Credit: Richard Socher

Gated Recurrent Units (GRU).
Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t-1})
Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t-1})
New memory content: h̃_t = tanh(W x_t + r_t ∘ U h_{t-1}). If the reset gate is 0, this ignores the previous memory and stores only the new input information.
Final memory at time step t combines the current and previous time steps: h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t

Gated Recurrent Units (GRU).
Update gate: z_t = σ(W^(z) x_t + U^(z) h_{t-1})
Reset gate: r_t = σ(W^(r) x_t + U^(r) h_{t-1})
New memory content: h̃_t = tanh(W x_t + r_t ∘ U h_{t-1})
Final memory: h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t

Gated Recurrent Units (GRU).
z_t = σ(W^(z) x_t + U^(z) h_{t-1})
r_t = σ(W^(r) x_t + U^(r) h_{t-1})
h̃_t = tanh(W x_t + r_t ∘ U h_{t-1})
h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ h̃_t
If the reset gate is close to 0, the unit ignores the previous hidden state, allowing the model to drop information that is irrelevant. The update gate z controls how much of the past state should matter now; if z is close to 1, the information in that unit can be copied through many time steps. Units with short-term dependencies often have very active reset gates.
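Putting the four equations together, here is a minimal NumPy sketch of one GRU step; the bias terms are omitted and all sizes and names are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wz, Uz, Wr, Ur, W, U):
    """One GRU step following the equations above (biases omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)            # update gate z_t
    r = sigmoid(Wr @ x + Ur @ h_prev)            # reset gate r_t
    h_tilde = np.tanh(W @ x + r * (U @ h_prev))  # new memory content
    return z * h_prev + (1.0 - z) * h_tilde      # final memory h_t

# Toy sizes (assumed): 3-dim input, 5 hidden units.
rng = np.random.default_rng(0)
Wz, Wr, W = (rng.normal(0, 0.1, (5, 3)) for _ in range(3))
Uz, Ur, U = (rng.normal(0, 0.1, (5, 5)) for _ in range(3))
h = np.zeros(5)
for x in [rng.normal(size=3) for _ in range(4)]:
    h = gru_step(h, x, Wz, Uz, Wr, Ur, W, U)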

The Long Short-Term Memory (LSTM). We can make the units even more complex, allowing each time step to modify:
Input gate (how much the current cell matters): i_t = σ(W^(i) x_t + U^(i) h_{t-1})
Forget gate (if 0, forget the past): f_t = σ(W^(f) x_t + U^(f) h_{t-1})
Output gate (how much the cell is exposed): o_t = σ(W^(o) x_t + U^(o) h_{t-1})
New memory cell: c̃_t = tanh(W^(c) x_t + U^(c) h_{t-1})
Final memory cell: c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
Final hidden state: h_t = o_t ∘ tanh(c_t)
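A corresponding sketch of one LSTM step, again with biases omitted and toy sizes assumed; note the linear self-loop through c_t, which is what lets gradients flow over long durations.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc):
    """One LSTM step following the equations above (biases omitted for brevity)."""
    i = sigmoid(Wi @ x + Ui @ h_prev)        # input gate
    f = sigmoid(Wf @ x + Uf @ h_prev)        # forget gate
    o = sigmoid(Wo @ x + Uo @ h_prev)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev)  # new (candidate) memory cell
    c = f * c_prev + i * c_tilde             # final memory cell: linear self-loop
    h = o * np.tanh(c)                       # final hidden state
    return h, c

# Toy sizes (assumed): 3-dim input, 5 hidden units.
rng = np.random.default_rng(0)
Wi, Wf, Wo, Wc = (rng.normal(0, 0.1, (5, 3)) for _ in range(4))
Ui, Uf, Uo, Uc = (rng.normal(0, 0.1, (5, 5)) for _ in range(4))
h, c = np.zeros(5), np.zeros(5)
for x in [rng.normal(size=3) for _ in range(4)]:
    h, c = lstm_step(h, c, x, Wi, Ui, Wf, Uf, Wo, Uo, Wc, Uc)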

Clipping Gradients. Strongly non-linear functions tend to have derivatives that can be either very large or very small in magnitude. (Figure: Pascanu et al., 2013.)

Clipping Gradients. Simple solutions for clipping the gradient (Mikolov, 2012; Pascanu et al., 2013):
Clip the parameter gradient from a minibatch element-wise (Mikolov, 2012) just before the parameter update.
Clip the norm ||g|| of the gradient g (Pascanu et al., 2013a) just before the parameter update.
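Both variants amount to a one-line transformation of the gradient applied just before the parameter update; a minimal sketch, with the thresholds chosen arbitrarily:

import numpy as np

def clip_elementwise(grad, threshold):
    """Element-wise clipping (Mikolov, 2012): bound every component of the gradient."""
    return np.clip(grad, -threshold, threshold)

def clip_by_norm(grad, threshold):
    """Norm clipping (Pascanu et al., 2013): rescale g when ||g|| exceeds the threshold,
    keeping the direction but bounding the length."""
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

# Applied just before the update, e.g. theta -= learning_rate * clip_by_norm(g, 1.0).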