Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook


Recap
- Standard RNNs
- Training: Backpropagation Through Time (BPTT)
- Application to sequence modeling: language modeling
- Applications: automatic speech recognition, machine translation
- Main problems in training

Major Shortcomings
- Handling of complex non-linear interactions
- Difficulties using BPTT to capture long-term dependencies
  - Exploding gradients
  - Vanishing gradients

Handling Non-Linear Interactions

Handling Non-Linear Interactions
- Stacked (deep) RNNs have depth not only in the temporal dimension but also in space, i.e. multiple recurrent layers at each time step (a minimal sketch follows below)
- Empirically shown to provide significant improvements on tasks like ASR and unsupervised training using videos
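A minimal NumPy sketch of one time step of such a stacked RNN; the (W, U, b) parameter layout and the tanh nonlinearity are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def deep_rnn_step(x_t, h_prevs, params):
    """One time step of a stacked ("deep in space") RNN: the hidden state of
    layer l-1 at time t is fed as input to layer l at the same time step."""
    new_h = []
    inp = x_t
    for (W, U, b), h_prev in zip(params, h_prevs):
        h = np.tanh(W @ inp + U @ h_prev + b)  # Elman-style update per layer (assumption)
        new_h.append(h)
        inp = h                                # feed the activation upward to the next layer
    return new_h
```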

Handling Non-Linear Interactions
- Gated RNNs shown to work on character-based language modeling
Sutskever et al., 2011: Generating Text with Recurrent Neural Networks

Training: Exploding Gradients
- Gradient clipping during BPTT (a minimal sketch follows below)
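A minimal NumPy sketch of the clipping step; the threshold value and the global-norm variant are assumptions, the slides only name the technique:

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)
        grads = [g * scale for g in grads]
    return grads
```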

Training: Vanishing Gradients
Multiple schools of thought:
- Better initialization of the recurrent matrix and using momentum during training (Sutskever et al.: On the Importance of Initialization and Momentum in Deep Learning)
- Modifying the architecture

Structurally Constrained RNNs
[Figure: standard recurrent network with input x_t, hidden state h_t and output y_t, connected by matrices A (input), R (recurrent) and U (output)]
Mikolov et al., 2015: Learning Longer Memory in Recurrent Neural Networks

Structurally Constrained RNNs
[Figure: the SCRN adds a slowly changing context state s_t (matrices B, P, V) alongside the standard hidden state h_t (matrices A, R, U)]
s_t = (1 - α) B x_t + α s_{t-1}
h_t = σ(P s_t + A x_t + R h_{t-1})
y_t = f(U h_t + V s_t)
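A minimal NumPy sketch of one SCRN step, transcribed from the equations above; the value of α and taking the output function f to be a softmax are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scrn_step(x_t, s_prev, h_prev, A, B, P, R, U, V, alpha=0.95):
    """One SCRN step: slow context state s_t, fast hidden state h_t, output y_t."""
    s_t = (1.0 - alpha) * (B @ x_t) + alpha * s_prev   # slowly changing context units
    h_t = sigmoid(P @ s_t + A @ x_t + R @ h_prev)      # standard fast hidden units
    logits = U @ h_t + V @ s_t                         # y_t = f(U h_t + V s_t)
    y_t = np.exp(logits - logits.max())
    y_t /= y_t.sum()                                   # f taken to be a softmax (assumption)
    return s_t, h_t, y_t
```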

Structurally Constrained RNNs: Language Modeling on the Penn Treebank Corpus

Model          #hidden  #context  Validation Perplexity  Test Perplexity
Ngram          -        -         -                      141
Ngram + cache  -        -         -                      125
SRN            50       -         153                    144
SRN            100      -         137                    129
SRN            300      -         133                    129
LSTM           50       -         129                    123
LSTM           100      -         120                    115
LSTM           300      -         123                    119
SCRN           40       10        133                    127
SCRN           90       10        124                    119
SCRN           100      40        120                    115
SCRN           300      40        120                    115

Structurally Constrained RNNs: Language Modeling on the Text8 Corpus
Table 3: Structurally constrained recurrent nets: perplexity for various sizes of the contextual layer.

Model  #hidden  context=0  context=10  context=20  context=40  context=80
SCRN   100      245        215         201         189         184
SCRN   300      202        182         172         165         164
SCRN   500      184        177         166         162         161

Long Short-Term Memory (LSTM)
- Recently gained a lot of popularity
- Has explicit memory cells to store short-term activations
- The presence of additional gates partly alleviates the vanishing gradient problem
- Multi-layer versions shown to work quite well on tasks with medium-term dependencies
Hochreiter et al., 1997: Long Short-Term Memory

Long Short-Term Memory (LSTM)
[Figure: starting point, the standard recurrent cell with input x_t, hidden state h_t, output y_t and matrices A, R, U]
Hochreiter et al., 1997: Long Short-Term Memory

Long Short-Term Memory (LSTM)
[Figure: LSTM cell without a forget gate; the cell state has a constant (weight 1.0) self-connection, and the input gate i_t, output gate o_t and candidate g_t are computed from x_t and h_{t-1}]
c_t = c_{t-1} + g_t ⊙ i_t
h_t = c_t ⊙ o_t
Hochreiter et al., 1997: Long Short-Term Memory

Long Short-Term Memory (LSTM)
[Figure: LSTM cell with a forget gate f_t added, also computed from x_t and h_{t-1}]
c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t
h_t = c_t ⊙ o_t
Hochreiter et al., 1997: Long Short-Term Memory

Long Short-Term Memory (LSTM)
[Figure: LSTM cell with peep-hole connections, letting the gates also see the cell state]
c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t
h_t = c_t ⊙ o_t
Hochreiter et al., 1997: Long Short-Term Memory
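A minimal NumPy sketch of one LSTM step following the equations on these slides (forget gate, no peep-holes); the stacked parameter layout W, U, b and the gate ordering are implementation assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: gates i, f, o and candidate g from stacked parameters."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget and output gates
    g = np.tanh(g)                                # candidate cell input g_t
    c_t = f * c_prev + g * i                      # c_t = f_t ⊙ c_{t-1} + g_t ⊙ i_t
    h_t = c_t * o                                 # h_t = c_t ⊙ o_t, as on the slide
                                                  # (many variants use o_t ⊙ tanh(c_t))
    return h_t, c_t
```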

LSTM Training Backpropagation Through Time: BPTT

Deep LSTMs

Bi-Directional LSTMs

Applications of LSTMs

Automatic Speech Recognition
- Use bi-directional LSTMs to represent the audio sequence (a minimal sketch follows below)
- Plug a classifier on top of the representation to directly predict phone classes
Graves et al., 2014: Speech Recognition with Deep Recurrent Neural Networks
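A minimal NumPy sketch of how a bi-directional recurrent representation can be assembled per frame; the generic step_fn interface is an illustrative assumption (any recurrent cell, e.g. the LSTM step above, could play that role):

```python
import numpy as np

def bidirectional_features(X, step_fn, h0_fwd, h0_bwd):
    """Concatenate hidden states from a left-to-right and a right-to-left pass
    over the audio frames X; step_fn(x_t, h_prev) -> h_t is any recurrent cell."""
    T = len(X)
    fwd, bwd = [None] * T, [None] * T
    h = h0_fwd
    for t in range(T):                 # forward pass
        h = step_fn(X[t], h)
        fwd[t] = h
    h = h0_bwd
    for t in reversed(range(T)):       # backward pass
        h = step_fn(X[t], h)
        bwd[t] = h
    # the concatenated per-frame states are what the phone classifier sits on
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```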

Automatic Speech Recognition
Table 1. TIMIT Phoneme Recognition Results. Epochs is the number of passes through the training set before convergence. PER is the phoneme error rate on the core test set.

NETWORK            WEIGHTS  EPOCHS  PER
CTC-3L-500H-TANH   3.7M     107     37.6%
CTC-1L-250H        0.8M     82      23.9%
CTC-1L-622H        3.8M     87      23.0%
CTC-2L-250H        2.3M     55      21.0%
CTC-3L-421H-UNI    3.8M     115     19.6%
CTC-3L-250H        3.8M     124     18.6%
CTC-5L-250H        6.8M     150     18.4%
TRANS-3L-250H      4.3M     112     18.3%
PRETRANS-3L-250H   4.3M     144     17.7%

Graves et al., 2014: Speech Recognition with Deep Recurrent Neural Networks

Sequence to Sequence Learning
[Figure: the encoder reads the input sequence A B C; after an end-of-sequence symbol the decoder emits the output sequence W X Y Z]
Applications (a minimal decoding sketch follows below):
- Machine translation
- Short text response generation
- Sentence summarization
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks
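A minimal NumPy sketch of encoder-decoder decoding in the spirit of the slide; all function names, the greedy (beam size 1) search and max_len are illustrative assumptions, not the paper's actual beam-search setup:

```python
import numpy as np

def seq2seq_greedy(src, encode_step, decode_step, readout, h0, sos_id, eos_id, max_len=50):
    """Encode the source with one RNN, hand its final state to a decoder RNN,
    and greedily emit target tokens until the end-of-sequence symbol."""
    h = h0
    for x_t in src:                       # encode the (possibly reversed) source sequence
        h = encode_step(x_t, h)
    out, y = [], sos_id
    for _ in range(max_len):
        h = decode_step(y, h)
        y = int(np.argmax(readout(h)))    # pick the most likely next token
        if y == eos_id:
            break
        out.append(y)
    return out
```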

Sequence to Sequence Learning

Method                                       Test BLEU score (ntst14)
Bahdanau et al. [2]                          28.45
Baseline System [29]                         33.30
Single forward LSTM, beam size 12            26.17
Single reversed LSTM, beam size 12           30.59
Ensemble of 5 reversed LSTMs, beam size 1    33.00
Ensemble of 2 reversed LSTMs, beam size 12   33.27
Ensemble of 5 reversed LSTMs, beam size 2    34.50
Ensemble of 5 reversed LSTMs, beam size 12   34.81

State-of-the-art WMT'14 result: 37.0
Sutskever et al., 2014: Sequence to Sequence Learning with Neural Networks

Unsupervised Training on Video: Auto-encoder Model
[Figure: an encoder LSTM (weights W_1) reads the input frames v_1, v_2, v_3; its final state, the learned representation, is copied into a decoder LSTM (weights W_2) that reconstructs the frames in reverse order as v̂_3, v̂_2, v̂_1]
Srivastava et al., 2014: Unsupervised Learning of Video Representations using LSTMs

Unsupervised Training on Video: Future Frame Predictor Model
[Figure: the encoder LSTM (weights W_1) reads frames v_1, v_2, v_3; its final state, the learned representation, is copied into a predictor LSTM (weights W_2) that generates the future frames v̂_4, v̂_5, v̂_6]
Srivastava et al., 2014: Unsupervised Learning of Video Representations using LSTMs

Unsupervised Training on Video: Composite Model
[Figure: a single encoder LSTM (weights W_1) reads the input frames v_1, v_2, v_3; its state, the learned representation, is copied into two decoders: one (weights W_2) reconstructs the input frames as v̂_3, v̂_2, v̂_1, the other (weights W_3) predicts the future frames v̂_4, v̂_5, v̂_6]
Srivastava et al., 2014: Unsupervised Learning of Video Representations using LSTMs

Gated Recurrent Units
[Figure: illustration of the GRU, with reset gate r and update gate z controlling the flow between the input, the candidate activation h̃ and the output]
Update gate: z_t^j = σ(W_z x_t + U_z h_{t-1})^j
Reset gate: r_t^j = σ(W_r x_t + U_r h_{t-1})^j
Candidate activation: h̃_t^j = tanh(W x_t + U (r_t ⊙ h_{t-1}))^j
h_t^j = (1 - z_t^j) h_{t-1}^j + z_t^j h̃_t^j
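A minimal NumPy sketch of one GRU step, written directly from the equations above (bias terms omitted, as on the slide); the parameter names mirror the slide's matrices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One GRU step: update gate, reset gate, candidate activation, interpolation."""
    z = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate r_t
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev))   # candidate activation h̃_t
    h_t = (1.0 - z) * h_prev + z * h_tilde          # interpolate old and candidate states
    return h_t
```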

Implementation
- Torch code available (soon!)
- Standard RNN, LSTMs, SCRNN, and other models
- GPU compatible

Open Problems
- Encoding long-term memory into RNNs
- Speeding up RNN training
- Control problems
- Language understanding