High Order LSTM/GRU

Wenjie Luo

January 19, 2016

1 Introduction

RNNs are powerful models for sequence data, but they suffer from vanishing and exploding gradients and are therefore difficult to train to capture long-range dependencies. To address this, LSTM and GRU have been proposed; they model the difference between adjacent data frames rather than the frames themselves, which allows the error to back-propagate through longer time spans without vanishing. In addition, instead of simply accumulating information, these models introduce multiple gating functions, depending on the current input and hidden state, that control whether information is taken in, passed through, or discarded.

Currently, all gating functions are sigmoid/tanh functions over a linear combination of the input x and the current hidden state h. In this work, we introduce a high-order term as a component of the gating function. We argue that it can better model the relation between the current input and the hidden state, and thus better control the flow of information and learn a better representation.

Figure 1: (a) LSTM [3]; (b) GRU [2].

2 Related Work

3 Model

3.1 LSTM

The basic building block of LSTM [3] is shown in Fig. 1a. It has three gates, as shown below:

i_t = σ(W_{xi} x_t + W_{hi} h_{t-1} + b_i)
f_t = σ(W_{xf} x_t + W_{hf} h_{t-1} + b_f)
o_t = σ(W_{xo} x_t + W_{ho} h_{t-1} + b_o)

Correspondingly, the update equations are:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
h_t = o_t ⊙ tanh(c_t)

where ⊙ denotes element-wise multiplication.
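For concreteness, the following is a minimal NumPy sketch of a single LSTM step implementing the gate and update equations above. The stacked weight layout, the names lstm_step and sigmoid, and the random toy parameters are choices made for this illustration, not notation from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM step. W_x: (4n, d), W_h: (4n, n), b: (4n,).
    Rows are stacked as [input, forget, output, candidate] blocks."""
    n = h_prev.shape[0]
    z = W_x @ x_t + W_h @ h_prev + b      # all pre-activations at once
    i_t = sigmoid(z[0*n:1*n])             # input gate
    f_t = sigmoid(z[1*n:2*n])             # forget gate
    o_t = sigmoid(z[2*n:3*n])             # output gate
    g_t = np.tanh(z[3*n:4*n])             # candidate cell update
    c_t = f_t * c_prev + i_t * g_t        # new cell state
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t

# toy usage with random parameters
d, n = 8, 16
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(4*n, d)), rng.normal(size=(4*n, n)), np.zeros(4*n)
h, c = np.zeros(n), np.zeros(n)
h, c = lstm_step(rng.normal(size=d), h, c, W_x, W_h, b)
```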

3.2 GRU

Alternatively, [1] developed another variant of the RNN cell, called the gated recurrent unit (GRU). [2] and [5] have shown that this type of cell performs comparably to or better than LSTM. The basic building block of GRU is shown in Fig. 1b. Its update equations are as follows:

z_t = σ(W_{xz} x_t + W_{hz} h_{t-1} + b_z)
r_t = σ(W_{xr} x_t + W_{hr} h_{t-1} + b_r)
h̃_t = tanh(W_{xh} x_t + W_{hh} (r_t ⊙ h_{t-1}) + b_h)
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t

3.3 High order variant

In LSTM and GRU, the gating functions at time t are all sigmoid functions over a linear combination of the current input x_t and the memory represented by h_{t-1}. Since gating functions are crucial for the network's performance, as shown in [4], we further introduce a high-order gating function:

g_t = σ(W_x x_t + W_h h_{t-1} + b_g + f(x_t, h_{t-1}))

where all vectors have dimension n. Here we only consider second-order information. Assuming we use m high-order kernels,

f(x_t, h_{t-1}) = W [ x_t^T w^{(1)} h_{t-1} ; x_t^T w^{(2)} h_{t-1} ; ... ; x_t^T w^{(m)} h_{t-1} ],   W ∈ R^{n×m}   (1)

where W maps the m kernel outputs to a vector of dimension n, as required. If we use a low-rank approximation, i.e. w^{(i)} = Σ_{j=1}^{r} v_j^{(i)} (u_j^{(i)})^T, we can rewrite each element of the high-order term as:

x_t^T w^{(i)} h_{t-1} = Σ_{j=1}^{r} ((v_j^{(i)})^T x_t) ((u_j^{(i)})^T h_{t-1})

As we are learning distributed feature representations, it is reasonable to set v_j^{(i)} equal to u_j^{(i)} in order to reduce the number of parameters, i.e. the high-order kernel weight matrices w^{(i)} are all symmetric. Thus we have:

x_t^T w^{(i)} h_{t-1} = ⟨V x_t, V h_{t-1}⟩,   V ∈ R^{r×n}   (2)

For each gating function, the number of parameters we introduce is n·m + r·n·m, in addition to the 2·n·n + n parameters of the linear part.
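As a sketch of how the symmetric low-rank gate of Eq. 2 could be computed, the NumPy snippet below evaluates g_t for one time step. The function name high_order_gate, the packing of the m rank-r factors V^(i) into a single tensor V, and the toy sizes are assumptions made for this illustration; the paper does not prescribe an implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def high_order_gate(x_t, h_prev, W_x, W_h, b_g, W_map, V):
    """g_t = sigmoid(W_x x_t + W_h h_prev + b_g + f(x_t, h_prev)),
    with f built from the symmetric low-rank kernels of Eq. (2).
    Shapes (all vectors have dimension n, as in the paper):
      x_t, h_prev, b_g : (n,)
      W_x, W_h         : (n, n)     linear part
      W_map            : (n, m)     maps m kernel outputs to dimension n
      V                : (m, r, n)  one rank-r factor V^(i) per kernel"""
    # second-order kernel responses <V^(i) x_t, V^(i) h_prev>, i = 1..m
    k = np.einsum('mrn,n->mr', V, x_t) * np.einsum('mrn,n->mr', V, h_prev)
    k = k.sum(axis=1)                 # shape (m,)
    f = W_map @ k                     # high-order term, shape (n,)
    return sigmoid(W_x @ x_t + W_h @ h_prev + b_g + f)

# toy usage
n, m, r = 16, 4, 3
rng = np.random.default_rng(0)
g = high_order_gate(rng.normal(size=n), rng.normal(size=n),
                    rng.normal(size=(n, n)), rng.normal(size=(n, n)), np.zeros(n),
                    rng.normal(size=(n, m)), rng.normal(size=(m, r, n)))
```

With these shapes, W_map contributes n·m parameters and the V factors contribute r·n·m, matching the count given above.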

3.4 Alternative

Alternatively, as developed in [6], the high-order term can be:

f(x_t, h_{t-1}) = W (U x_t ⊙ V h_{t-1})   (3)

where ⊙ denotes element-wise multiplication, U, V ∈ R^{m×n}, and W ∈ R^{n×m}. Correspondingly, the total number of parameters for each gating function is n·m + 2·n·m, in addition to the 2·n·n + n parameters of the linear part. The difference between Eq. 3 and Eq. 1, besides using different U and V, is that Eq. 3 uses only one high-order kernel term whereas Eq. 1 uses m high-order terms. However, Eq. 1 is not a general case of Eq. 3. Current experiments do not show much difference between them.

3.5 MLP version

We can also use a multi-layer perceptron to model the transition between hidden states. Similar architectures have been developed in [7].

4 Learning

As shown in Eq. 2, the high-order term can be represented as a fully connected layer followed by a dot-product layer. Learning can therefore be done via standard back-propagation.

5 Experiments

RNNs, with either LSTM or GRU cells, have been applied to many sequence analysis tasks, for example machine translation [9], caption generation [10], and video analysis [8].

5.1 Video Analysis

Our first experiment is on synthetic data, i.e. bouncing digits, as used in [8]. Each video sequence contains a number of frames showing one or multiple digits bouncing inside a fixed window. We experiment on three different tasks: given a few input video sequences, we try to reconstruct the input sequences themselves and also to predict some future sequences.

References

[1] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[2] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[4] Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.

[5] Rafal Jozefowicz, Wojciech Zaremba, and Ilya Sutskever. An empirical exploration of recurrent network architectures. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 2342-2350, 2015.

[6] Vincent Michalski, Roland Memisevic, and Kishore Konda. Modeling deep temporal dependencies with recurrent grammar cells. In Advances in Neural Information Processing Systems, pages 1925-1933, 2014.

[7] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.

[8] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.

[9] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112, 2014.

[10] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.