Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions

2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP'18)
Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions
Authors: S. Scardapane, S. Van Vaerenbergh, D. Comminiello, S. Totaro and A. Uncini

Contents
- Introduction: Overview
- Gated recurrent networks: Formulation
- Proposed gate with flexible sigmoid: Kernel activation function; KAF generalization for gates
- Experimental validation: Experimental setup; Results
- Conclusion and future works: Summary and future outline

Content at a glance
Setting: Gated units have become an integral part of deep learning (e.g., LSTMs, highway networks, ...).
State-of-the-art: Only a small number of studies address how to design more flexible gate architectures (e.g., Gao and Glowacka, ACML 2016).
Objective: Design an enhanced gate, with a small number of additional adaptable parameters, that can model a wider range of gating functions.

Gated unit: basic model
Definition: (vanilla) gated unit. For a generic input x we have:
g(x) = σ(Wx) ⊙ f(x),   (1)
where σ(·) is the sigmoid function, ⊙ is the element-wise multiplication, and f(x) is a generic network component.
Notable examples:
- LSTM networks (Hochreiter and Schmidhuber, 1997).
- Gated recurrent units (Cho et al., 2014).
- Highway networks (Srivastava et al., 2015).
- Neural arithmetic logic units (Trask et al., 2018).
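
As an illustration of Eq. (1), the following is a minimal PyTorch sketch of a vanilla gated unit; the class and parameter names are mine, not from the paper's code.

```python
import torch
import torch.nn as nn

# Minimal sketch of the vanilla gated unit in Eq. (1): g(x) = sigmoid(W x) * f(x),
# where f is an arbitrary sub-network.
class GatedUnit(nn.Module):
    def __init__(self, in_features, out_features, f):
        super().__init__()
        self.gate = nn.Linear(in_features, out_features)  # W (plus a bias term)
        self.f = f                                        # generic network component f(x)

    def forward(self, x):
        return torch.sigmoid(self.gate(x)) * self.f(x)    # element-wise product

# Example: gate a simple linear transformation, as in highway-style layers.
unit = GatedUnit(16, 32, nn.Linear(16, 32))
y = unit(torch.randn(8, 16))  # -> shape (8, 32)
```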

Gated recurrent unit (GRU)
At each time step t we receive x_t ∈ R^d and update the internal state h_{t-1} as:
u_t = σ(W_u x_t + V_u h_{t-1} + b_u),   (2)
r_t = σ(W_r x_t + V_r h_{t-1} + b_r),   (3)
h_t = (1 - u_t) ⊙ h_{t-1} + u_t ⊙ tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h),   (4)
where (2)-(3) are the update gate and reset gate.
Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
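
A minimal sketch of the recurrence in Eqs. (2)-(4), assuming PyTorch; nn.GRUCell implements the same update, this version just makes the gates explicit.

```python
import torch
import torch.nn as nn

# Sketch of the GRU update in Eqs. (2)-(4); concatenating [x_t, h_prev] into a single
# linear layer is equivalent to W x_t + V h_prev + b.
class SimpleGRUCell(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.update = nn.Linear(d_in + d_hidden, d_hidden)     # W_u, V_u, b_u
        self.reset = nn.Linear(d_in + d_hidden, d_hidden)      # W_r, V_r, b_r
        self.candidate = nn.Linear(d_in + d_hidden, d_hidden)  # W_h, U_h, b_h

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        u_t = torch.sigmoid(self.update(xh))                   # Eq. (2): update gate
        r_t = torch.sigmoid(self.reset(xh))                    # Eq. (3): reset gate
        h_tilde = torch.tanh(self.candidate(torch.cat([x_t, r_t * h_prev], dim=-1)))
        return (1.0 - u_t) * h_prev + u_t * h_tilde            # Eq. (4)

cell = SimpleGRUCell(28, 100)
h = torch.zeros(32, 100)
for x_t in torch.randn(28, 32, 28):  # 32 sequences of length 28, 28 features each
    h = cell(x_t, h)
```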

Training the network (classification)
We have N sequences {x_t^i}, i = 1, ..., N, with labels y^i ∈ {1, ..., C}. h^i is the internal state of the GRU after processing the i-th sequence. This is fed through another layer with a softmax activation function for classification:
ŷ^i = softmax(A h^i + b),   (5)
We then minimize the average cross-entropy between the real classes and the predicted classes:
J(θ) = -(1/N) ∑_{i=1}^{N} ∑_{c=1}^{C} [y^i = c] log(ŷ_c^i),   (6)
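
A brief sketch of the read-out and loss in Eqs. (5)-(6), assuming PyTorch; F.cross_entropy fuses the softmax and the log term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of Eqs. (5)-(6): a linear read-out of the final GRU state followed by the
# average cross-entropy over the mini-batch.
num_classes, hidden = 10, 100
readout = nn.Linear(hidden, num_classes)   # A and b in Eq. (5)

h_final = torch.randn(32, hidden)          # internal state after the last time step
labels = torch.randint(0, num_classes, (32,))

logits = readout(h_final)
loss = F.cross_entropy(logits, labels)     # J(theta) in Eq. (6), averaged over the batch
loss.backward()
```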

Summary of the proposal
Key items of our proposal:
1. We maintain the linear component, but replace the element-wise sigmoid operation with a generalized sigmoid function.
2. We extend the kernel activation function (KAF), a recently proposed non-parametric activation function.
3. We modify the KAF to ensure that it behaves correctly as a gating function.

Basic structure of the KAF
A KAF models each activation function in terms of a kernel expansion over D terms as:
KAF(s) = ∑_{i=1}^{D} α_i κ(s, d_i),   (7)
where:
1. {α_i}_{i=1}^{D} are the mixing coefficients;
2. {d_i}_{i=1}^{D} are the dictionary elements;
3. κ(·, ·) : R × R → R is a 1D kernel function.
Scardapane, S., Van Vaerenbergh, S., Totaro, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
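
A compact sketch of the kernel expansion in Eq. (7) with a Gaussian kernel; the function and parameter names are illustrative, not taken from the authors' code.

```python
import torch

# Sketch of Eq. (7) with a Gaussian kernel kappa(s, d_i) = exp(-gamma * (s - d_i)^2).
def kaf(s, alpha, dictionary, gamma=1.0):
    diff = s.unsqueeze(-1) - dictionary           # (..., D) pairwise differences
    kernel = torch.exp(-gamma * diff.pow(2))      # Gaussian kernel values
    return (kernel * alpha).sum(dim=-1)           # mixture over the D dictionary terms

dictionary = torch.linspace(-4.0, 4.0, 10)        # fixed, equispaced dictionary elements
alpha = torch.randn(10, requires_grad=True)       # mixing coefficients, adapted by training
out = kaf(torch.randn(32, 100), alpha, dictionary)  # same shape as the input activations
```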

Extending KAFs for gated units
We cannot use a KAF straightforwardly as a gate, because it is unbounded and can potentially vanish to zero (e.g., with the Gaussian kernel). We use the following modified formulation for the flexible gate:
σ_KAF(s) = σ((1/2) KAF(s) + (1/2) s).   (8)
As in the original KAF, the dictionary elements are fixed (by uniform sampling around 0), while we adapt everything else.
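
A self-contained sketch of the flexible gate in Eq. (8), assuming a Gaussian kernel; names are illustrative.

```python
import torch

# Sketch of Eq. (8): the kernel expansion of Eq. (7) plus a residual term (1/2)s,
# squashed by a standard sigmoid so the gate output always stays in (0, 1).
def sigma_kaf(s, alpha, dictionary, gamma=1.0):
    kernel = torch.exp(-gamma * (s.unsqueeze(-1) - dictionary).pow(2))  # Gaussian kernel
    kaf = (kernel * alpha).sum(dim=-1)                                  # Eq. (7)
    return torch.sigmoid(0.5 * kaf + 0.5 * s)                           # Eq. (8)

# Drop-in replacement for the sigmoid in the gates of Eqs. (2)-(3), e.g.
# u_t = sigma_kaf(pre_activation, alpha_u, dictionary).
dictionary = torch.linspace(-4.0, 4.0, 10)
alpha = torch.zeros(10, requires_grad=True)
gate = sigma_kaf(torch.randn(32, 100), alpha, dictionary)  # values in (0, 1)
```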

Visualizing the new gates
Figure 1: Random samples of the proposed flexible gates with Gaussian kernel and different hyperparameters (panels: (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1; x-axis: activation, y-axis: value of the gate).

Initializing the mixing coefficients
To simplify optimization we initialize the mixing coefficients to approximate the identity function:
α = (K + εI)^{-1} d,   (9)
where ε > 0 is a small constant and K is the kernel matrix evaluated on the dictionary elements. We then use a different set of mixing coefficients for each update gate and reset gate.
[Plot: gate output vs. activation after initialization, closely following the identity over the dictionary range.]
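
A sketch of the initialization in Eq. (9), assuming a Gaussian kernel and a small regularization constant (e.g. 1e-4, my choice, not stated on the slide).

```python
import torch

# Sketch of Eq. (9): regularized least-squares fit of the mixing coefficients so that
# the kernel expansion reproduces the dictionary values, i.e. the gate starts out
# close to the identity function.
def init_alpha(dictionary, gamma=1.0, eps=1e-4):
    diff = dictionary.unsqueeze(0) - dictionary.unsqueeze(1)   # pairwise differences
    K = torch.exp(-gamma * diff.pow(2))                        # kernel matrix on the dictionary
    return torch.linalg.solve(K + eps * torch.eye(len(dictionary)), dictionary)

dictionary = torch.linspace(-4.0, 4.0, 10)
alpha0 = init_alpha(dictionary)  # initial mixing coefficients for each gate
```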

Sequential MNIST benchmark
- Row-wise MNIST (R-MNIST): each image is processed sequentially, row by row, i.e., we have sequences of length 28, each element given by the values of 28 pixels.
- Pixel-wise MNIST (P-MNIST): each image is represented as a sequence of 784 pixels, read from left to right and from top to bottom of the original image.
- Permuted P-MNIST (PP-MNIST): similar to P-MNIST, but the order of the pixels is shuffled using a (fixed) permutation matrix.
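
For concreteness, a small sketch of how images can be turned into the three sequence formats; the helper name and the (N, 28, 28) input layout are assumptions, not from the paper.

```python
import torch

# Sketch of the three sequential-MNIST variants; `images` is a float tensor of
# shape (N, 28, 28), e.g. obtained from torchvision.datasets.MNIST.
def to_sequences(images, variant="R-MNIST", perm=None):
    n = images.shape[0]
    if variant == "R-MNIST":
        return images                      # (N, 28, 28): 28 steps, 28 pixels per step
    flat = images.reshape(n, 784, 1)       # (N, 784, 1): one pixel per step
    if variant == "P-MNIST":
        return flat
    if variant == "PP-MNIST":
        return flat[:, perm, :]            # same fixed permutation for every image
    raise ValueError(variant)

images = torch.rand(64, 28, 28)
perm = torch.randperm(784)                 # drawn once and reused for the whole dataset
seqs = to_sequences(images, "PP-MNIST", perm)  # -> (64, 784, 1)
```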

Models and hyperparameters
1. We compare standard GRUs and GRUs with the proposed flexible gating function.
2. GRUs have 100 units, and we include an additional batch normalization step to stabilize training.
3. We train with Adam on mini-batches of 32 elements, with an initial learning rate of 0.001, and we clip all gradient updates (in norm) to 1.0.
4. For the proposed gate, we use the Gaussian kernel and initialize the dictionary with 10 elements equispaced in [-4.0, 4.0].
5. We compute the average accuracy of the model on the validation set every 25 iterations, stopping whenever the accuracy has not improved for at least 500 iterations.
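
The training loop below is a toy sketch of this setup on synthetic data (the GRU plus linear read-out stands in for the full model, and the "validation" step is a placeholder); only the hyperparameter values are taken from the slide.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of the training protocol: Adam, mini-batches of 32, learning rate 0.001,
# gradient-norm clipping at 1.0, validation every 25 iterations, patience of 500.
gru = nn.GRU(input_size=28, hidden_size=100, batch_first=True)
readout = nn.Linear(100, 10)
params = list(gru.parameters()) + list(readout.parameters())
optimizer = torch.optim.Adam(params, lr=0.001)

best_acc, best_iter = 0.0, 0
for it in range(200):
    x = torch.rand(32, 28, 28)                   # synthetic stand-in for an R-MNIST batch
    y = torch.randint(0, 10, (32,))
    _, h = gru(x)                                # h: (1, 32, 100), state after the last step
    loss = F.cross_entropy(readout(h[-1]), y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, 1.0)  # clip gradient updates in norm to 1.0
    optimizer.step()

    if it % 25 == 0:                             # evaluate every 25 iterations
        acc = (readout(h[-1]).argmax(-1) == y).float().mean().item()  # placeholder "validation"
        if acc > best_acc:
            best_acc, best_iter = acc, it
        elif it - best_iter >= 500:              # stop if no improvement for 500 iterations
            break
```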

Accuracy on the test set

Dataset     GRU (standard)    GRU (proposed)
R-MNIST     98.29 ± 0.01      98.67 ± 0.02
P-MNIST     89.50 ± 5.64      97.34 ± 0.61
PP-MNIST    86.41 ± 6.71      96.10 ± 0.93

Table 1: Average test accuracy (with standard deviation) obtained by a standard GRU compared with a GRU endowed with the proposed flexible gates.

Evolution of the loss and validation accuracy
Figure 2: Convergence results on the P-MNIST dataset for a standard GRU and the proposed GRU: (a) training loss, (b) validation accuracy (x-axis: epoch).

Distribution of the kernel's bandwidths
Figure 3: Sample histogram of the values of the kernel hyperparameter γ after training, for the reset gate of the GRU (x-axis: value of γ, y-axis: number of cells).

Ablation study
Figure 4: Average results (test accuracy) of an ablation study on the R-MNIST dataset, comparing the variants Rand+No-Residual, No-Residual, Rand, and Normal. Rand: we initialize the mixing coefficients randomly. No-Residual: we remove the residual connection in (8). A dashed red line shows the performance of a standard GRU.

Summary
We proposed an extension of the standard gating component used in most gated RNNs. To this end, we extended the kernel activation function so that its shape always remains consistent with a sigmoid-like behavior. Experiments show that the proposed architecture achieves superior results (in terms of test accuracy), while at the same time converging faster (and more reliably). Future work includes experiments with other gated RNNs, further applications, and the interpretability of the resulting functions with respect to the task at hand.

Questions?