2018 IEEE International Workshop on Machine Learning for Signal Processing (MLSP 2018)
Recurrent Neural Networks with Flexible Gates using Kernel Activation Functions
Authors: S. Scardapane, S. Van Vaerenbergh, D. Comminiello, S. Totaro and A. Uncini
Contents
1. Introduction: Overview
2. Gated recurrent networks: Formulation
3. Proposed gate with flexible sigmoid: Kernel activation function; KAF generalization for gates
4. Experimental validation: Experimental setup; Results
5. Conclusion and future works: Summary and future outline
Content at a glance
- Setting: Gated units have become an integral part of deep learning (e.g., LSTMs, highway networks, ...).
- State of the art: only a small number of studies on how to design more flexible gate architectures (e.g., Gao and Glowacka, ACML 2016).
- Objective: design an enhanced gate, with a small number of additional adaptable parameters, to model a wider range of gating functions.
Gated recurrent networks
Gated unit: basic model
Definition: (vanilla) gated unit. For a generic input $\mathbf{x}$ we have:
$$g(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x}) \odot f(\mathbf{x}), \qquad (1)$$
where $\sigma(\cdot)$ is the sigmoid function, $\odot$ is the element-wise multiplication, and $f(\mathbf{x})$ is a generic network component.
Notable examples:
- LSTM networks (Hochreiter and Schmidhuber, 1997).
- Gated recurrent units (Cho et al., 2014).
- Highway networks (Srivastava et al., 2015).
- Neural arithmetic logic units (Trask et al., 2018).
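As a minimal sketch (not the authors' code), a vanilla gated unit following (1) can be written in PyTorch as below; the linear map playing the role of $\mathbf{W}$ and the placeholder choice of $f(\cdot)$ as a single linear layer are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GatedUnit(nn.Module):
    """Minimal sketch of a vanilla gated unit: g(x) = sigmoid(W x) * f(x)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.gate = nn.Linear(in_features, out_features)  # W (plus a bias term)
        # f(.) is a generic network component; a single linear layer is used as a placeholder.
        self.f = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sigmoid(self.gate(x)) * self.f(x)    # element-wise product, Eq. (1)

# Example usage on a random mini-batch.
x = torch.randn(8, 16)
g = GatedUnit(16, 32)
print(g(x).shape)  # torch.Size([8, 32])
```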
Gated recurrent unit (GRU)
At each time step $t$ we receive $\mathbf{x}_t \in \mathbb{R}^d$ and update the internal state $\mathbf{h}_{t-1}$ as:
$$\mathbf{u}_t = \sigma\left(\mathbf{W}_u \mathbf{x}_t + \mathbf{V}_u \mathbf{h}_{t-1} + \mathbf{b}_u\right), \qquad (2)$$
$$\mathbf{r}_t = \sigma\left(\mathbf{W}_r \mathbf{x}_t + \mathbf{V}_r \mathbf{h}_{t-1} + \mathbf{b}_r\right), \qquad (3)$$
$$\mathbf{h}_t = (1 - \mathbf{u}_t) \odot \mathbf{h}_{t-1} + \mathbf{u}_t \odot \tanh\left(\mathbf{W}_h \mathbf{x}_t + \mathbf{V}_h (\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h\right), \qquad (4)$$
where (2)-(3) are the update gate and the reset gate.
Cho, K. et al., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. EMNLP 2014.
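For concreteness, a single GRU step following (2)-(4) could look like the sketch below. This is a simplified re-implementation under our own naming conventions, not the authors' code; PyTorch's built-in `nn.GRU` provides an equivalent cell.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):
    """One GRU time step following Eqs. (2)-(4)."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        # The biases b_u, b_r, b_h are folded into the input-side linear layers.
        self.W_u = nn.Linear(input_size, hidden_size)
        self.V_u = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_r = nn.Linear(input_size, hidden_size)
        self.V_r = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_h = nn.Linear(input_size, hidden_size)
        self.V_h = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev):
        u_t = torch.sigmoid(self.W_u(x_t) + self.V_u(h_prev))         # update gate, Eq. (2)
        r_t = torch.sigmoid(self.W_r(x_t) + self.V_r(h_prev))         # reset gate, Eq. (3)
        h_tilde = torch.tanh(self.W_h(x_t) + self.V_h(r_t * h_prev))  # candidate state
        return (1.0 - u_t) * h_prev + u_t * h_tilde                   # Eq. (4)

# Example: process a length-28 sequence of 28-dimensional inputs.
cell = GRUCellSketch(28, 100)
h = torch.zeros(1, 100)
for x_t in torch.randn(28, 1, 28):
    h = cell(x_t, h)
```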
Training the network (classification)
We are given $N$ sequences $\{\mathbf{x}_t^i\}_{i=1}^N$ with labels $y^i \in \{1, \ldots, C\}$. Let $\mathbf{h}^i$ be the internal state of the GRU after processing the $i$-th sequence. This is fed through another layer with a softmax activation function for classification:
$$\hat{\mathbf{y}}^i = \mathrm{softmax}\left(\mathbf{A}\mathbf{h}^i + \mathbf{b}\right), \qquad (5)$$
We then minimize the average cross-entropy between the real classes and the predicted classes:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \left[ y^i = c \right] \log\left(\hat{y}^i_c\right), \qquad (6)$$
where $[\cdot]$ denotes the indicator function.
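A hedged sketch of the classification head and loss in (5)-(6): the final GRU state is projected by a linear layer (playing the role of $\mathbf{A}$ and $\mathbf{b}$) and trained with cross-entropy; in PyTorch the softmax of (5) and the log term of (6) are combined inside `F.cross_entropy`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 10
readout = nn.Linear(100, num_classes)      # A and b in Eq. (5)

h = torch.randn(32, 100)                   # final GRU states for a mini-batch (placeholder data)
y = torch.randint(0, num_classes, (32,))   # ground-truth labels

logits = readout(h)                        # softmax is folded into the loss below
loss = F.cross_entropy(logits, y)          # average cross-entropy, Eq. (6)
loss.backward()
```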
Proposed gate with flexible sigmoid
Summary of the proposal
Key items of our proposal:
1. We maintain the linear component, but replace the element-wise sigmoid with a generalized sigmoid function.
2. We extend the kernel activation function (KAF), a recently proposed non-parametric activation function.
3. We modify the KAF to ensure that it behaves correctly as a gating function.
Basic structure of the KAF
A KAF models each activation function in terms of a kernel expansion over $D$ terms as:
$$\mathrm{KAF}(s) = \sum_{i=1}^{D} \alpha_i \, \kappa(s, d_i), \qquad (7)$$
where:
1. $\{\alpha_i\}_{i=1}^{D}$ are the mixing coefficients;
2. $\{d_i\}_{i=1}^{D}$ are the dictionary elements;
3. $\kappa(\cdot, \cdot): \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a 1D kernel function.
Scardapane, S., Van Vaerenbergh, S., Totaro, S. and Uncini, A., 2017. Kafnets: kernel-based non-parametric activation functions for neural networks. arXiv preprint arXiv:1707.04035.
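Assuming the Gaussian kernel $\kappa(s, d_i) = \exp(-\gamma (s - d_i)^2)$, Eq. (7) can be evaluated as in the sketch below; the function name, shapes, and the value of the bandwidth are illustrative assumptions.

```python
import torch

def kaf(s, alpha, dictionary, gamma):
    """Kernel activation function, Eq. (7), with a Gaussian 1D kernel.

    s:          tensor of activations, any shape
    alpha:      (D,) mixing coefficients (trainable)
    dictionary: (D,) fixed dictionary elements
    gamma:      kernel bandwidth
    """
    K = torch.exp(-gamma * (s.unsqueeze(-1) - dictionary) ** 2)  # (..., D) kernel values
    return (K * alpha).sum(dim=-1)                               # mix and sum over the dictionary

# Example usage.
dictionary = torch.linspace(-4.0, 4.0, 10)     # D = 10 equispaced dictionary elements
alpha = torch.randn(10, requires_grad=True)    # mixing coefficients
out = kaf(torch.randn(5), alpha, dictionary, gamma=0.5)
```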
Extending KAFs for gated units
We cannot use a KAF straightforwardly, because it is unbounded and can potentially vanish to zero (e.g., with the Gaussian kernel). We use the following modified formulation for the flexible gate:
$$\sigma_{\mathrm{KAF}}(s) = \sigma\left(\frac{1}{2}\mathrm{KAF}(s) + \frac{1}{2}s\right). \qquad (8)$$
As in the original KAF, the dictionary elements are fixed (by uniform sampling around 0), while we adapt everything else.
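Under the same assumptions as the previous sketch (Gaussian kernel, illustrative names), the flexible gate (8) wraps the KAF in a residual term and an outer sigmoid, so the output stays bounded in (0, 1). Replacing $\sigma(\cdot)$ in (2)-(3) with this function, with separate mixing coefficients per gate, gives the proposed GRU variant.

```python
import torch

def sigma_kaf(s, alpha, dictionary, gamma):
    """Flexible gate of Eq. (8): sigmoid(0.5 * KAF(s) + 0.5 * s)."""
    K = torch.exp(-gamma * (s.unsqueeze(-1) - dictionary) ** 2)  # Gaussian kernel values
    kaf_s = (K * alpha).sum(dim=-1)                              # KAF(s), Eq. (7)
    return torch.sigmoid(0.5 * kaf_s + 0.5 * s)                  # residual + outer sigmoid

# Example usage.
dictionary = torch.linspace(-4.0, 4.0, 10)
alpha = torch.randn(10, requires_grad=True)
gate = sigma_kaf(torch.randn(4, 100), alpha, dictionary, gamma=0.5)  # values in (0, 1)
```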
Visualizing the new gates
Figure 1: Random samples of the proposed flexible gates with Gaussian kernel and different hyperparameters: (a) γ = 1.0, (b) γ = 0.5, (c) γ = 0.1 (each panel plots the value of the gate against the activation).
Initializing the mixing coefficients
To simplify optimization, we initialize the mixing coefficients to approximate the identity function:
$$\boldsymbol{\alpha} = (\mathbf{K} + \varepsilon \mathbf{I})^{-1} \mathbf{d}, \qquad (9)$$
where $\varepsilon > 0$ is a small constant. We then use a different set of mixing coefficients for the update gate and the reset gate.
[Plot: gate output as a function of the activation for the initialized gate.]
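A possible implementation of (9), assuming the Gaussian kernel: the targets at the dictionary points are the points themselves, so the initial KAF approximates the identity on the dictionary and, through (8), $\sigma_{\mathrm{KAF}}$ starts out close to the standard sigmoid. The values of $\gamma$ and $\varepsilon$ below are illustrative.

```python
import torch

dictionary = torch.linspace(-4.0, 4.0, 10)   # fixed dictionary elements d
gamma, eps = 0.5, 1e-6

# Kernel (Gram) matrix over the dictionary: K[i, j] = exp(-gamma * (d_i - d_j)^2).
K = torch.exp(-gamma * (dictionary[:, None] - dictionary[None, :]) ** 2)

# Eq. (9): alpha = (K + eps * I)^{-1} d, so that KAF(d_i) ~= d_i (identity on the dictionary).
alpha0 = torch.linalg.solve(K + eps * torch.eye(len(dictionary)), dictionary)
```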
Experimental validation
Sequential MNIST benchmark
- Row-wise MNIST (R-MNIST): each image is processed sequentially, row by row, i.e., we have sequences of length 28, where each element is a row of 28 pixel values.
- Pixel-wise MNIST (P-MNIST): each image is represented as a sequence of 784 pixels, read from left to right and from top to bottom in the original image.
- Permuted P-MNIST (PP-MNIST): similar to P-MNIST, but the order of the pixels is shuffled using a (fixed) permutation matrix.
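The three variants only differ in how each 28×28 image is unrolled into a sequence; a minimal sketch, assuming the image has already been loaded as a tensor:

```python
import torch

image = torch.rand(28, 28)           # one MNIST image (placeholder data)

r_mnist = image                      # R-MNIST: 28 steps, each a row of 28 pixels
p_mnist = image.reshape(784, 1)      # P-MNIST: 784 steps of a single pixel (row-major order)
perm = torch.randperm(784)           # fixed permutation, drawn once and reused for all images
pp_mnist = p_mnist[perm]             # PP-MNIST: the same pixels in a fixed, shuffled order
```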
Models and hyperparameters
1. We compare standard GRUs and GRUs with the proposed flexible gating function.
2. GRUs have 100 units, and we include an additional batch normalization step to stabilize training.
3. We train with Adam on mini-batches of 32 elements, with an initial learning rate of 0.001, and we clip all gradient updates (in norm) to 1.0.
4. For the proposed gate, we use the Gaussian kernel and initialize the dictionary with 10 elements equispaced in [-4.0, 4.0].
5. We compute the average accuracy of the model on the validation set every 25 iterations, stopping whenever the accuracy has not improved for at least 500 iterations (see the training-loop sketch below).
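The optimization protocol in items 3 and 5 can be sketched as below. The model, data loaders, `evaluate_accuracy` routine, and the `max_iters` cap are assumptions for illustration; only the optimizer, gradient clipping, and early-stopping logic mirror the list above.

```python
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, evaluate_accuracy, max_iters=10000):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # Adam, initial lr = 0.001
    best_acc, best_iter, it = 0.0, 0, 0
    while it < max_iters:
        for x, y in train_loader:                                # mini-batches of 32 sequences
            optimizer.zero_grad()
            loss = F.cross_entropy(model(x), y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradient norm to 1.0
            optimizer.step()
            it += 1
            if it % 25 == 0:                                     # validate every 25 iterations
                acc = evaluate_accuracy(model, val_loader)
                if acc > best_acc:
                    best_acc, best_iter = acc, it
            if it - best_iter >= 500:                            # stop after 500 iterations without improvement
                return best_acc
    return best_acc
```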
Accuracy on the test set

Dataset     GRU (standard)    GRU (proposed)
R-MNIST     98.29 ± 0.01      98.67 ± 0.02
P-MNIST     89.50 ± 5.64      97.34 ± 0.61
PP-MNIST    86.41 ± 6.71      96.10 ± 0.93

Table 1: Average test accuracy obtained by a standard GRU compared with a GRU endowed with the proposed flexible gates (with standard deviation).
Evolution of the loss and validation accuracy
Figure 2: Convergence results on the P-MNIST dataset for a standard GRU and the proposed GRU: (a) training loss, (b) validation accuracy, both plotted against the training epoch.
Distribution of the kernel's bandwidths
Figure 3: Sample histogram of the values of the kernel hyperparameter γ, after training, for the reset gate of the GRU (number of cells vs. value of γ).
Ablation study
Figure 4: Average test accuracy of an ablation study on the R-MNIST dataset, comparing four variants: Normal, Rand, No-Residual, and Rand+No-Residual. Rand: we initialize the mixing coefficients randomly. No-Residual: we remove the residual connection in (8). A dashed red line shows the performance of a standard GRU.
Conclusion and future works
Summary
We proposed an extension of the standard gating component used in most gated RNNs. To this end, we extended the kernel activation function so that its shape is always consistent with a sigmoid-like behavior.
Experiments show that the proposed architecture achieves superior results (in terms of test accuracy), while at the same time converging faster (and more reliably).
Future work: more experiments with other gated RNNs, further applications, and the interpretability of the resulting functions with respect to the task at hand.
Questions?