Recurrent neural networks with trainable amplitude of activation functions

Su Lee Goh*, Danilo P. Mandic
Imperial College of Science, Technology and Medicine, London SW7 2AZ, UK

Neural Networks letter, Neural Networks 16 (2003) 1095-1100, www.elsevier.com/locate/neunet
doi:10.1016/S0893-6080(03)00139-4
Accepted 9 May 2003

Abstract

An adaptive amplitude real-time recurrent learning (AARTRL) algorithm for fully connected recurrent neural networks (RNNs) employed as nonlinear adaptive filters is proposed. Such an algorithm is beneficial when dealing with signals that have rich and unknown dynamical characteristics. Following the approach from [Trentin, E. Networks with trainable amplitude of activation functions, Neural Networks 14 (2001) 471], three different cases for the algorithm are considered: a common adaptive amplitude shared among all the neurons; a separate adaptive amplitude for each layer; and a different adaptive amplitude for each neuron. Experimental results show that the AARTRL outperforms the standard RTRL algorithm. © 2003 Elsevier Ltd. All rights reserved.

Keywords: Nonlinear adaptive filters; Recurrent neural networks; Adaptive amplitude

* Corresponding author. E-mail addresses: su.goh@imperial.ac.uk (S.L. Goh), d.mandic@imperial.ac.uk (D.P. Mandic).

1. Introduction

The most frequently applied neural network architectures are feedforward networks and recurrent networks. Feedforward (FF) networks represent static nonlinear models and can have a multilayer architecture. Recurrent networks are dynamic networks, and their structures are fundamentally different from the static ones because they contain feedback (Mandic & Chambers, 2001). Recurrent neural networks (RNNs) can yield smaller structures than nonrecursive feedforward neural networks, in the same way that infinite impulse response (IIR) filters can replace longer finite impulse response (FIR) filters. The feedback characteristics of RNNs enable them to acquire state representations, which makes them suitable for applications such as nonlinear prediction and modelling, adaptive equalisation of communication channels, speech processing, plant control and automobile engine diagnostics (Haykin, 1999). Thus, RNNs offer an alternative to dynamically driven feedforward networks (Haykin, 1999).

RNNs have traditionally been trained by the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989). Another algorithm, back-propagation through time (BPTT), can also be used to train an RNN. However, owing to the computational complexity of BPTT compared with the simpler and more efficient RTRL algorithm, we use RTRL to train the RNN here.

Trentin (2001) showed the importance of an adaptive amplitude of the activation functions in FF neural networks. Here, we build upon that concept and extend the derivation of the RTRL algorithm to include a trainable amplitude of the activation function. This is important for nonlinear activation functions, since it removes the need to standardise and rescale the input signals to match the dynamical range of the activation function. RTRL equipped with an adaptive amplitude can thus be employed where the dynamical range of the input signal is not known a priori. In this way, the adaptive amplitude real-time recurrent learning (AARTRL) algorithm is derived. The analysis is supported by examples of nonlinear prediction and by a Monte Carlo analysis, in which the AARTRL outperforms the RTRL.
2. Derivation of the fully connected recurrent neural network for the formulation of the real-time learning algorithm

The set of equations which describe the RNN given in Fig. 1 is

y_i(k) = \Phi(v_i(k)), \quad i = 1, \ldots, N    (1)

Fig. 1. Full recurrent neural network for prediction.

v_i(k) = \sum_{n=1}^{p+q+N} w_{i,n}(k)\, u_n(k)    (2)

u_i^T(k) = [s(k-1), \ldots, s(k-p), 1, y_1(k-1), \ldots, y_1(k-q), y_2(k-1), \ldots, y_N(k-1)]    (3)

where the unity element in Eq. (3) corresponds to the bias input to the neurons. The output values of the neurons are denoted by y_1, \ldots, y_N and the input samples are given by s. For the ith neuron, the weights form a (p+q+N) × 1 dimensional weight vector w_i^T = [w_{i,1}, \ldots, w_{i,p+q+N}], where p is the number of external inputs, q is the number of feedback connections (delayed outputs) of the first neuron, and N is the number of neurons in the RNN.
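
As an illustration of the model in Eqs. (1)-(3), a minimal NumPy sketch of one forward step is given below; the function name, the argument names and the assumption that the amplitude is already folded into the activation phi are ours, not the authors'.

```python
import numpy as np

def rnn_forward(s_past, y1_past, y_prev, W, phi):
    """One forward step of the fully connected RNN of Eqs. (1)-(3).

    s_past  : p past input samples [s(k-1), ..., s(k-p)]
    y1_past : q past outputs of neuron 1 [y_1(k-1), ..., y_1(k-q)]
    y_prev  : previous outputs of neurons 2..N [y_2(k-1), ..., y_N(k-1)]
    W       : N x (p+q+N) weight matrix, one row w_i per neuron
    phi     : activation function Phi (amplitude included)
    """
    # Eq. (3): augmented input vector, including the unity bias element
    u = np.concatenate([s_past, [1.0], y1_past, y_prev])
    v = W @ u        # Eq. (2): net inputs v_i(k)
    y = phi(v)       # Eq. (1): neuron outputs y_i(k)
    return y, v, u
```

The first element of y corresponds to y_1(k), the network output used for prediction.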

RTRL-based training of the RNN for nonlinear adaptive filtering is based upon minimising the instantaneous squared error at the output of the first neuron of the RNN (Williams & Zipser, 1989), which can be expressed as \min(e^2(k)) = \min([s(k) - y_1(k)]^2), where e(k) denotes the error at the output of the RNN and s(k) is the desired signal. This way, for the cost function E(k) = \tfrac{1}{2} e^2(k), the weight update becomes

\Delta w_{i,n}(k) = -\eta \frac{\partial E(k)}{\partial w_{i,n}(k)} = -\eta\, e(k)\, \frac{\partial e(k)}{\partial w_{i,n}(k)}    (4)

Since the external signal vector s does not depend on the elements of w, the error gradient becomes

\frac{\partial e(k)}{\partial w_{i,n}(k)} = -\frac{\partial y_1(k)}{\partial w_{i,n}(k)}    (5)

This can be rewritten as

\frac{\partial y_1(k)}{\partial w_{i,n}(k)} = \Phi'(v_1(k))\, \frac{\partial v_1(k)}{\partial w_{i,n}(k)} = \Phi'(v_1(k)) \left[ \sum_{\alpha=1}^{N} \frac{\partial y_\alpha(k-1)}{\partial w_{i,n}(k)}\, w_{1,\alpha+p+q}(k) + \delta_{in}\, u_n(k) \right]    (6)

where

\delta_{in} = \begin{cases} 1, & i = n \\ 0, & i \neq n \end{cases}    (7)

Under the assumption, also used in the RTRL algorithm (Williams & Zipser, 1989), that when the learning rate \eta is sufficiently small, we have

\frac{\partial y_i(k-1)}{\partial w_{i,n}(k)} \approx \frac{\partial y_i(k-1)}{\partial w_{i,n}(k-1)}, \quad i = 1, \ldots, N    (8)

\frac{\partial y_1(k-i)}{\partial w_{i,n}(k)} \approx \frac{\partial y_1(k-i)}{\partial w_{i,n}(k-i)}, \quad i = 1, 2, \ldots, q    (9)

Introducing a triply indexed set of variables p_{i,n}^{j} to characterise the RTRL algorithm for the RNN, the sensitivities \partial y_j(k)/\partial w_{i,n}(k) (Haykin, 1999; Williams & Zipser, 1989) become

p_{i,n}^{j}(k) = \frac{\partial y_j(k)}{\partial w_{i,n}(k)}, \quad 1 \le j, i \le N, \; 1 \le n \le p+q+N    (10)

p_{i,n}^{j}(k+1) = \Phi'(v_j(k)) \left[ \sum_{m=1}^{N} w_{j,m}(k)\, p_{i,n}^{m}(k) + \delta_{ij}\, u_n(k) \right]    (11)

with initial conditions

p_{i,n}^{j}(0) = 0    (12)

To simplify the presentation, we introduce three new matrices: the N × (N+p+q) matrix P_j(k), the N × (N+p+q) matrix U_j(k), and the N × N diagonal matrix F(k), defined as (Haykin, 1999)

P_j(k) = \frac{\partial \mathbf{y}(k)}{\partial \mathbf{w}_j(k)}, \quad \mathbf{y}(k) = [y_1(k), \ldots, y_N(k)], \; j = 1, 2, \ldots, N    (13)

U_j(k) = [0, \ldots, 0, \mathbf{u}(k), 0, \ldots, 0]^T, \quad \text{with } \mathbf{u}^T(k) \text{ in the } j\text{th row and zeros elsewhere}    (14)

F(k) = \mathrm{diag}\big(\Phi'(\mathbf{u}^T(k)\mathbf{w}_1(k)), \ldots, \Phi'(\mathbf{u}^T(k)\mathbf{w}_N(k))\big)    (15)

so that the sensitivity recursion can be written compactly as

P_j(k+1) = F(k)\,[U_j(k) + W_a(k)\, P_j(k)]    (16)

where W_a denotes the set of those entries in W which correspond to the feedback connections.
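
For concreteness, a sketch of one RTRL weight update built on Eqs. (13)-(16) is shown below. It assumes a feedback order q = 1 (so that W_a is simply the last N columns of W), reuses the naming of the forward-pass sketch above, and is only a schematic of the recursion under those assumptions, not the authors' implementation.

```python
import numpy as np

def rtrl_step(W, P, u, v, y, s_k, eta, phi_prime, p):
    """One RTRL update following Eqs. (4), (13)-(16), assuming q = 1.

    W         : N x (p+1+N) weight matrix
    P         : list of N sensitivity matrices P_j(k), each N x (p+1+N), Eq. (13)
    u         : augmented input vector u(k) of Eq. (3)
    v, y      : net inputs and neuron outputs at time k
    s_k       : desired signal s(k)
    eta       : learning rate
    phi_prime : derivative Phi'(.) of the activation function
    """
    N = W.shape[0]
    e = s_k - y[0]                             # output error e(k) at neuron 1
    F = np.diag(phi_prime(v))                  # Eq. (15)
    Wa = W[:, p + 1:].copy()                   # feedback entries W_a (last N columns when q = 1)
    P_next = []
    for j in range(N):
        U_j = np.zeros_like(W)                 # Eq. (14): u(k) in the j-th row, zeros elsewhere
        U_j[j, :] = u
        P_next.append(F @ (U_j + Wa @ P[j]))   # Eq. (16)
    for j in range(N):
        # Eq. (4): Delta w_j = eta * e(k) * dy_1(k)/dw_j(k), i.e. the first row of P_j(k)
        W[j, :] += eta * e * P[j][0, :]
    return W, P_next, e
```

With the sensitivity matrices initialised to zero (Eq. (12)), alternating rnn_forward and rtrl_step reproduces the on-line training loop described above.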

3. Derivation of gradient-based algorithms to learn amplitudes

The AARTRL algorithm relies on the amplitude of the nonlinear activation function being adapted according to changes in the dynamics of the input signals. Extending the approach from Trentin (2001), we can express the activation function as

\Phi(k) = \lambda_i(k)\, \bar{\Phi}(k)    (17)

where \lambda_i denotes the amplitude of the nonlinearity \Phi(k), whereas \bar{\Phi}(k) denotes the activation function with unit amplitude. Thus, if \lambda_i = 1 it follows that \Phi(k) = \bar{\Phi}(k). The update for the gradient-adaptive amplitude is given by (Trentin, 2001)

\lambda_i(k+1) = \lambda_i(k) - \rho\, \nabla_{\lambda_i(k)} E_i(k)    (18)

E_i(k) = \tfrac{1}{2} e_i^2(k) = \tfrac{1}{2} [s(k) - y_i(k)]^2    (19)

where \nabla_{\lambda_i(k)} E_i(k) denotes the gradient of the cost function with respect to the amplitude \lambda_i of the activation function, and \rho denotes the step size of the algorithm, a small constant. From Eq. (19), the gradient can be obtained as

\nabla_{\lambda_i(k)} E_i(k) = \frac{\partial E_i(k)}{\partial \lambda_i(k)} = e_i(k)\, \frac{\partial e_i(k)}{\partial \lambda_i(k)} = -e_i(k)\, \frac{\partial y_i(k)}{\partial \lambda_i(k)}    (20)

where

y_i(k) = \lambda_i(k)\, \bar{\Phi}(v_i(k)) = \lambda_i(k)\, \bar{\Phi}(\mathbf{u}_i^T(k)\, \mathbf{w}_i(k))    (21)

From Eqs. (20) and (21), we compute the partial derivative as

\frac{\partial y_i(k)}{\partial \lambda_i(k)} = \bar{\Phi}(\mathbf{u}_i^T(k)\mathbf{w}_i(k)) + \lambda_i(k)\, \bar{\Phi}'(\mathbf{u}_i^T(k)\mathbf{w}_i(k))\, \frac{\partial [\mathbf{u}_i^T(k)\mathbf{w}_i(k)]}{\partial \lambda_i(k)}
  = \bar{\Phi}(\mathbf{u}_i^T(k)\mathbf{w}_i(k)) + \lambda_i(k)\, \bar{\Phi}'(\mathbf{u}_i^T(k)\mathbf{w}_i(k))\, \frac{\partial}{\partial \lambda_i(k)} \Big( \sum_{n=1}^{p} w_{i,n}(k)\, s(k-n) + w_{i,p+1}(k) + \sum_{m=1}^{q} w_{i,p+m+1}(k)\, y_1(k-m) + \sum_{l=2}^{N} w_{i,p+q+l}(k)\, y_l(k-1) \Big)    (22)

Since \partial \lambda_i(k-1)/\partial \lambda_i(k) = 0, the second term of Eq. (22) vanishes. This way we have

\nabla_{\lambda_i(k)} E_i(k) = \frac{\partial E_i(k)}{\partial \lambda_i(k)} = -e_i(k)\, \bar{\Phi}(v_i(k))    (23)

We next consider three cases: when all the neurons in the RNN share a common λ; when λ is kept common within a layer; and when every neuron has a different λ.

3.1. Case 1: single λ over the whole network

In this case λ_i(k) = λ(k) for all units, for which Eq. (18) holds. Thus we can write

y_i(k) = \Phi(v_i(k)) = \lambda(k)\, \bar{\Phi}(v_i(k))    (24)

\lambda(k+1) = \lambda(k) - \rho\, \nabla_{\lambda(k)} E_1(k) = \lambda(k) + \rho\, e_1(k)\, \bar{\Phi}(v_1(k))    (25)

3.2. Case 2: different λ for each layer

In this case, the fully connected recurrent network can be considered to have two layers. The first layer consists of the output neuron and the second layer contains the rest of the neurons in the network. This way

y_i(k) = \Phi(v_i(k)) = \begin{cases} \lambda_1(k)\, \bar{\Phi}(v_i(k)), & i = 1 \\ \lambda_2(k)\, \bar{\Phi}(v_i(k)), & i = 2, \ldots, N \end{cases}    (26)

\lambda_1(k+1) = \lambda_1(k) + \rho\, e_i(k)\, \bar{\Phi}(v_i(k)), \quad i = 1
\lambda_2(k+1) = \lambda_2(k) + \rho\, e_i(k)\, \bar{\Phi}(v_i(k)), \quad i = 2, \ldots, N    (27)

3.3. Case 3: different λ_i for each neuron of the network

This is the most general case, where each neuron has its own amplitude adaptation λ_i. In this case the output becomes

y_i(k) = \Phi(v_i(k)) = \lambda_i(k)\, \bar{\Phi}(v_i(k))    (28)

\lambda_i(k+1) = \lambda_i(k) - \rho\, \nabla_{\lambda_i(k)} E_i(k) = \lambda_i(k) + \rho\, e_i(k)\, \bar{\Phi}(v_i(k))    (29)
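
The three cases can be summarised in a few lines of code. The sketch below applies the updates of Eqs. (25), (27) and (29) to a vector of amplitudes; the function and mode names are ours, and the 'per_layer' branch, in which the shared λ_2 accumulates the contributions of all remaining neurons, is one possible reading of Eq. (27) rather than a detail stated in the paper.

```python
import numpy as np

def update_amplitudes(lam, e, phi_bar_v, rho, mode="per_neuron"):
    """Gradient-adaptive amplitude update, Eqs. (25), (27) and (29).

    lam       : length-N vector of amplitudes lambda_i(k); lam[0] belongs to the output neuron
    e         : length-N vector of errors e_i(k) = s(k) - y_i(k), Eq. (19)
    phi_bar_v : length-N vector of unit-amplitude activations Phi_bar(v_i(k))
    rho       : amplitude step size
    mode      : 'single' (Case 1), 'per_layer' (Case 2) or 'per_neuron' (Case 3)
    """
    grad = e * phi_bar_v                          # term e_i(k) * Phi_bar(v_i(k)) from Eq. (23)
    lam = lam.copy()
    if mode == "single":                          # Case 1: one lambda shared by the whole network
        lam[:] = lam[0] + rho * grad[0]           # Eq. (25): driven by the output neuron only
    elif mode == "per_layer":                     # Case 2: output neuron vs. remaining neurons
        lam[0] += rho * grad[0]                   # first "layer": the output neuron, Eq. (27)
        lam[1:] = lam[1] + rho * grad[1:].sum()   # shared lambda_2: summed contributions (one reading of Eq. (27))
    else:                                         # Case 3: an individual lambda for each neuron
        lam += rho * grad                         # Eq. (29)
    return lam
```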

4. Simulations

For the experiments, the amplitude of the input signals was scaled to be within the range [0, 0.1] and the nonlinearity at the neuron was chosen to be the logistic sigmoid function,

\Phi(x) = \frac{\lambda(k)}{1 + e^{-\beta x}}    (30)

with slope β = 1, learning rate η = 0.3 and initial amplitude λ(0) = 1. The step size of the adaptive amplitude was chosen to be ρ = 0.15. The architecture of the network for this experiment consists of five neurons with a tap-input length of 7 and a feedback order of 7. The input signal was an AR(4) process given by

r(k) = 1.79\, r(k-1) - 1.85\, r(k-2) + 1.27\, r(k-3) - 0.41\, r(k-4) + n(k)    (31)

with n(k) ~ N(0, 1), i.e. zero-mean, unit-variance white Gaussian noise, as the driving input. The nonlinear input signal was calculated as (Narendra & Parthasarathy, 1990)

z(k) = \frac{z(k-1)}{1 + z^2(k-1)} + r^3(k)    (32)

Monte Carlo analysis of 100 iterations was performed on the prediction of the coloured and nonlinear input signals. Figs. 2 and 3 show the performance curves of the RTRL and AARTRL algorithms for, respectively, the coloured and the nonlinear input signals. In both cases, the AARTRL algorithm outperformed the RTRL algorithm, giving faster error convergence. To further investigate the algorithm, the AARTRL algorithm was used to predict signals with rich dynamics, such as speech. Fig. 4 shows the plot of the adaptive amplitude λ.

Fig. 2. Performance curve for recurrent neural networks for prediction of coloured input.
Fig. 3. Performance curve for recurrent neural networks for prediction of nonlinear input.
Fig. 4. Adaptive amplitude for AARTRL having the same λ in the whole network on speech signals.
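
The benchmark inputs of Eqs. (31) and (32) are straightforward to reproduce. The sketch below generates the coloured AR(4) signal and its nonlinear transformation; the min-max mapping to the range [0, 0.1] is an assumption, since the paper states only that the inputs were scaled to that range.

```python
import numpy as np

def generate_inputs(n_samples, seed=0):
    """Coloured AR(4) input, Eq. (31), and nonlinear input, Eq. (32)."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(n_samples)          # driving noise n(k) ~ N(0, 1)
    r = np.zeros(n_samples)                     # coloured signal r(k)
    z = np.zeros(n_samples)                     # nonlinear signal z(k)
    a = (1.79, -1.85, 1.27, -0.41)              # AR(4) coefficients of Eq. (31)
    for k in range(4, n_samples):
        r[k] = (a[0] * r[k - 1] + a[1] * r[k - 2]
                + a[2] * r[k - 3] + a[3] * r[k - 4] + n[k])
        # Eq. (32): nonlinear transformation of the coloured signal
        z[k] = z[k - 1] / (1.0 + z[k - 1] ** 2) + r[k] ** 3

    def scale(x):                               # scale amplitudes to [0, 0.1], as in Section 4
        return 0.1 * (x - x.min()) / (x.max() - x.min())

    return scale(r), scale(z)
```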

Table 1. Mean squared error (MSE) obtained for each model on nonlinear and coloured inputs

Learning algorithm model                         Nonlinear input        Coloured input
                                                 Average     SD         Average     SD
RTRL with fixed amplitude                        0.0146      0.0115     0.0309      0.019
RTRL with single trainable amplitude             0.0118      0.0161     0.0120      0.0172
RTRL with layer-by-layer trainable amplitude     0.0115      0.0182     0.0119      0.0172
RTRL with neuron-by-neuron trainable amplitude   0.0110      0.0177     0.0116      0.0166

It is clearly shown that the amplitude λ adapts to the dynamics of the signal. The quantitative simulation results are summarised in Table 1, which evaluates the performance of the resulting networks in terms of the average value of the cost function E(k). In this experiment, Case 3 gave the best performance of the three cases. According to Trentin (2001), although Case 3 may seem more powerful, since it allows individual amplitudes to be trained, the resulting algorithm has more parameters and is therefore more complex.

5. Conclusion

The amplitude of the activation function in the RTRL algorithm has been made adaptive by employing a gradient-descent-based trainable amplitude. The proposed algorithm has been shown to converge faster than the standard RTRL algorithm for nonlinear adaptive prediction.

References

Haykin, S. (1999). Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Prentice-Hall.
Mandic, D. P., & Chambers, J. A. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. London: Wiley.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4-27.
Trentin, E. (2001). Networks with trainable amplitude of activation functions. Neural Networks, 14, 471-493.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270-280.