This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.


Title: A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks

Author(s): Xu, Zhao; Song, Qing; Wang, Danwei

Citation: Xu, Z., Song, Q., & Wang, D. A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks. Neural Computing and Applications.

Rights: © Springer-Verlag London Limited. This is the author-created version of a work that has been peer reviewed and accepted for publication by Neural Computing and Applications, Springer-Verlag London Limited. It incorporates the referee's comments, but changes resulting from the publishing process, such as copyediting and structural formatting, may not be reflected in this document.

Neural Computing and Applications

A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks
--Manuscript Draft--

Manuscript Number: NCA-0R
Full Title: A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks
Article Type: Original Article
Keywords: recurrent neural networks; stability and convergence analysis; SPSA
Corresponding Author: Zhao Xu, Ph.D., Singapore, SINGAPORE
First Author: Zhao Xu, Ph.D.
Order of Authors: Zhao Xu, Ph.D.; Qing Song; Danwei Wang

Response to Reviewers:

Comment: For simplicity of notation, the time subscripts of several parameters are ignored. However, the time subscripts should be kept in some definitions to differentiate variables and constants.
Reply: The time subscripts have been added in the equations.

Comment: Spacing is irregular between paragraphs and equations.
Reply: The page has been revised.

Comment: In the simulation, the authors simply show the results without a detailed explanation of why RRSPSA achieves better performance than the other algorithms.
Reply: Simulation results show that the BP training algorithm performs worst compared with the other methods. The reason is that the second partial derivatives are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter that guarantees weight convergence in the presence of noise. RRSPSA utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence.

Comment: The paper seems to consider only the synaptic dynamics; do the authors consider the neuron dynamics?
Reply: Neural engineers have usually studied either the synaptic dynamic system with weight updating or the neural dynamic system with neuron state changing independently; the two are equally important.

However, since the celebrated backpropagation training algorithm has been successfully applied to multilayered RNNs such as the Jordan and Elman networks, the taxonomy boundaries between studies of the two dynamic systems have become somewhat fuzzy. It should be pointed out that our study does not deal with the neural dynamic system, because the speed of weight convergence is usually much slower than that of the neurons.


Noname manuscript No. (will be inserted by the editor)

A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks

Zhao Xu · Qing Song · Danwei Wang

Received: date / Accepted: date

Abstract Training of recurrent neural networks (RNNs) introduces considerable computational complexity due to the need for gradient evaluations. How to obtain fast convergence speed and low computational complexity remains a challenging and open topic. In addition, the transient response of the learning process of RNNs is a critical issue, especially for on-line applications. Conventional RNN training algorithms, such as backpropagation through time (BPTT) and real-time recurrent learning (RTRL), have not adequately satisfied these requirements because they often suffer from slow convergence speed. If a large learning rate is chosen to improve performance, the training process may become unstable in terms of weight divergence. In this paper, a novel RNN training algorithm, named robust recurrent simultaneous perturbation stochastic approximation (RRSPSA), is developed with a specially designed recurrent hybrid adaptive parameter and adaptive learning rates. RRSPSA is a powerful novel twin-engine simultaneous perturbation stochastic approximation (SPSA) type of RNN training algorithm. It utilizes three specially designed adaptive parameters to maximize training speed for a recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements, as in the original SPSA algorithm. RRSPSA is proved to have guaranteed weight convergence and system stability in the sense of a Lyapunov function. Computer simulations were carried out to demonstrate the applicability of the theoretical results.

Zhao Xu
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: xuzh000@e.ntu.edu.sg

Qing Song
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: eqsong@e.ntu.edu.sg

Danwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: EDWWANG@e.ntu.edu.sg

Main Notation List

k: discrete time step.
m: output dimension.
n_I: input dimension.
n_h: number of hidden neurons.
p_v = m·n_h: dimension of the output layer weight vector.
p_w = n_I·n_h: dimension of the hidden layer weight vector.
l: number of time-delayed outputs.
u_i(k), i = 1, ..., m: external inputs at time step k.
x(k) ∈ R^{n_I}: neural network input vector.
Δx(k) ∈ R^{n_I}: perturbation vector on the input x(k).
y(k) ∈ R^m: desired output vector.
ŷ(k) ∈ R^m: estimated output vector.
ε(k) ∈ R^m: total disturbance vector.
e(k) ∈ R^m: output estimation error vector.
V̂(k) ∈ R^{p_v}: estimated weight vector of the output layer.
V* ∈ R^{p_v}: ideal weight vector of the output layer.
Ṽ(k) ∈ R^{p_v}: weight estimation error vector of the output layer.
Ŵ(k) ∈ R^{p_w}: estimated weight vector of the hidden layer.
W* ∈ R^{p_w}: ideal weight vector of the hidden layer.
W̃(k) ∈ R^{p_w}: weight estimation error vector of the hidden layer.
Ŵ_{j,:}(k) ∈ R^{n_I}: the jth row vector of the hidden layer weight matrix Ŵ(k).
δ_v(k): equivalent approximation error of the loss function of the output layer.
δ_w(k): equivalent approximation error of the loss function of the hidden layer.
c(k): perturbation gain parameter of the SPSA.
Δ^v(k) ∈ R^{p_v}, r^v(k) ∈ R^{p_v}: perturbation vectors of the output layer.
Δ^w(k) ∈ R^{p_w}, r^w(k) ∈ R^{p_w}: perturbation vectors of the hidden layer.
H(Ŵ(k), x(k)) = H(k) ∈ R^{m×p_v}: nonlinear activation function matrix.
h_j(k) = h(Ŵ_{j,:}(k)x(k)): the nonlinear activation function and scalar element of H(Ŵ(k), x(k)).
α_v(k), α_w(k): adaptive learning rates of the output and hidden layers.
ᾱ_v, ᾱ_w: positive scalars.
ρ_v(k), ρ_w(k): normalization factors of the output and hidden layers.
β_v(k): recurrent hybrid adaptive parameter of the output layer.

β_w(k): recurrent hybrid adaptive parameter of the hidden layer.
µ_j(k) ∈ R, 1 ≤ j ≤ n_h: mean value associated with the input vectors of the jth hidden layer neuron.
µ_j(k) ∈ R, 1 ≤ j ≤ n_h: mean value associated with the hidden layer weight vector of the jth hidden layer neuron.
τ: positive scalar.
λ: positive gain parameter of the threshold function.
η: a small perturbation parameter.

Introduction

Recurrent neural networks (RNNs) are inherently dynamic nonlinear systems that have signal feedback connections and are able to process temporal sequences. RNNs with free synaptic weights are more useful for complex dynamical problems such as nonlinear dynamic system identification, time-series prediction, and adaptive control than Hopfield neural networks (HNNs), in which the weights are fixed and symmetric [ ]. To maximize the potential and capabilities of RNNs, one of the preliminary tasks is to design an efficient training algorithm with fast weight convergence and low computational complexity. Gradient descent optimization techniques are often used for neural networks, adjusting the weights along the negative direction of the error gradient. However, due to the dynamic structure of RNNs, the obtained error gradient tends to be very small in a classical BP training algorithm with a fixed learning step, because an RNN weight depends not only on the current output but also on the outputs of past steps. It is widely acknowledged that the recurrent derivative of multilayered RNNs, i.e. the recurrent training signal (see Fig. ), which contains the information of dynamic time dependence in training, is necessary to obtain a good time response of RNNs [ ].

From the computational efficiency and weight convergence points of view, Spall proposed the simultaneous perturbation stochastic approximation (SPSA) method to find roots and minima of neural network functions that are contaminated with noise [ ]. The name simultaneous perturbation arises from the fact that all elements of the unknown weight parameters are varied simultaneously. SPSA has the potential to be significantly more efficient than the usual p-dimensional algorithms (of the Kiefer-Wolfowitz/Blum type) that are based on standard finite-difference gradient approximations, and it is shown in [ ] that approximately the same level of estimation accuracy can typically be achieved with only 1/p-th the amount of data needed in the standard approach. The main advantages of SPSA are its simplicity and excellent weight convergence property, in which only two objective function measurements are used. Thus, SPSA is more economical in terms of loss function evaluations, which are usually the most expensive part of an optimization problem. SPSA has attracted great attention in application areas such as neural network training, adaptive control, and model parameter estimation [ ].

Conventional training methods for RNNs are usually classified as backpropagation through time (BPTT) [ ] and real-time recurrent learning (RTRL) [ ] algorithms, which often suffer from the drawbacks of slow convergence speed and heavy computational load. In fact, BPTT cannot be used for on-line application due to computational problems. For standard RTRL training, recursive weight updating requires a small learning step to obtain weight convergence and system stability of the RNN. If a large learning rate is selected to speed up weight updating, the training process of RTRL may lead to a large steady-state error or even weight divergence [ ].

In previous work, we developed an adaptive SPSA-based training scheme for feedforward neural networks (FNNs) and demonstrated that the adaptive SPSA algorithm can be used to derive a fast training algorithm with guaranteed weight convergence [ ]. In this paper, we extend our SPSA research to RNNs with free weight updating and propose a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of deterministic systems with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems and contain recurrent connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step but also those at previous time steps must be perturbed, which makes the learning prone to divergence and the convergence analysis complicated to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and the dead zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig.  and the explanations in Sections III and IV).

There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, due to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. RRSPSA is a powerful twin-engine SPSA type of RNN training algorithm. It utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements, as in the standard SPSA algorithm. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.

The rest of this paper is organized as follows. In Section II, we briefly introduce the structure of the RNN and the conventional SPSA algorithm. In Sections III and IV, the RRSPSA algorithm and its robustness analysis are derived for the output and hidden layers, respectively. Computer simulation results are presented in Section V to show the efficiency of the proposed RRSPSA. Section VI draws the final conclusions.

Basics of the RNN and SPSA

To simplify the notation of this paper, we consider a three-layer, m-external-input and m-output RNN with the output feedback structure shown in Fig. , which can easily be extended to an RNN with arbitrary numbers of inputs and outputs. The output vector of the RNN is ŷ(k), which can be presented as

ŷ(k) = H(Ŵ(k), x(k)) V̂(k) = [ŷ_1(k), ..., ŷ_m(k)]^T ∈ R^m,

where V̂(k) ∈ R^{p_v} (p_v = m·n_h) is the long vector form of the output layer weight matrix V̂(k) ∈ R^{n_h×m}, defined by

V̂(k) = [V̂_{1,1}(k), ..., V̂_{n_h,1}(k), ..., V̂_{1,m}(k), ..., V̂_{n_h,m}(k)]^T,

x(k) ∈ R^{n_I} (n_I = (l+1)·m) is the input vector of the RNN, defined by

x(k) = [x_1(k), x_2(k), ..., x_{n_I}(k)]^T = [ŷ^T(k−1), ..., ŷ^T(k−l), u_1(k), ..., u_m(k)]^T,

and Ŵ(k) ∈ R^{p_w} (p_w = n_I·n_h) is the long vector form of the hidden layer weight matrix Ŵ(k) ∈ R^{n_h×n_I}, defined by

Ŵ(k) = [Ŵ_{1,:}(k), Ŵ_{2,:}(k), ..., Ŵ_{n_h,:}(k)].

H(Ŵ(k), x(k)) ∈ R^{m×p_v} is the nonlinear activation block-diagonal matrix given by

H(k) = H(Ŵ(k), x(k)) =
[ h_1(k), ..., h_{n_h}(k)    0, ..., 0                  ...    0, ..., 0
  0, ..., 0                  h_1(k), ..., h_{n_h}(k)    ...    0, ..., 0
  ...                        ...                        ...    ...
  0, ..., 0                  0, ..., 0                  ...    h_1(k), ..., h_{n_h}(k) ],

where h_j(k), 1 ≤ j ≤ n_h, is one of the most popular nonlinear activation functions,

h_j(k) = h(Ŵ_{j,:}(k)x(k)) = 1 / (1 + exp(−λ Ŵ_{j,:}(k)x(k))),

with Ŵ_{j,:}(k) ∈ R^{n_I} being the jth row vector of Ŵ(k), and λ > 0 is the gain parameter of the threshold function, defined specifically for ease of use later when deriving the sector condition of the hidden layer. The output vector ŷ(k) is an approximation of the ideal nonlinear function y*(k) = H(W*, x(k)) V*, with V* and W* being the ideal output layer and hidden layer weights, respectively.
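As a concrete illustration of the forward pass just described, the following minimal Python sketch (our own illustration, not the paper's code; all function and variable names are ours) computes ŷ(k) = H(Ŵ(k), x(k))V̂(k) for the output-feedback RNN, assuming the sigmoid activation with gain λ given above.

```python
import numpy as np

def sigmoid(z, lam=1.0):
    # h(z) = 1 / (1 + exp(-lam * z)), the threshold function with gain parameter lam
    return 1.0 / (1.0 + np.exp(-lam * z))

def rnn_forward(W_hat, V_hat, y_delays, u, lam=1.0):
    """One forward step of the output-feedback RNN (illustrative sketch).

    W_hat    : (n_h, n_I) hidden layer weight matrix
    V_hat    : (n_h, m)   output layer weight matrix
    y_delays : list of the l previous estimated outputs [yhat(k-1), ..., yhat(k-l)]
    u        : (m,)       external input u(k)
    """
    x = np.concatenate(y_delays + [u])   # input vector x(k) of dimension n_I = (l+1)*m
    h = sigmoid(W_hat @ x, lam)          # hidden activations h_j(k), j = 1..n_h
    y_hat = V_hat.T @ h                  # yhat(k) = H(W(k), x(k)) V(k), written compactly
    return y_hat, h, x
```

Writing the output as V̂^T h is simply a compact equivalent of multiplying the block-diagonal matrix H(k) by the long weight vector V̂(k).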

Fig.  Block diagram of RRSPSA training for the output-feedback-based RNN.

The SPSA algorithm solves the problem of finding an optimal weight θ* of the gradient equation ĝ(θ̂(k)) = ∂L(θ̂(k))/∂θ̂(k) = 0 for some differentiable loss function L: R^p → R, where θ̂(k) is the estimated weight vector of the neural network. The basic idea is to vary all elements of the vector θ̂(k) simultaneously and approximate the gradient function using only two measurements of the loss function, without requiring exact derivatives or a large number of function evaluations, which provides a significant improvement in efficiency over standard gradient descent algorithms. Define

g^+(θ̂(k)) = L(θ̂(k) + c(k)Δ(k)) + ε^+(k),
g^−(θ̂(k)) = L(θ̂(k) − c(k)Δ(k)) + ε^−(k),

where ε^+(k) and ε^−(k) represent measurement noise terms, c(k) is a scalar, and Δ(k) = [Δ_1(k), ..., Δ_p(k)] is the perturbation vector. Then the estimate of g(θ̂(k)) at the kth iteration is

ĝ(θ̂(k)) = [ (g^+(θ̂(k)) − g^−(θ̂(k))) / (2c(k)Δ_1(k)), ..., (g^+(θ̂(k)) − g^−(θ̂(k))) / (2c(k)Δ_p(k)) ]^T.

Theory and numerical experience in [ ] indicate that SPSA can be significantly more efficient than standard finite-difference-based algorithms in large-dimensional problems. Inspired by the excellent economy and convergence properties of SPSA, we propose the RRSPSA algorithm under the framework of deterministic systems, based on the idea of simultaneous perturbation with guaranteed weight convergence. The loss function for RRSPSA is defined as

L(θ̂(k)) = ‖e(k)‖²,    e(k) = y(k) − ŷ(k) + ε(k),

where y(k) is the ideal output, ŷ(k) is a function of θ̂(k), and ε(k) is a disturbance term. Similar to SPSA, the RRSPSA algorithm seeks the optimal weight vector θ* of the gradient equation, i.e., the weight vector that minimizes the differentiable loss function L(θ̂(k)).
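For reference, here is a small Python sketch (our own illustration, not the paper's code) of the classical two-measurement SPSA gradient estimate above, assuming a Bernoulli ±1 perturbation vector and a possibly noisy scalar loss function.

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng=np.random.default_rng()):
    """Two-measurement SPSA estimate of the gradient of `loss` at `theta`.

    loss : callable mapping a weight vector to a (possibly noisy) scalar L(theta)
    c    : perturbation gain c(k)
    """
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)   # simultaneous perturbation vector Delta(k)
    g_plus = loss(theta + c * delta)          # L(theta + c*Delta) plus measurement noise
    g_minus = loss(theta - c * delta)         # L(theta - c*Delta) plus measurement noise
    # element-wise estimate (g+ - g-) / (2 c Delta_i); note 1/Delta_i = Delta_i for +/-1 entries
    return (g_plus - g_minus) / (2.0 * c * delta)

# example usage on a simple quadratic loss
g = spsa_gradient(lambda t: float(np.sum((t - 1.0) ** 2)), np.zeros(5), c=0.01)
```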

That is, the proposed RRSPSA algorithm for updating θ̂(k) ∈ R^{p_w+p_v} as an estimate of the ideal weight vector θ* is of the form

θ̂(k+1) = θ̂(k) − α(k+1) ĝ(θ̂(k)),

where α(k+1) is an adaptive learning rate and ĝ(θ̂(k)) is an approximation of the gradient of the loss function. This approximated (normalized) gradient function is of the SPSA form

ĝ(θ̂(k)) = [L(θ̂(k) + c(k+1)Δ(k+1)) − L(θ̂(k) − c(k+1)Δ(k+1)) + δ(k)] r(k+1) / (c(k+1)ρ(k+1)),

where δ(k) is an equivalent approximation error of the loss function, also called the measurement noise, and ρ(k+1) is the normalization factor. Δ(k+1), r(k+1), and c(k+1) > 0 are the controlling parameters of the algorithm, defined as follows.

1. The perturbation vector Δ(k+1) = [Δ_1(k+1), ..., Δ_{p_w+p_v}(k+1)]^T ∈ R^{p_w+p_v} is a bounded random directional vector that is used to perturb the weight vector simultaneously and can be generated randomly.
2. The sequence r(k+1) is defined as r(k+1) = [1/Δ_1(k+1), ..., 1/Δ_{p_w+p_v}(k+1)]^T ∈ R^{p_w+p_v}.
3. c(k+1) > 0 is a sequence of positive numbers [ ]. c(k+1) can be chosen as a small constant for a slowly time-varying system, as pointed out in [ ], and it controls the size of the perturbations.

Note that, for simplicity of notation, we ignore the time subscripts of several parameters where no confusion is caused, such as Δ(k+1), r(k+1), c(k+1), δ(k), and α(k+1). A small sketch of how these controlling parameters can be generated is given below.
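The following minimal Python illustration (ours, not the paper's) shows one common way to generate the controlling parameters: a Bernoulli ±1 perturbation vector Δ(k+1), its element-wise inverse r(k+1), and a small constant perturbation gain c.

```python
import numpy as np

def controlling_parameters(p_total, c=0.01, rng=np.random.default_rng()):
    """Generate Delta(k+1), r(k+1), and c for one RRSPSA iteration.

    p_total : p_w + p_v, the total number of weights being perturbed
    c       : small constant perturbation gain (assumes a slowly time-varying system)
    """
    delta = rng.choice([-1.0, 1.0], size=p_total)   # bounded random directional vector
    r = 1.0 / delta                                  # r_i(k+1) = 1 / Delta_i(k+1)
    return delta, r, c
```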

Output Layer Training and Stability Analysis

In a multilayered neural network, the output layer weight vector V̂(k) and the hidden layer weight vector Ŵ(k) are normally updated separately using different learning rates, as shown in Fig.  and as in the standard BP training algorithm [ ]. During the updating of the output layer weights, the hidden layer weights are fixed. We are now ready to propose the RRSPSA algorithm for updating the estimated weight vector V̂(k) of the output layer. That is,

V̂(k+1) = V̂(k) − α_v(k+1) ĝ_v(V̂(k)),

where ĝ_v(V̂(k)) ∈ R^{p_v} is the normalized gradient approximation that uses the simultaneous perturbation vectors Δ^v(k+1) ∈ R^{p_v} and r^v(k+1) ∈ R^{p_v} to stimulate the weights. Δ^v(k+1) and r^v(k+1) are, respectively, the first p_v components of the perturbation vectors Δ(k+1) and r(k+1) defined above. The gradient approximation is

ĝ_v(V̂(k)) = −e^T(k)H(k)Δ^v r^v / ρ_v − β_v(k+1) e^T(k)A(k) r^v / (cρ_v(k+1)),

where A(k) ∈ R^m is the extended recurrent gradient defined as

A(k) = [H(Ŵ(k), D̂_v(k)(V̂(k) + cΔ^v)) − H(Ŵ(k), D̂_v(k)(V̂(k) − cΔ^v))] V̂(k),

D̂_v(k) = [H^T(k−1), ..., H^T(k−l), 0, ..., 0]^T.

For simplicity of notation, we ignore the time subscripts of Δ^v(k+1), r^v(k+1), β_v(k+1), ρ_v(k+1), and α_v(k+1) in the remainder of the paper.

Remark  Our purpose is to find the desired output layer weight vector V* of the gradient equation ĝ(V̂(k)) = ∂L(V̂(k))/∂V̂(k) = 0. Inspired by the excellent properties of SPSA, we obtain ĝ(V̂(k)) by varying all elements of the unknown weight parameters simultaneously, as shown in the proof of Lemma , under the framework of deterministic systems. We will show that RRSPSA has the same form as SPSA and uses only two objective function measurements at each iteration, which maintains the efficiency of SPSA. However, the weight convergence proof and stability analysis are quite different from those in [ ] under the framework of deterministic systems, as shown in the theorems below. For RNNs, the outputs depend not only on the weights at the current time step but also on those at previous time steps, since the current input vector contains the delayed outputs. The first term on the right-hand side of the gradient approximation above is the result of the perturbation of the feedforward signals in the RNN, and A(k) is the result of the perturbation of the recurrent signals, which makes the training complicated and the system prone to instability. In addition to the simplicity and efficiency of RRSPSA, one powerful aspect of the developed algorithm is that it is able to use an effectively large adaptive learning rate ᾱ_v to accelerate weight convergence with guaranteed system stability. To avoid weight divergence at each time step, an adaptive parameter β_v is used to control the recurrent signals and switch off the recurrent connections whenever the convergence condition is violated. The normalization factor ρ_v is also used to bound the signals in the learning algorithm, as traditionally done in adaptive control systems [ ]. α_v guarantees weight convergence during training, which means the weights are only updated when the convergence and stability requirements are met, and it does not need to be small as in the RTRL training algorithm. The sufficient conditions for the three key parameters are given after Theorem  below. D̂_v(k) is obtained during the mathematical deduction in the proof of Lemma .

Lemma  The normalized gradient approximation ĝ_v(V̂(k)) above is an equivalent presentation of the standard SPSA algorithm.

Proof We will prove that the gradient approximation above is obtained by perturbing all the weights simultaneously, as in SPSA. Rewrite the loss function as

L(θ̂(k)) = L(V̂(k), Ŵ(k)) = ‖y(k) − ŷ(k) + ε(k)‖².

Further, using ŷ(k) = H(k)V̂(k), define

ŷ^{v+}(V̂(k)) = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [ŷ^T(k−1), ..., ŷ^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [V̂^T(k−1)H^T(k−1), ..., V̂^T(k−l)H^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [(V̂(k−1) + cΔ^v)^T H^T(k−1), ..., (V̂(k−l) + cΔ^v)^T H^T(k−l), 0, ..., 0]^T) V̂(k).

ŷ^{v+}(V̂(k)) has the same motivation as g^+(θ̂(k)) of SPSA, namely to perturb all the weights simultaneously. Different from FNN training, which perturbs only the current weights, the input vector must be perturbed as well for RNN training because of the time dependence in RNNs. The disturbance Δx(k+1) perturbs the weights at the previous time steps, which are contained in x(k). However, the last equality above is difficult to implement. Here we assume that the model parameters do not change appreciably between iterations, i.e. V̂(k) ≈ V̂(k−1) ≈ ... ≈ V̂(k−l). We can then derive an approach similar to the RTRL introduced by Williams and Zipser [ ]. Moreover, such an approximation makes the proposed algorithm real-time and adaptive in a recursive fashion. Under this approximation,

ŷ^{v+}(V̂(k)) ≈ H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [(V̂(k) + cΔ^v)^T H^T(k−1), ..., (V̂(k) + cΔ^v)^T H^T(k−l), 0, ..., 0]^T) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [H^T(k−1), ..., H^T(k−l), 0, ..., 0]^T (V̂(k) + cΔ^v)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), D̂_v(k)(V̂(k) + cΔ^v)) V̂(k),

where D̂_v(k) is defined earlier. By the approximation V̂(k) ≈ V̂(k−1) ≈ ... ≈ V̂(k−l), the recurrent terms H^T(k−1), ..., H^T(k−l) can be obtained iteratively from the previous steps to speed up the calculation of RRSPSA (normally, the input u(k) is independent of the weights, so the last entries of D̂_v(k) are zero vectors). Weight convergence and system stability can be guaranteed by using the three adaptive learning rates, which will be explained in detail later. Similarly, we can get

ŷ^{v−}(V̂(k)) = H(k)(V̂(k) − cΔ^v) + H(Ŵ(k), x(k) − Δx(k+1)) V̂(k)
            ≈ H(k)(V̂(k) − cΔ^v) + H(Ŵ(k), D̂_v(k)(V̂(k) − cΔ^v)) V̂(k),

so that

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + A(k),

where the extended recurrent gradient A(k) is as defined earlier. A(k) is the result of the recurrent connections. To guarantee weight convergence and system stability during training, we insert an adaptive recurrent hybrid parameter β_v(k+1) (see Fig. ) to control the recurrent training signal. The convergence condition is examined at each step, and the recurrent contribution is switched off whenever the condition is violated. Thus, the above relation can be rewritten as

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + β_v A(k).

Accordingly, we can get

ĝ_v(V̂(k)) = [L(Ŵ(k), V̂(k) + cΔ^v) − L(Ŵ(k), V̂(k) − cΔ^v) + δ_v(k)] r^v / (cρ_v)
          = [‖y(k) − ŷ^{v+}(V̂(k))‖² − ‖y(k) − ŷ^{v−}(V̂(k))‖² + δ_v(k)] r^v / (cρ_v)
          = {[y(k) − ŷ^{v+}(V̂(k)) + y(k) − ŷ^{v−}(V̂(k))]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] + δ_v(k)} r^v / (cρ_v)
          = [y(k) − ŷ(k) + ε(k)]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] r^v / (cρ_v)
          = e^T(k) [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] r^v / (cρ_v)
          ≈ −e^T(k)H(k)Δ^v r^v / ρ_v − β_v e^T(k)A(k) r^v / (cρ_v).

e(k) in the fifth equality comes from the definition of the training error. The fourth equality reveals the relationship between the gradient approximation error δ_v(k) and the system disturbance

ε(k), i.e.,

δ_v(k) = [ŷ^{v+}(V̂(k)) + ŷ^{v−}(V̂(k)) − 2ŷ(k) + ε(k)]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] ≈ ε^T(k) [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))].

In the fifth equality, the gradient approximation has exactly the same presentation as the original one in [ ]. Only two objective function measurements are used at each step, which maintains the efficiency of SPSA. The training error e(k) comes from the weight estimation error and the disturbance. For the stability analysis, e(k) will be linked to the weight estimation error of the output layer and the disturbance. Define the weight estimation errors of the output layer and the hidden layer as

Ṽ(k) = V* − V̂(k),    W̃(k) = W* − Ŵ(k),

where V* and W* are the ideal (optimal) weight vectors corresponding to V̂(k) and Ŵ(k). The training error can then be rewritten as

e(k) = y(k) − ŷ(k) + ε(k)
     = H(W*, x(k))V* − H(k)V̂(k) + ε(k)
     = H(W*, x(k))V* − H(k)V* + H(k)V* − H(k)V̂(k) + ε(k).

Because the term H(W*, x(k))V* − H(k)V* is temporarily bounded during output layer training (the hidden layer weights are fixed), we define

ẽ_v(k) = H(W*, x(k))V* − H(k)V* + ε(k) = H̃(Ŵ(k), W*, x(k))V* + ε(k),

with H̃(Ŵ(k), W*, x(k)) = H(W*, x(k)) − H(k). The training error caused by the weight estimation error of the output layer is defined as

φ_v(k) = H(k)Ṽ(k).

Then the training error can be rewritten as

e(k) = φ_v(k) + ẽ_v(k).

Hence, the training error is directly linked to the weight estimation error of the output layer, contained in φ_v(k), and to the disturbance ẽ_v(k). This is critical for the convergence proof.

Theorem  If the output layer of the RNN is trained by the RRSPSA algorithm, Ṽ(k) is guaranteed to be convergent in the sense of the Lyapunov function, ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ 0 for all k, with Ṽ(k) = V* − V̂(k), and the training error will be L-stable in the sense of ẽ_v(k) ∈ L and φ_v(k) ∈ L.

Proof See Section .

The conditions for the three adaptive parameters of the output layer are as follows. The detailed procedure for obtaining these conditions is shown in the proof of Theorem .

1. ρ_v(k+1) is the normalization factor, defined by

ρ_v(k+1) = τρ_v(k) + max{ ᾱ_v Z(k) / (p_v + 0.5β_v Y(k)) + 1, ρ̄ },

where 0 < τ < 1, ᾱ_v > 0, and ρ̄ > 0 are constants, and

Y(k) = (r^v)^T (ηI + H^T(k)H(k))^{−1} H^T(k) A(k) / c,

where η is a small perturbation parameter, and

Z(k) = ‖H(k)cΔ^v + β_v A(k)‖² ‖r^v‖² / c².

2. α_v(k+1) is an adaptive learning rate similar to the dead zone approach in classical neural network and adaptive control theory:

α_v(k+1) = ᾱ_v  if ‖e(k)‖ ≥ Λ(k),
α_v(k+1) = 0    if ‖e(k)‖ < Λ(k),

with Λ(k) = ẽ_v^m(k) / sqrt(1 − ᾱ_v Z(k) / (ρ_v(p_v + 0.5β_v Y(k)))) and ẽ_v^m(k) being the maximum value of ẽ_v(k).

3. β_v(k+1) is the recurrent hybrid adaptive parameter, determined by

β_v(k+1) = 1  if Y(k) ≥ 0,
β_v(k+1) = 0  if Y(k) < 0.

Remark  According to the theoretical analysis, the three parameters β_v(k+1), α_v(k+1), and ρ_v(k+1) play a key role in the proof of weight convergence and the stability analysis. They are designed to guarantee the negativeness of the Lyapunov function difference. The function of β_v(k+1) is to automatically switch off the recurrent training signal (in the lower left corner of Fig. ) when the convergence and stability requirements are violated because of the recurrent training signals. The normalization factor ρ_v(k+1) is used to bound the signals in the learning algorithm, as traditionally done in adaptive control systems [ ]. α_v(k+1) guarantees weight convergence during training, which means that the weights are only updated when the convergence and stability requirements are met. Moreover, α_v(k+1) does not need to be small as in the RTRL training algorithm, which makes training by RRSPSA faster than training by RTRL. From the definition of α_v(k+1), it can be seen that the weights are updated (α_v(k+1) = ᾱ_v) only if the training error is larger than the dead zone value; learning stops when the training error falls below the dead zone level. This provides good generalization rather than learning the system noise. In this paper, we choose a suitable value of ẽ_v^m(k) according to the noise level.
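At run time, the two switching rules above reduce to simple threshold tests. The short Python sketch below (our own illustration; the threshold Lambda_k and the scalar Y_k are assumed to have been computed from the expressions above) shows the intended behaviour.

```python
def dead_zone_learning_rate(e_norm, Lambda_k, alpha_bar):
    """alpha_v(k+1): update only when the output error exceeds the dead-zone level."""
    return alpha_bar if e_norm >= Lambda_k else 0.0

def recurrent_hybrid_parameter(Y_k):
    """beta_v(k+1): switch the recurrent training signal off when Y(k) is negative."""
    return 1.0 if Y_k >= 0.0 else 0.0
```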

Remark  The perturbation size c(k+1) is a critical variable, as discussed in the original SPSA papers [ ]. A natural question is whether c(k+1) affects the learning procedure of the recurrent learning algorithm. The answer is yes. This is not obvious from the updating formula of ĝ_v(V̂(k)), but it can easily be seen by expanding the second term A(k) so that it is explicit in c(k). According to the mean value theorem and the fact that the activation function is nondecreasing, it can readily be shown that there exist unique positive mean values µ_j(k) such that

h(Ŵ_{j,:}(k)x_1(k)) − h(Ŵ_{j,:}(k)x_2(k)) = µ_j(k)Ŵ_{j,:}(k)(x_1(k) − x_2(k)),

where Ŵ_{j,:}(k) is the estimated weight vector linked to the jth hidden layer neuron. From this relation and the block-diagonal structure of H, A(k) can be further expanded as

A(k) = cΩ̄(k),

where Ω̄(k) ∈ R^m is defined as

Ω̄(k) = [ Ω_1(k), ..., Ω_{n_h}(k)    0, ..., 0    ...    0, ..., 0
          ...
          0, ..., 0    ...    Ω_1(k), ..., Ω_{n_h}(k) ] V̂(k),

with Ω_j(k) = µ_j(k)Ŵ_{j,:}(k)D̂_v(k)Δ^v, j = 1, ..., n_h. With this expansion of A(k), c(k) is explicit. To reveal the effect of c(k), we further expand

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + β_v(k+1)cΩ̄(k).

Using this and the earlier relations, we obtain the presentation of the system disturbance in terms of the measurement noise δ_v(k) of the RRSPSA algorithm:

ε(k) = {[H(k)Δ^v + β_vΩ̄(k)][H(k)Δ^v + β_vΩ̄(k)]^T}^{−1} [H(k)Δ^v + β_vΩ̄(k)] δ_v(k) / c.

Furthermore, according to the definition of the equivalent disturbance, we have

ẽ_v(k) = H̃(Ŵ(k), W*, x(k))V* + ε(k)
       = H̃(Ŵ(k), W*, x(k))V* + {[H(k)Δ^v + β_vΩ̄(k)][H(k)Δ^v + β_vΩ̄(k)]^T}^{−1} [H(k)Δ^v + β_vΩ̄(k)] δ_v(k) / c.

The equivalent disturbance ẽ_v(k) is important, since its maximum value ẽ_v^m(k) decides the range of the dead zone for α_v. Note also that the measurement noise δ_v(k) is defined in the same way as in [ ]; it measures the difference between the true gradient and the gradient approximation. Because the system disturbance ε(k) is

determined only by external conditions, a relatively small gain parameter c implies a small measurement noise δ_v(k) according to the relation above. This allows us to choose the dead zone range mainly according to the external system disturbance ε(k). Furthermore, α_v is nonzero whenever the norm of e(k) is larger than the dead zone value, which is related to ẽ_v^m(k). If the external disturbance is also small, we can choose a small dead zone range, which ensures that the adaptive learning rate α_v keeps a nonzero value for a prolonged period of training time.

Hidden Layer Training and Stability Analysis

The algorithm for updating the estimated weight vector Ŵ(k) of the hidden layer is

Ŵ(k+1) = Ŵ(k) − α_w(k+1) ĝ_w(Ŵ(k)),

where ĝ_w(Ŵ(k)) ∈ R^{p_w} is the normalized gradient approximation that uses the simultaneous perturbation vectors Δ^w(k+1) ∈ R^{p_w} and r^w(k+1) ∈ R^{p_w} to stimulate the weights of the hidden layer. Δ^w(k+1) and r^w(k+1) are, respectively, the last p_w components of the perturbation vectors Δ(k+1) and r(k+1). The gradient approximation is

ĝ_w(Ŵ(k)) = −e^T(k)[B^+(k) − B^−(k) + β_w(k+1)D̂_w(k)] r^w(k+1) / (cρ_w(k+1)),

where

B^+(k) = H(Ŵ(k) + cΔ^w(k+1), x(k)) V̂(k) ∈ R^m,
B^−(k) = H(Ŵ(k) − cΔ^w(k+1), x(k)) V̂(k) ∈ R^m,
D̂_w(k) = {H(Ŵ(k), [(B^+(k−1))^T, ..., (B^+(k−l))^T, 0, ..., 0]^T) − H(Ŵ(k), [(B^−(k−1))^T, ..., (B^−(k−l))^T, 0, ..., 0]^T)} V̂(k) ∈ R^m.

For simplicity of notation, we ignore the time subscripts of Δ^w(k+1), r^w(k+1), β_w(k+1), ρ_w(k+1), and α_w(k+1) in the remainder of the paper.

Remark  As for the output layer, our purpose is to find the desired hidden layer weight vector W* of the gradient equation ĝ(Ŵ(k)) = ∂L(Ŵ(k))/∂Ŵ(k) = 0. The detailed mathematical deduction is given in the proof of Lemma . B^+(k) − B^−(k) is the result of the perturbations of the feedforward signals, and D̂_w(k) is the result of the perturbations of the recurrent signals in the RNN. The function of β_w is to automatically switch off the recurrent contribution when the convergence and stability requirements are violated. The conditions for the key parameters of the hidden layer are given after Theorem  below. A small code sketch of the perturbed terms B^+(k) and B^−(k) follows.
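As an illustration (ours, not the paper's code), the perturbed output terms B^+(k) and B^−(k) can be computed as below, assuming the hidden weight matrix Ŵ(k) is stored in matrix form and the perturbation Δ^w is reshaped accordingly.

```python
import numpy as np

def hidden_perturbations(W_hat, V_hat, x, delta_w, c, lam=1.0):
    """B+(k) and B-(k): RNN outputs with the hidden weights perturbed by +/- c*Delta^w.

    W_hat   : (n_h, n_I) hidden layer weight matrix
    V_hat   : (n_h, m)   output layer weight matrix
    x       : (n_I,)     current input vector x(k)
    delta_w : (p_w,)     hidden layer perturbation vector, p_w = n_h * n_I
    """
    D = delta_w.reshape(W_hat.shape)                              # perturbation in matrix form
    h_plus = 1.0 / (1.0 + np.exp(-lam * ((W_hat + c * D) @ x)))   # perturbed hidden activations
    h_minus = 1.0 / (1.0 + np.exp(-lam * ((W_hat - c * D) @ x)))
    B_plus = V_hat.T @ h_plus                                     # H(W + c*Delta^w, x(k)) V(k)
    B_minus = V_hat.T @ h_minus                                   # H(W - c*Delta^w, x(k)) V(k)
    return B_plus, B_minus
```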

Lemma  The normalized gradient approximation ĝ_w(Ŵ(k)) above is an equivalent presentation of the standard SPSA algorithm.

Proof Similar to the definitions of ŷ^{v+}(V̂(k)) and ŷ^{v−}(V̂(k)), we define two perturbation functions ŷ^{w+}(Ŵ(k)) and ŷ^{w−}(Ŵ(k)), i.e.,

ŷ^{w+}(Ŵ(k)) = H(Ŵ(k) + cΔ^w, x(k)) V̂(k) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k).

From the definition of B^+(k),

ŷ^{w+}(Ŵ(k)) = H(Ŵ(k) + cΔ^w, x(k)) V̂(k) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k)
            = B^+(k) + H(Ŵ(k), [ŷ^T(k−1), ..., ŷ^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = B^+(k) + H(Ŵ(k), [V̂^T(k−1)H^T(Ŵ(k−1) + cΔ^w, x(k−1)), ..., V̂^T(k−l)H^T(Ŵ(k−l) + cΔ^w, x(k−l)), 0, ..., 0]^T) V̂(k)
            ≈ B^+(k) + H(Ŵ(k), [(B^+(k−1))^T, ..., (B^+(k−l))^T, 0, ..., 0]^T) V̂(k).

From the definition of B^−(k), and similarly, we get

ŷ^{w−}(Ŵ(k)) ≈ B^−(k) + H(Ŵ(k), [(B^−(k−1))^T, ..., (B^−(k−l))^T, 0, ..., 0]^T) V̂(k).

Then we have

ŷ^{w+}(Ŵ(k)) − ŷ^{w−}(Ŵ(k)) ≈ B^+(k) − B^−(k) + D̂_w(k).

D̂_w(k) is the result of the recurrent training signal, as shown in Fig. . To guarantee weight convergence and system stability during training, we add an adaptive recurrent hybrid parameter β_w(k+1) to control the recurrent learning signal path. That is,

ŷ^{w+}(Ŵ(k)) − ŷ^{w−}(Ŵ(k)) ≈ B^+(k) − B^−(k) + β_w D̂_w(k).

Accordingly, we can get

ĝ_w(Ŵ(k)) = [L(Ŵ(k) + cΔ^w, V̂(k)) − L(Ŵ(k) − cΔ^w, V̂(k)) + δ_w(k)] r^w / (cρ_w)
          = [‖y(k) − ŷ^{w+}(k)‖² − ‖y(k) − ŷ^{w−}(k)‖² + δ_w(k)] r^w / (cρ_w)
          = {[y(k) − ŷ^{w+}(k) + y(k) − ŷ^{w−}(k)]^T [ŷ^{w−}(k) − ŷ^{w+}(k)] + δ_w(k)} r^w / (cρ_w)
          = [y(k) − ŷ(k) + ε(k)]^T [ŷ^{w−}(k) − ŷ^{w+}(k)] r^w / (cρ_w)
          = e^T(k) [ŷ^{w−}(k) − ŷ^{w+}(k)] r^w / (cρ_w)
          ≈ −e^T(k)[B^+(k) − B^−(k) + β_w D̂_w(k)] r^w / (cρ_w),

where the equivalent disturbance δ_w(k) is defined as

δ_w(k) = [ŷ^{w−}(Ŵ(k)) − ŷ^{w+}(Ŵ(k))]^T ε(k) + [ŷ^{w−}(Ŵ(k)) − ŷ^{w+}(Ŵ(k))]^T [ŷ^{w+}(Ŵ(k)) + ŷ^{w−}(Ŵ(k)) − 2ŷ(k)].

The fourth equality above reveals the relationship between δ_w(k) and ε(k). For the stability analysis, the training error will be linked to the parameter estimation error of the hidden layer and the disturbance. According to the mean value theorem and the fact that the activation function is nondecreasing, it can readily be shown that there exist unique positive mean values µ_j(k) such that

h(W*_{j,:}x(k)) − h(Ŵ_{j,:}(k)x(k)) = µ_j(k)(W*_{j,:} − Ŵ_{j,:}(k))x(k) = µ_j(k)W̃_{j,:}(k)x(k),

where Ŵ_{j,:}(k), W*_{j,:}, and W̃_{j,:}(k) are the estimated weight, ideal weight, and weight estimation error vectors linked to the jth hidden layer neuron, respectively. From this relation and the block-diagonal structure of H, it can be shown that

H̃(Ŵ(k), W*, x(k)) V̂(k) = [H(W*, x(k)) − H(k)] V̂(k) = Ω̃(k) W̃(k),

where Ω̃(k) ∈ R^{m×p_w} is defined as

Ω̃(k) = [ µ_1(k)V̂_1(k)x^T(k)             ...   µ_{n_h}(k)V̂_{n_h}(k)x^T(k)
          µ_1(k)V̂_{n_h+1}(k)x^T(k)       ...   µ_{n_h}(k)V̂_{2n_h}(k)x^T(k)
          ...
          µ_1(k)V̂_{(m−1)n_h+1}(k)x^T(k)  ...   µ_{n_h}(k)V̂_{p_v}(k)x^T(k) ]
       = [ µ_1(k)V̂^T_{1,:}(k)x^T(k), ..., µ_{n_h}(k)V̂^T_{n_h,:}(k)x^T(k) ],

where V̂^T_{j,:}(k) is the transpose of the jth row of the estimated output layer weight matrix V̂(k). From the activation function definition, the maximum value of the derivative ḣ_j(k) = ∂h(Ŵ_{j,:}(k)x(k))/∂(Ŵ_{j,:}(k)x(k)) is λ, and hence 0 ≤ µ_j(k) ≤ λ (1 ≤ j ≤ n_h). We can rewrite the estimation error as

e(k) = y(k) − ŷ(k) + ε(k)
     = H(W*, x(k))V* − H(W*, x(k))V̂(k) + H(W*, x(k))V̂(k) − H(k)V̂(k) + ε(k)
     = H̃(Ŵ(k), W*, x(k))V̂(k) + H(W*, x(k))Ṽ(k) + ε(k)
     = φ_w(k) + ẽ_w(k),

where

φ_w(k) = H̃(Ŵ(k), W*, x(k))V̂(k) = Ω̃(k)W̃(k),

and the bounded equivalent disturbance ẽ_w(k) of the hidden layer is defined as

ẽ_w(k) = H(W*, x(k))Ṽ(k) + ε(k).

Note that ẽ_w(k) is a bounded signal because Ṽ(k) has been proven to be bounded in Section III. From the definitions of B^+(k) and B^−(k), we can get

[B^+(k) − B^−(k)] / c = Ω̂(k)Δ^w,

where Ω̂(k) ∈ R^{m×p_w} is similar to Ω̃(k) except with different mean values.

Theorem  If the hidden layer of the RNN is trained by the adaptive RRSPSA algorithm, the weight Ŵ(k) is guaranteed to be convergent in the sense of the Lyapunov function, ‖W̃(k+1)‖² − ‖W̃(k)‖² ≤ 0 for all k, with W̃(k) = W* − Ŵ(k), and the training error will be L-stable in the sense of ẽ_w(k) ∈ L and φ_w(k) ∈ L.

Proof See Section .

The three adaptive parameters of the hidden layer are as follows.

1. The normalization factor is defined as

ρ_w(k+1) = τρ_w(k) + max{ 1 + ᾱ_w M(k)‖r^w‖² / ϒ(k), ρ̄ },

where

M(k) = ‖B^+(k) − B^−(k) + β_w D̂_w(k)‖² / c²,
ϒ(k) = (λ_min/λ)[ p_w + λc β_w (r^w)^T (ηI + Ω̄^T(k)Ω̄(k))^{−1} Ω̄^T(k) D̂_w(k) ],

with 0 < ᾱ_w a positive constant, λ the maximum value of the derivative ḣ_j(k) of the activation function, and λ_min ≥ 0 the minimum nonzero value of the mean value µ_j(k), which satisfies 0 ≤ µ_j(k) ≤ λ (1 ≤ j ≤ n_h), and

Ω̄(k) = [ V̂_1(k)x^T(k)              ...   V̂_{n_h}(k)x^T(k)
          V̂_{n_h+1}(k)x^T(k)        ...   V̂_{2n_h}(k)x^T(k)
          ...
          V̂_{(m−1)n_h+1}(k)x^T(k)   ...   V̂_{p_v}(k)x^T(k) ]
       = [ V̂^T_{1,:}(k)x^T(k), ..., V̂^T_{n_h,:}(k)x^T(k) ] ∈ R^{m×p_w},

where V̂_j(k) is the jth element of V̂(k), 1 ≤ j ≤ p_v.

2. The adaptive dead zone learning rate of the hidden layer is defined as

α_w(k+1) = ᾱ_w  if ‖e(k)‖ ≥ Γ(k),
α_w(k+1) = 0    if ‖e(k)‖ < Γ(k),

with Γ(k) = ẽ_w^m(k) / sqrt(1 − ᾱ_w ‖r^w‖² M(k) / (ρ_w ϒ(k))) and ẽ_w^m(k) being the maximum value of ẽ_w(k).

3. The recurrent hybrid adaptive parameter of the hidden layer, β_w, is defined as

β_w(k+1) = 1  if (r^w)^T χ(k) ≥ 0,
β_w(k+1) = 0  if (r^w)^T χ(k) < 0,

with χ(k) = (ηI + Ω̄^T(k)Ω̄(k))^{−1} Ω̄^T(k) D̂_w(k). β_w(k+1) automatically cuts off the recurrent training signal (in the lower right corner of Fig. ) whenever the weight convergence and stability conditions of the RNN are violated, as shown in Theorem  of Section IV.

Remark  Summary of the RRSPSA algorithm for RNNs (a schematic code sketch of this loop follows the list):

1. Set a maximum training step as the stop criterion;
2. Initialize the weights of the RNN;
3. Form the new input x(k) as defined earlier;
4. Perform the forward computation and calculate the output ŷ(k) and the error e(k) with the input and the current estimated weights;
5. Calculate ρ_v(k+1), α_v(k+1), and β_v(k+1);
6. Update the output layer weights V̂(k+1) via the output layer learning law;
7. Calculate ρ_w(k+1), α_w(k+1), and β_w(k+1);
8. Update the hidden layer weights Ŵ(k+1) via the hidden layer learning law;
9. Go back to step 3 to continue the iteration until the stop criterion is reached.
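The following heavily simplified, self-contained Python sketch (our own illustration) follows the nine steps above but replaces the adaptive parameters ρ, α, and β of Sections III and IV with a constant learning rate and a fixed dead-zone threshold, keeping only the two-measurement SPSA-style updates per layer; it sketches the structure of the loop, not the full RRSPSA algorithm.

```python
import numpy as np

def train_rrspsa_sketch(inputs, targets, n_h, l, c=0.01, alpha=0.1, dead_zone=1e-3, steps=1000):
    """Simplified RRSPSA-style training loop; step numbers refer to the summary above.

    inputs, targets : arrays of shape (N, m) with external inputs u(k) and desired outputs y(k)
    """
    m = targets.shape[1]
    n_I = (l + 1) * m
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n_h, n_I))          # step 2: hidden layer weights
    V = 0.1 * rng.standard_normal((n_h, m))            # step 2: output layer weights
    y_hist = [np.zeros(m) for _ in range(l)]

    def forward(Wm, Vm, x):
        h = 1.0 / (1.0 + np.exp(-Wm @ x))              # sigmoid hidden activations
        return Vm.T @ h

    def loss(Wm, Vm, x, y):
        return float(np.sum((y - forward(Wm, Vm, x)) ** 2))

    for k in range(steps):                              # step 1: stop criterion
        u, y = inputs[k % len(inputs)], targets[k % len(targets)]
        x = np.concatenate(y_hist + [u])                # step 3: form the input x(k)
        e = y - forward(W, V, x)                        # step 4: output and error
        if np.linalg.norm(e) >= dead_zone:              # crude stand-in for the dead zone (steps 5, 7)
            dV = rng.choice([-1.0, 1.0], size=V.shape)  # steps 5-6: output layer SPSA-style update
            gV = (loss(W, V + c * dV, x, y) - loss(W, V - c * dV, x, y)) / (2.0 * c)
            V = V - alpha * gV * dV
            dW = rng.choice([-1.0, 1.0], size=W.shape)  # steps 7-8: hidden layer SPSA-style update
            gW = (loss(W + c * dW, V, x, y) - loss(W - c * dW, V, x, y)) / (2.0 * c)
            W = W - alpha * gW * dW
        y_hist = ([forward(W, V, x)] + y_hist)[:l]      # step 9: shift the output history
    return W, V
```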

Examples

Time series prediction

Sunspots are dark blotches on the sun caused by magnetic storms on its surface. The underlying mechanism of sunspot appearances is not exactly known. The sunspot data are a classical example of a combination of periodic and chaotic phenomena; they have served as a benchmark in the statistics literature on time series, and much work has been done in trying to analyze the sunspot data using neural networks [ ]. One set of the high-quality American sunspot numbers, the monthly sunspot numbers from December to October 00 in the database, is used as an example to minimize unnecessary human measurement errors. The total number of American sunspot data points in the monthly database is . The output sequence is scaled by dividing by 00 to get it into a range suitable for the network. The hidden neuron number of the RNN is ten. The selected learning rates for BP and RTRL are 0.0 and 0.0, respectively. The perturbation step size for RRSPSA is c = 0.0. Fig.  shows the outputs of the plant y*(k) and the estimated outputs ŷ(k) of the RNN trained by the three different methods. Table  gives the comparison results of the three methods in terms of the mean squared error (MSE) and standard deviation (STD).

Table  Comparison of the mean squared error and standard deviation of the training error.
        RRSPSA   RTRL   BP
MSE     .0e      .0e    .0e
STD     .e       .e     .e

Michaelis-Menten equation

Consider the continuous-time Michaelis-Menten equation [ ]

ẏ(t) = −0. y(t) / (1 + y(t)) + u(t).

Given a unit impulse as input u(t), a fourth-order Runge-Kutta algorithm is used to simulate this model with integral step size Δt = 0.0, and 000 equi-spaced samples are obtained from the input and output with a sampling interval of T = 0.0 time units. The measurement noise is white Gaussian noise with mean 0 and variance 0.0. An RNN is utilized to emulate the dynamics of the given system and to approximate the system output. The gradient function of this model has many local minima, and the global minimum has a very narrow domain of attraction [ ]. We compare the performance of RRSPSA with that of both the BP and RTRL training algorithms. The hidden neuron number of the RNN in this example is selected as ten. The fixed learning rates for BP and RTRL are 0.0 and 0.0, respectively. The perturbation size c = 0.0 is selected for RRSPSA. Fig.  shows the outputs of the plant y*(k) and the estimated outputs ŷ(k) of the RNN trained by the three different methods.
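Since the numerical constants of the Michaelis-Menten model above did not survive the digitisation, the following Python sketch (our own, with placeholder coefficients a and b and a crude impulse approximation) only illustrates how a fourth-order Runge-Kutta integration of such a scalar model and the subsequent equi-spaced sampling can be set up.

```python
import numpy as np

def michaelis_menten_rhs(y, u, a=0.5, b=1.0):
    # dy/dt = -a*y/(1+y) + b*u; a and b are placeholder values, not the paper's constants
    return -a * y / (1.0 + y) + b * u

def rk4_simulate(u_fn, y0=0.0, dt=0.01, n_steps=4000, sample_every=5):
    """Fourth-order Runge-Kutta integration, sampled every `sample_every` steps."""
    y, samples = y0, []
    for i in range(n_steps):
        t = i * dt
        u = u_fn(t)
        k1 = michaelis_menten_rhs(y, u)
        k2 = michaelis_menten_rhs(y + 0.5 * dt * k1, u)
        k3 = michaelis_menten_rhs(y + 0.5 * dt * k2, u)
        k4 = michaelis_menten_rhs(y + dt * k3, u)
        y += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        if i % sample_every == 0:
            samples.append(y)
    return np.array(samples)

# unit impulse approximated by a short rectangular pulse of unit area at t = 0
impulse = lambda t: 100.0 if t < 0.01 else 0.0
outputs = rk4_simulate(impulse)
```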

Table  Comparison of the mean squared error and standard deviation of the error.
        RRSPSA   RTRL   BP
MSE     .e       .e     .e
STD     .e       .e     .e

Table  shows the comparison results of the three methods in terms of MSE and STD. The simulation results show that the BP training algorithm performs worst compared with the other methods. The reason is that the second partial derivatives, such as those appearing in the recurrent terms above, are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter that guarantees weight convergence in the presence of noise. RRSPSA utilizes the three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence.

Conclusion

We have proposed a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of deterministic systems with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems and contain recurrent connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step but also those at previous time steps must be perturbed, which makes the learning prone to divergence and the convergence analysis complicated to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and the dead zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig.  and the explanations in Sections III and IV). There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, due to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.

Proof of Theorem 

Proof According to the training algorithm for the output layer, we can get

Ṽ(k+1) = Ṽ(k) + α_v ĝ_v(V̂(k)),

where ĝ_v(V̂(k)) is defined in Section III. Then we have

‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² = 2α_v ĝ_v^T(V̂(k)) Ṽ(k) + α_v² ‖ĝ_v(V̂(k))‖²
  ≈ −2α_v e^T(k)[H(k)cΔ^v + β_v A(k)] (r^v)^T Ṽ(k) / (cρ_v) + α_v² ‖ĝ_v(V̂(k))‖².

Splitting the right-hand side into the feedforward part e^T(k)H(k)Δ^v(r^v)^TṼ(k) and the recurrent part β_v e^T(k)A(k)(r^v)^TṼ(k), replacing H(k)Ṽ(k) with φ_v(k) as defined in Section III, and expressing the recurrent part through the term (r^v)^T(ηI + H^T(k)H(k))^{−1}H^T(k)A(k), the difference of the Lyapunov function can be written in terms of e^T(k)φ_v(k), p_v, β_v, and Y(k). A small perturbation parameter η is added to approximate [H^T(k)H(k)]^{−1} and to guarantee full rank of the inverted matrix [ηI + H^T(k)H(k)], which is similar to a general

nonlinear iteration algorithm [ ]. Using e^T(k)ẽ_v(k) ≤ ½(‖ẽ_v(k)‖² + ‖e(k)‖²) together with e(k) = φ_v(k) + ẽ_v(k) and the bound ‖α_v ĝ_v(V̂(k))‖² ≤ α_v² Z(k)‖e(k)‖²/ρ_v², it follows that

‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ (α_v/ρ_v){p_v + 0.5β_v Y(k)} ‖ẽ_v(k)‖² − (α_v/ρ_v){p_v + 0.5β_v Y(k)} [1 − α_v Z(k)/(ρ_v(p_v + 0.5β_v Y(k)))] ‖e(k)‖²,

where Y(k) and Z(k) are defined in Section III. The dimension p_v of the output layer weight vector is positive; ᾱ_v is a fixed positive scalar; β_v Y(k) ≥ 0 by the definition of the recurrent hybrid adaptive parameter β_v(k+1); and ρ_v ≥ ρ̄ > 0 by the definition of the normalization factor. The adaptive learning rate α_v(k+1) implies that the recurrent training of RRSPSA uses the value ᾱ_v only if the output error of the RNN is larger than the predefined dead zone value and is zero otherwise, which guarantees that ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² is non-positive and that ‖Ṽ(k)‖² is not increasing. The limit of ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² therefore exists, and the term

{‖ẽ_v(k)‖² − [1 − α_v Z(k)/(ρ_v(p_v + 0.5β_v Y(k)))] ‖e(k)‖²}

goes to zero. Then we finally have ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ 0.

Proof of Theorem 

Proof Consider that SPSA is an approximation of the gradient algorithm [ ]. Using the property of local minimum points of the gradient,

(∂(e^T(k)e(k)) / ∂Ŵ(k))^T W̃(k) ≤ 0,

[ ], we have

0 ≤ −(∂(e^T(k)e(k)) / ∂Ŵ(k))^T W̃(k)
  = Σ_{i=1}^{n_h} { ḣ_i(k) e^T(k) V̂^T_{i,:}(k) x^T(k) W̃_{i,:}(k) }
  ≤ (λ/λ_min) Σ_{i=1}^{n_h} { µ_i(k) e^T(k) V̂^T_{i,:}(k) x^T(k) W̃_{i,:}(k) }
  = (λ/λ_min) e^T(k) Ω̃(k) W̃(k)
  = (λ/λ_min) e^T(k) φ_w(k),

where λ is the maximum value of the derivative ḣ_j(k) of the activation function and λ_min ≥ 0 is the minimum nonzero value of the mean value µ_j(k). (Note that the inequality is always true if λ_min = 0, which implies ḣ_j(k) = µ_j(k) = 0.) According to the training algorithm for the hidden layer, we get

W̃(k+1) = W̃(k) + α_w ĝ_w(Ŵ(k)),

where ĝ_w(Ŵ(k)) is defined in Section IV. Using the relations above and the definition of Ω̄(k), we get

‖W̃(k+1)‖² − ‖W̃(k)‖² = 2α_w ĝ_w^T(Ŵ(k)) W̃(k) + α_w² ‖ĝ_w(Ŵ(k))‖²
  ≈ −2α_w e^T(k)[B^+(k) − B^−(k) + β_w D̂_w(k)] (r^w)^T W̃(k) / (cρ_w) + α_w² ‖ĝ_w(Ŵ(k))‖².


More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Multi-layer Neural Networks

Multi-layer Neural Networks Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

More information

Multilayer Neural Networks

Multilayer Neural Networks Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the

More information

Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling

Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling ISSN 746-7233, England, UK World Journal of Modelling and Simulation Vol. 3 (2007) No. 4, pp. 289-298 Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling Yuhui Wang, Qingxian

More information

A Strict Stability Limit for Adaptive Gradient Type Algorithms

A Strict Stability Limit for Adaptive Gradient Type Algorithms c 009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional A Strict Stability Limit for Adaptive Gradient Type Algorithms

More information

Advanced Methods for Recurrent Neural Networks Design

Advanced Methods for Recurrent Neural Networks Design Universidad Autónoma de Madrid Escuela Politécnica Superior Departamento de Ingeniería Informática Advanced Methods for Recurrent Neural Networks Design Master s thesis presented to apply for the Master

More information

inear Adaptive Inverse Control

inear Adaptive Inverse Control Proceedings of the 36th Conference on Decision & Control San Diego, California USA December 1997 inear Adaptive nverse Control WM15 1:50 Bernard Widrow and Gregory L. Plett Department of Electrical Engineering,

More information

6.4 Kalman Filter Equations

6.4 Kalman Filter Equations 6.4 Kalman Filter Equations 6.4.1 Recap: Auxiliary variables Recall the definition of the auxiliary random variables x p k) and x m k): Init: x m 0) := x0) S1: x p k) := Ak 1)x m k 1) +uk 1) +vk 1) S2:

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

IN neural-network training, the most well-known online

IN neural-network training, the most well-known online IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 1, JANUARY 1999 161 On the Kalman Filtering Method in Neural-Network Training and Pruning John Sum, Chi-sing Leung, Gilbert H. Young, and Wing-kay Kan

More information

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units Connectionist Models Consider humans: Neuron switching time ~ :001 second Number of neurons ~ 10 10 Connections per neuron ~ 10 4 5 Scene recognition time ~ :1 second 100 inference steps doesn't seem like

More information

Back-propagation as reinforcement in prediction tasks

Back-propagation as reinforcement in prediction tasks Back-propagation as reinforcement in prediction tasks André Grüning Cognitive Neuroscience Sector S.I.S.S.A. via Beirut 4 34014 Trieste Italy gruening@sissa.it Abstract. The back-propagation (BP) training

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward ECE 47/57 - Lecture 7 Back Propagation Types of NN Recurrent (feedback during operation) n Hopfield n Kohonen n Associative memory Feedforward n No feedback during operation or testing (only during determination

More information

A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK WITH INTERVAL TIME-VARYING DELAYS

A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK WITH INTERVAL TIME-VARYING DELAYS ICIC Express Letters ICIC International c 2009 ISSN 1881-80X Volume, Number (A), September 2009 pp. 5 70 A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK

More information

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Neural Networks Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Outline Part 1 Introduction Feedforward Neural Networks Stochastic Gradient Descent Computational Graph

More information

Single layer NN. Neuron Model

Single layer NN. Neuron Model Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial () if its output depends on (is a nonincreasing function of) the distance of the input from a given stored vector. s represent local receptors, as illustrated

More information

Model Reference Adaptive Control for Multi-Input Multi-Output Nonlinear Systems Using Neural Networks

Model Reference Adaptive Control for Multi-Input Multi-Output Nonlinear Systems Using Neural Networks Model Reference Adaptive Control for MultiInput MultiOutput Nonlinear Systems Using Neural Networks Jiunshian Phuah, Jianming Lu, and Takashi Yahagi Graduate School of Science and Technology, Chiba University,

More information

Analyzing the weight dynamics of recurrent learning algorithms

Analyzing the weight dynamics of recurrent learning algorithms Analyzing the weight dynamics of recurrent learning algorithms Ulf D. Schiller and Jochen J. Steil Neuroinformatics Group, Faculty of Technology, Bielefeld University Abstract We provide insights into

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS

A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS Karima Amoura Patrice Wira and Said Djennoune Laboratoire CCSP Université Mouloud Mammeri Tizi Ouzou Algeria Laboratoire MIPS Université

More information

Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network

Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network LETTER Communicated by Geoffrey Hinton Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network Xiaohui Xie xhx@ai.mit.edu Department of Brain and Cognitive Sciences, Massachusetts

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

More information

Reservoir Computing and Echo State Networks

Reservoir Computing and Echo State Networks An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

EM-algorithm for Training of State-space Models with Application to Time Series Prediction

EM-algorithm for Training of State-space Models with Application to Time Series Prediction EM-algorithm for Training of State-space Models with Application to Time Series Prediction Elia Liitiäinen, Nima Reyhani and Amaury Lendasse Helsinki University of Technology - Neural Networks Research

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption

Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption ANDRÉ NUNES DE SOUZA, JOSÉ ALFREDO C. ULSON, IVAN NUNES

More information

Artificial Neural Networks. Edward Gatt

Artificial Neural Networks. Edward Gatt Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

Neural Network Identification of Non Linear Systems Using State Space Techniques.

Neural Network Identification of Non Linear Systems Using State Space Techniques. Neural Network Identification of Non Linear Systems Using State Space Techniques. Joan Codina, J. Carlos Aguado, Josep M. Fuertes. Automatic Control and Computer Engineering Department Universitat Politècnica

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization Global stabilization of feedforward systems with exponentially unstable Jacobian linearization F Grognard, R Sepulchre, G Bastin Center for Systems Engineering and Applied Mechanics Université catholique

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A Hybrid Time-delay Prediction Method for Networked Control System

A Hybrid Time-delay Prediction Method for Networked Control System International Journal of Automation and Computing 11(1), February 2014, 19-24 DOI: 10.1007/s11633-014-0761-1 A Hybrid Time-delay Prediction Method for Networked Control System Zhong-Da Tian Xian-Wen Gao

More information

Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network

Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network Fadwa DAMAK, Mounir BEN NASR, Mohamed CHTOUROU Department of Electrical Engineering ENIS Sfax, Tunisia {fadwa_damak,

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial basis () if its output depends on (is a non-increasing function of) the distance of the input from a given stored vector. s represent local receptors,

More information

Dual Estimation and the Unscented Transformation

Dual Estimation and the Unscented Transformation Dual Estimation and the Unscented Transformation Eric A. Wan ericwan@ece.ogi.edu Rudolph van der Merwe rudmerwe@ece.ogi.edu Alex T. Nelson atnelson@ece.ogi.edu Oregon Graduate Institute of Science & Technology

More information

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C.

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall John Wiley and Sons, Inc., 2003 Preface... xiii 1. Stochastic Search

More information

Further Results on Model Structure Validation for Closed Loop System Identification

Further Results on Model Structure Validation for Closed Loop System Identification Advances in Wireless Communications and etworks 7; 3(5: 57-66 http://www.sciencepublishinggroup.com/j/awcn doi:.648/j.awcn.735. Further esults on Model Structure Validation for Closed Loop System Identification

More information

y(x n, w) t n 2. (1)

y(x n, w) t n 2. (1) Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Modeling and Analysis of Dynamic Systems

Modeling and Analysis of Dynamic Systems Modeling and Analysis of Dynamic Systems by Dr. Guillaume Ducard c Fall 2016 Institute for Dynamic Systems and Control ETH Zurich, Switzerland G. Ducard c 1 Outline 1 Lecture 9: Model Parametrization 2

More information

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data Revisiting linear and non-linear methodologies for time series - application to ESTSP 08 competition data Madalina Olteanu Universite Paris 1 - SAMOS CES 90 Rue de Tolbiac, 75013 Paris - France Abstract.

More information

Kalman Filters with Uncompensated Biases

Kalman Filters with Uncompensated Biases Kalman Filters with Uncompensated Biases Renato Zanetti he Charles Stark Draper Laboratory, Houston, exas, 77058 Robert H. Bishop Marquette University, Milwaukee, WI 53201 I. INRODUCION An underlying assumption

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Long-Short Term Memory

Long-Short Term Memory Long-Short Term Memory Sepp Hochreiter, Jürgen Schmidhuber Presented by Derek Jones Table of Contents 1. Introduction 2. Previous Work 3. Issues in Learning Long-Term Dependencies 4. Constant Error Flow

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

NONLINEAR PLANT IDENTIFICATION BY WAVELETS

NONLINEAR PLANT IDENTIFICATION BY WAVELETS NONLINEAR PLANT IDENTIFICATION BY WAVELETS Edison Righeto UNESP Ilha Solteira, Department of Mathematics, Av. Brasil 56, 5385000, Ilha Solteira, SP, Brazil righeto@fqm.feis.unesp.br Luiz Henrique M. Grassi

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

4.0 Update Algorithms For Linear Closed-Loop Systems

4.0 Update Algorithms For Linear Closed-Loop Systems 4. Update Algorithms For Linear Closed-Loop Systems A controller design methodology has been developed that combines an adaptive finite impulse response (FIR) filter with feedback. FIR filters are used

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

Learning Neural Networks

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

CHALMERS, GÖTEBORGS UNIVERSITET. EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

CHALMERS, GÖTEBORGS UNIVERSITET. EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD CHALMERS, GÖTEBORGS UNIVERSITET EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 135, FIM 72 GU, PhD Time: Place: Teachers: Allowed material: Not allowed: October 23, 217, at 8 3 12 3 Lindholmen-salar

More information

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang C4 Phenomenological Modeling - Regression & Neural Networks 4040-849-03: Computational Modeling and Simulation Instructor: Linwei Wang Recall.. The simple, multiple linear regression function ŷ(x) = a

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Artem Chernodub, Institute of Mathematical Machines and Systems NASU, Neurotechnologies

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Least Mean Squares Regression

Least Mean Squares Regression Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method

More information