This document is downloaded from DR-NTU, Nanyang Technological University Library, Singapore.


Title: A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks

Author(s): Xu, Zhao; Song, Qing; Wang, Danwei

Citation: Xu, Z., Song, Q., & Wang, D. A robust recurrent simultaneous perturbation stochastic approximation training algorithm for recurrent neural networks. Neural Computing and Applications.

Rights: © Springer-Verlag London Limited. This is the author-created version of a work that has been peer reviewed and accepted for publication by Neural Computing and Applications, Springer-Verlag London Limited. It incorporates the referee's comments, but changes resulting from the publishing process, such as copyediting and structural formatting, may not be reflected in this document.

Neural Computing and Applications

A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks
--Manuscript Draft--

Manuscript Number: NCA-0R
Full Title: A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks
Article Type: Original Article
Keywords: recurrent neural networks; stability and convergence analysis; SPSA
Corresponding Author: Zhao Xu, Ph.D., Singapore, SINGAPORE
First Author: Zhao Xu, Ph.D.
Order of Authors: Zhao Xu, Ph.D.; Qing Song; Danwei Wang

Response to Reviewers:

Comment: For simplicity of notation, the time subscripts of several parameters are ignored. However, the time subscripts should be kept in some definitions to differentiate variables and constants.
Reply: The time subscripts have been added in the equations.

Comment: Spacing is irregular between paragraphs and equations.
Reply: The page has been revised.

Comment: In the simulation, the authors simply show the results without a detailed explanation of why RRSPSA achieves better performance than the other algorithms.
Reply: Simulation results show that the BP training algorithm performs worst compared with the other methods. The reason is that the second partial derivatives are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter that guarantees weight convergence in the presence of noise. RRSPSA utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence.

Comment: The paper seems to consider only the synaptic dynamics; do the authors consider the neuron dynamics?
Reply: Neural engineers have usually studied either the synaptic dynamic system with weight updating or the neural dynamic system with neuron state changing independently; the two are equally important.

However, since the celebrated backpropagation training algorithm has been successfully applied to multilayered RNNs such as the Jordan and Elman networks, the taxonomy boundaries between studies of the two dynamic systems have become somewhat fuzzy. It should be pointed out that our study does not deal with the neural dynamic system, because the speed of weight convergence is usually much slower than that of the neurons.


Noname manuscript No. (will be inserted by the editor)

A Robust Recurrent Simultaneous Perturbation Stochastic Approximation Training Algorithm for Recurrent Neural Networks

Zhao Xu · Qing Song · Danwei Wang

Received: date / Accepted: date

Abstract Training of recurrent neural networks (RNNs) introduces considerable computational complexity due to the need for gradient evaluations. How to obtain fast convergence speed and low computational complexity remains a challenging and open topic. In addition, the transient response of the learning process of RNNs is a critical issue, especially for on-line applications. Conventional RNN training algorithms, such as backpropagation through time (BPTT) and real-time recurrent learning (RTRL), have not adequately satisfied these requirements because they often suffer from slow convergence speed. If a large learning rate is chosen to improve performance, the training process may become unstable in terms of weight divergence. In this paper, a novel RNN training algorithm, named robust recurrent simultaneous perturbation stochastic approximation (RRSPSA), is developed with a specially designed recurrent hybrid adaptive parameter and adaptive learning rates. RRSPSA is a powerful novel twin-engine simultaneous perturbation stochastic approximation (SPSA) type of RNN training algorithm. It utilizes three specially designed adaptive parameters to maximize training speed for a recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements, as in the original SPSA algorithm. RRSPSA is proved to have guaranteed weight convergence and system stability in the sense of a Lyapunov function. Computer simulations were carried out to demonstrate the applicability of the theoretical results.

Zhao Xu
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: xuzh000@e.ntu.edu.sg

Qing Song
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: eqsong@e.ntu.edu.sg

Danwei Wang
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
E-mail: EDWWANG@e.ntu.edu.sg

Main Notation List

k: discrete time step.
m: output dimension.
n_I: input dimension.
n_h: number of hidden neurons.
p_v = m·n_h: dimension of the output layer weight vector.
p_w = n_I·n_h: dimension of the hidden layer weight vector.
l: number of time-delayed outputs.
u_i(k), i = 1, ..., m: external inputs at time step k.
x(k) ∈ R^{n_I}: neural network input vector.
Δx(k) ∈ R^{n_I}: perturbation vector on the input x(k).
y(k) ∈ R^m: desired output vector.
ŷ(k) ∈ R^m: estimated output vector.
ε(k) ∈ R^m: total disturbance vector.
e(k) ∈ R^m: output estimation error vector.
V̂(k) ∈ R^{p_v}: estimated weight vector of the output layer.
V* ∈ R^{p_v}: ideal weight vector of the output layer.
Ṽ(k) ∈ R^{p_v}: weight estimation error vector of the output layer.
Ŵ(k) ∈ R^{p_w}: estimated weight vector of the hidden layer.
W* ∈ R^{p_w}: ideal weight vector of the hidden layer.
W̃(k) ∈ R^{p_w}: weight estimation error vector of the hidden layer.
Ŵ_{j,:}(k) ∈ R^{n_I}: the jth row vector of the hidden layer weight matrix Ŵ(k).
δ_v(k): equivalent approximation error of the loss function of the output layer.
δ_w(k): equivalent approximation error of the loss function of the hidden layer.
c(k): perturbation gain parameter of the SPSA.
Δ^v(k) ∈ R^{p_v}, r^v(k) ∈ R^{p_v}: perturbation vectors of the output layer.
Δ^w(k) ∈ R^{p_w}, r^w(k) ∈ R^{p_w}: perturbation vectors of the hidden layer.
H(Ŵ(k), x(k)) = H(k) ∈ R^{m×p_v}: nonlinear activation function matrix.
h_j(k) = h(Ŵ_{j,:}(k)x(k)): the nonlinear activation function and scalar element of H(Ŵ(k), x(k)).
α_v(k), α_w(k): adaptive learning rates of the output and hidden layers.
ᾱ_v, ᾱ_w: positive scalars.
ρ_v(k), ρ_w(k): normalization factors of the output and hidden layers.
β_v(k): recurrent hybrid adaptive parameter of the output layer.

β_w(k): recurrent hybrid adaptive parameter of the hidden layer.
µ_j(k) ∈ R, 1 ≤ j ≤ n_h: mean value associated with the input vectors of the jth hidden layer neuron.
µ_j(k) ∈ R, 1 ≤ j ≤ n_h: mean value associated with the hidden layer weight vector of the jth hidden layer neuron.
τ: positive scalar.
λ: positive gain parameter of the threshold function.
η: a small perturbation parameter.

Introduction

Recurrent neural networks (RNNs) are inherently dynamic nonlinear systems that have signal feedback connections and are able to process temporal sequences. RNNs with free synaptic weights are more useful for complex dynamical problems such as nonlinear dynamic system identification, time-series prediction, and adaptive control than Hopfield neural networks (HNNs), in which the weights are fixed and symmetric [ ]. To maximize the potential and capabilities of RNNs, one of the preliminary tasks is to design an efficient training algorithm with fast weight convergence and low computational complexity. Gradient descent optimization techniques are often used for neural networks, adjusting the weights along the negative direction of the error gradient. However, due to the dynamic structure of RNNs, the obtained error gradient tends to be very small in a classical BP training algorithm with a fixed learning step, because an RNN weight depends not only on the current output but also on the outputs of past steps. It is widely acknowledged that the recurrent derivative of multilayered RNNs, i.e. the recurrent training signal (see Fig. ), which contains the information of dynamic time dependence in training, is necessary to obtain a good time response of RNNs [ ].

From the computational efficiency and weight convergence points of view, Spall proposed the simultaneous perturbation stochastic approximation (SPSA) method to find roots and minima of neural network functions that are contaminated with noise [ ]. The name simultaneous perturbation arises from the fact that all elements of the unknown weight parameters are varied simultaneously. SPSA has the potential to be significantly more efficient than the usual p-dimensional algorithms (of the Kiefer-Wolfowitz/Blum type) that are based on standard finite-difference gradient approximations, and it is shown in [ ] that approximately the same level of estimation accuracy can typically be achieved with only 1/p-th the amount of data needed in the standard approach. The main advantages of SPSA are its simplicity and excellent weight convergence property, in which only two objective function measurements are used. Thus, SPSA is more economical in terms of loss function evaluations, which are usually the most expensive part of an optimization problem. SPSA has attracted great attention in application areas such as neural network training, adaptive control, and model parameter estimation [ ].

Conventional training methods for RNNs are usually classified as backpropagation through time (BPTT) [ ] and real-time recurrent learning (RTRL) [ ] algorithms, which often suffer from the drawbacks of slow convergence speed and heavy computational load. In fact, BPTT cannot be used for on-line application due to computational problems. For standard RTRL training, recursive weight updating requires a small learning step to obtain weight convergence and system stability of the RNN. If a large learning rate is selected to speed up weight updating, the training process of RTRL may lead to a large steady-state error or even weight divergence [ ].

In previous work, we developed an adaptive SPSA-based training scheme for feedforward neural networks (FNNs) and demonstrated that the adaptive SPSA algorithm can be used to derive a fast training algorithm with guaranteed weight convergence [ ]. In this paper, we extend our SPSA research to RNNs with free weight updating and propose a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of deterministic systems with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems and contain recurrent connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step but also those at previous time steps must be perturbed, which makes the learning prone to divergence and the convergence analysis complicated to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and the dead zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig.  and the explanations in Sections III and IV).

There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, due to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. RRSPSA is a powerful twin-engine SPSA type of RNN training algorithm. It utilizes three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements, as in the standard SPSA algorithm. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.

The rest of this paper is organized as follows. In Section II, we briefly introduce the structure of the RNN and the conventional SPSA algorithm. In Sections III and IV, the RRSPSA algorithm and its robustness analysis are derived for the output and hidden layers, respectively. Computer simulation results are presented in Section V to show the efficiency of the proposed RRSPSA. Section VI draws the final conclusions.

Basics of the RNN and SPSA

To simplify the notation of this paper, we consider a three-layer, m-external-input and m-output RNN with the output feedback structure shown in Fig. , which can easily be extended to an RNN with arbitrary numbers of inputs and outputs. The output vector of the RNN is ŷ(k), which can be presented as

ŷ(k) = H(Ŵ(k), x(k)) V̂(k) = [ŷ_1(k), ..., ŷ_m(k)]^T ∈ R^m,

where V̂(k) ∈ R^{p_v} (p_v = m·n_h) is the long vector form of the output layer weight matrix V̂(k) ∈ R^{n_h×m}, defined by

V̂(k) = [V̂_{1,1}(k), ..., V̂_{n_h,1}(k), ..., V̂_{1,m}(k), ..., V̂_{n_h,m}(k)]^T,

x(k) ∈ R^{n_I} (n_I = (l+1)·m) is the input vector of the RNN, defined by

x(k) = [x_1(k), x_2(k), ..., x_{n_I}(k)]^T = [ŷ^T(k−1), ..., ŷ^T(k−l), u_1(k), ..., u_m(k)]^T,

and Ŵ(k) ∈ R^{p_w} (p_w = n_I·n_h) is the long vector form of the hidden layer weight matrix Ŵ(k) ∈ R^{n_h×n_I}, defined by

Ŵ(k) = [Ŵ_{1,:}(k), Ŵ_{2,:}(k), ..., Ŵ_{n_h,:}(k)].

H(Ŵ(k), x(k)) ∈ R^{m×p_v} is the nonlinear activation block-diagonal matrix given by

H(k) = H(Ŵ(k), x(k)) =
[ h_1(k), ..., h_{n_h}(k)    0, ..., 0                  ...    0, ..., 0
  0, ..., 0                  h_1(k), ..., h_{n_h}(k)    ...    0, ..., 0
  ...                        ...                        ...    ...
  0, ..., 0                  0, ..., 0                  ...    h_1(k), ..., h_{n_h}(k) ],

where h_j(k), 1 ≤ j ≤ n_h, is one of the most popular nonlinear activation functions,

h_j(k) = h(Ŵ_{j,:}(k)x(k)) = 1 / (1 + exp(−λ Ŵ_{j,:}(k)x(k))),

with Ŵ_{j,:}(k) ∈ R^{n_I} being the jth row vector of Ŵ(k), and λ > 0 is the gain parameter of the threshold function, defined specifically for ease of use later when deriving the sector condition of the hidden layer. The output vector ŷ(k) is an approximation of the ideal nonlinear function y*(k) = H(W*, x(k)) V*, with V* and W* being the ideal output layer and hidden layer weights, respectively.
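As a concrete illustration of the forward pass just described, the following minimal Python sketch (our own illustration, not the paper's code; all function and variable names are ours) computes ŷ(k) = H(Ŵ(k), x(k))V̂(k) for the output-feedback RNN, assuming the sigmoid activation with gain λ given above.

```python
import numpy as np

def sigmoid(z, lam=1.0):
    # h(z) = 1 / (1 + exp(-lam * z)), the threshold function with gain parameter lam
    return 1.0 / (1.0 + np.exp(-lam * z))

def rnn_forward(W_hat, V_hat, y_delays, u, lam=1.0):
    """One forward step of the output-feedback RNN (illustrative sketch).

    W_hat    : (n_h, n_I) hidden layer weight matrix
    V_hat    : (n_h, m)   output layer weight matrix
    y_delays : list of the l previous estimated outputs [yhat(k-1), ..., yhat(k-l)]
    u        : (m,)       external input u(k)
    """
    x = np.concatenate(y_delays + [u])   # input vector x(k) of dimension n_I = (l+1)*m
    h = sigmoid(W_hat @ x, lam)          # hidden activations h_j(k), j = 1..n_h
    y_hat = V_hat.T @ h                  # yhat(k) = H(W(k), x(k)) V(k), written compactly
    return y_hat, h, x
```

Writing the output as V̂^T h is simply a compact equivalent of multiplying the block-diagonal matrix H(k) by the long weight vector V̂(k).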

Fig.  Block diagram of RRSPSA training for the output-feedback-based RNN.

The SPSA algorithm solves the problem of finding an optimal weight θ* of the gradient equation ĝ(θ̂(k)) = ∂L(θ̂(k))/∂θ̂(k) = 0 for some differentiable loss function L: R^p → R, where θ̂(k) is the estimated weight vector of the neural network. The basic idea is to vary all elements of the vector θ̂(k) simultaneously and approximate the gradient function using only two measurements of the loss function, without requiring exact derivatives or a large number of function evaluations, which provides a significant improvement in efficiency over standard gradient descent algorithms. Define

g^+(θ̂(k)) = L(θ̂(k) + c(k)Δ(k)) + ε^+(k),
g^−(θ̂(k)) = L(θ̂(k) − c(k)Δ(k)) + ε^−(k),

where ε^+(k) and ε^−(k) represent measurement noise terms, c(k) is a scalar, and Δ(k) = [Δ_1(k), ..., Δ_p(k)] is the perturbation vector. Then the estimate of g(θ̂(k)) at the kth iteration is

ĝ(θ̂(k)) = [ (g^+(θ̂(k)) − g^−(θ̂(k))) / (2c(k)Δ_1(k)), ..., (g^+(θ̂(k)) − g^−(θ̂(k))) / (2c(k)Δ_p(k)) ]^T.

Theory and numerical experience in [ ] indicate that SPSA can be significantly more efficient than standard finite-difference-based algorithms in large-dimensional problems. Inspired by the excellent economy and convergence properties of SPSA, we propose the RRSPSA algorithm under the framework of deterministic systems, based on the idea of simultaneous perturbation with guaranteed weight convergence. The loss function for RRSPSA is defined as

L(θ̂(k)) = ‖e(k)‖²,    e(k) = y(k) − ŷ(k) + ε(k),

where y(k) is the ideal output, ŷ(k) is a function of θ̂(k), and ε(k) is a disturbance term. Similar to SPSA, the RRSPSA algorithm seeks the optimal weight vector θ* of the gradient equation, i.e., the weight vector that minimizes the differentiable loss function L(θ̂(k)).
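For reference, here is a small Python sketch (our own illustration, not the paper's code) of the classical two-measurement SPSA gradient estimate above, assuming a Bernoulli ±1 perturbation vector and a possibly noisy scalar loss function.

```python
import numpy as np

def spsa_gradient(loss, theta, c, rng=np.random.default_rng()):
    """Two-measurement SPSA estimate of the gradient of `loss` at `theta`.

    loss : callable mapping a weight vector to a (possibly noisy) scalar L(theta)
    c    : perturbation gain c(k)
    """
    p = theta.size
    delta = rng.choice([-1.0, 1.0], size=p)   # simultaneous perturbation vector Delta(k)
    g_plus = loss(theta + c * delta)          # L(theta + c*Delta) plus measurement noise
    g_minus = loss(theta - c * delta)         # L(theta - c*Delta) plus measurement noise
    # element-wise estimate (g+ - g-) / (2 c Delta_i); note 1/Delta_i = Delta_i for +/-1 entries
    return (g_plus - g_minus) / (2.0 * c * delta)

# example usage on a simple quadratic loss
g = spsa_gradient(lambda t: float(np.sum((t - 1.0) ** 2)), np.zeros(5), c=0.01)
```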

That is, the proposed RRSPSA algorithm for updating θ̂(k) ∈ R^{p_w+p_v} as an estimate of the ideal weight vector θ* is of the form

θ̂(k+1) = θ̂(k) − α(k+1) ĝ(θ̂(k)),

where α(k+1) is an adaptive learning rate and ĝ(θ̂(k)) is an approximation of the gradient of the loss function. This approximated (normalized) gradient function is of the SPSA form

ĝ(θ̂(k)) = [L(θ̂(k) + c(k+1)Δ(k+1)) − L(θ̂(k) − c(k+1)Δ(k+1)) + δ(k)] r(k+1) / (c(k+1)ρ(k+1)),

where δ(k) is an equivalent approximation error of the loss function, also called the measurement noise, and ρ(k+1) is the normalization factor. Δ(k+1), r(k+1), and c(k+1) > 0 are the controlling parameters of the algorithm, defined as follows.

1. The perturbation vector Δ(k+1) = [Δ_1(k+1), ..., Δ_{p_w+p_v}(k+1)]^T ∈ R^{p_w+p_v} is a bounded random directional vector that is used to perturb the weight vector simultaneously and can be generated randomly.
2. The sequence r(k+1) is defined as r(k+1) = [1/Δ_1(k+1), ..., 1/Δ_{p_w+p_v}(k+1)]^T ∈ R^{p_w+p_v}.
3. c(k+1) > 0 is a sequence of positive numbers [ ]. c(k+1) can be chosen as a small constant for a slowly time-varying system, as pointed out in [ ], and it controls the size of the perturbations.

Note that, for simplicity of notation, we ignore the time subscripts of several parameters where no confusion is caused, such as Δ(k+1), r(k+1), c(k+1), δ(k), and α(k+1). A small sketch of how these controlling parameters can be generated is given below.
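The following minimal Python illustration (ours, not the paper's) shows one common way to generate the controlling parameters: a Bernoulli ±1 perturbation vector Δ(k+1), its element-wise inverse r(k+1), and a small constant perturbation gain c.

```python
import numpy as np

def controlling_parameters(p_total, c=0.01, rng=np.random.default_rng()):
    """Generate Delta(k+1), r(k+1), and c for one RRSPSA iteration.

    p_total : p_w + p_v, the total number of weights being perturbed
    c       : small constant perturbation gain (assumes a slowly time-varying system)
    """
    delta = rng.choice([-1.0, 1.0], size=p_total)   # bounded random directional vector
    r = 1.0 / delta                                  # r_i(k+1) = 1 / Delta_i(k+1)
    return delta, r, c
```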

Output Layer Training and Stability Analysis

In a multilayered neural network, the output layer weight vector V̂(k) and the hidden layer weight vector Ŵ(k) are normally updated separately using different learning rates, as shown in Fig.  and as in the standard BP training algorithm [ ]. During the updating of the output layer weights, the hidden layer weights are fixed. We are now ready to propose the RRSPSA algorithm for updating the estimated weight vector V̂(k) of the output layer. That is,

V̂(k+1) = V̂(k) − α_v(k+1) ĝ_v(V̂(k)),

where ĝ_v(V̂(k)) ∈ R^{p_v} is the normalized gradient approximation that uses the simultaneous perturbation vectors Δ^v(k+1) ∈ R^{p_v} and r^v(k+1) ∈ R^{p_v} to stimulate the weights. Δ^v(k+1) and r^v(k+1) are, respectively, the first p_v components of the perturbation vectors Δ(k+1) and r(k+1) defined above. The gradient approximation is

ĝ_v(V̂(k)) = −e^T(k)H(k)Δ^v r^v / ρ_v − β_v(k+1) e^T(k)A(k) r^v / (cρ_v(k+1)),

where A(k) ∈ R^m is the extended recurrent gradient defined as

A(k) = [H(Ŵ(k), D̂_v(k)(V̂(k) + cΔ^v)) − H(Ŵ(k), D̂_v(k)(V̂(k) − cΔ^v))] V̂(k),

D̂_v(k) = [H^T(k−1), ..., H^T(k−l), 0, ..., 0]^T.

For simplicity of notation, we ignore the time subscripts of Δ^v(k+1), r^v(k+1), β_v(k+1), ρ_v(k+1), and α_v(k+1) in the remainder of the paper.

Remark  Our purpose is to find the desired output layer weight vector V* of the gradient equation ĝ(V̂(k)) = ∂L(V̂(k))/∂V̂(k) = 0. Inspired by the excellent properties of SPSA, we obtain ĝ(V̂(k)) by varying all elements of the unknown weight parameters simultaneously, as shown in the proof of Lemma , under the framework of deterministic systems. We will show that RRSPSA has the same form as SPSA and uses only two objective function measurements at each iteration, which maintains the efficiency of SPSA. However, the weight convergence proof and stability analysis are quite different from those in [ ] under the framework of deterministic systems, as shown in the theorems below. For RNNs, the outputs depend not only on the weights at the current time step but also on those at previous time steps, since the current input vector contains the delayed outputs. The first term on the right-hand side of the gradient approximation above is the result of the perturbation of the feedforward signals in the RNN, and A(k) is the result of the perturbation of the recurrent signals, which makes the training complicated and the system prone to instability. In addition to the simplicity and efficiency of RRSPSA, one powerful aspect of the developed algorithm is that it is able to use an effectively large adaptive learning rate ᾱ_v to accelerate weight convergence with guaranteed system stability. To avoid weight divergence at each time step, an adaptive parameter β_v is used to control the recurrent signals and switch off the recurrent connections whenever the convergence condition is violated. The normalization factor ρ_v is also used to bound the signals in the learning algorithm, as traditionally done in adaptive control systems [ ]. α_v guarantees weight convergence during training, which means the weights are only updated when the convergence and stability requirements are met, and it does not need to be small as in the RTRL training algorithm. The sufficient conditions for the three key parameters are given after Theorem  below. D̂_v(k) is obtained during the mathematical deduction in the proof of Lemma .

Lemma  The normalized gradient approximation ĝ_v(V̂(k)) above is an equivalent presentation of the standard SPSA algorithm.

Proof We will prove that the gradient approximation above is obtained by perturbing all the weights simultaneously, as in SPSA. Rewrite the loss function as

L(θ̂(k)) = L(V̂(k), Ŵ(k)) = ‖y(k) − ŷ(k) + ε(k)‖².

Further, using ŷ(k) = H(k)V̂(k), define

ŷ^{v+}(V̂(k)) = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [ŷ^T(k−1), ..., ŷ^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [V̂^T(k−1)H^T(k−1), ..., V̂^T(k−l)H^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [(V̂(k−1) + cΔ^v)^T H^T(k−1), ..., (V̂(k−l) + cΔ^v)^T H^T(k−l), 0, ..., 0]^T) V̂(k).

ŷ^{v+}(V̂(k)) has the same motivation as g^+(θ̂(k)) of SPSA, namely to perturb all the weights simultaneously. Different from FNN training, which perturbs only the current weights, the input vector must be perturbed as well for RNN training because of the time dependence in RNNs. The disturbance Δx(k+1) perturbs the weights at the previous time steps, which are contained in x(k). However, the last equality above is difficult to implement. Here we assume that the model parameters do not change appreciably between iterations, i.e. V̂(k) ≈ V̂(k−1) ≈ ... ≈ V̂(k−l). We can then derive an approach similar to the RTRL introduced by Williams and Zipser [ ]. Moreover, such an approximation makes the proposed algorithm real-time and adaptive in a recursive fashion. Under this approximation,

ŷ^{v+}(V̂(k)) ≈ H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [(V̂(k) + cΔ^v)^T H^T(k−1), ..., (V̂(k) + cΔ^v)^T H^T(k−l), 0, ..., 0]^T) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), [H^T(k−1), ..., H^T(k−l), 0, ..., 0]^T (V̂(k) + cΔ^v)) V̂(k)
            = H(k)(V̂(k) + cΔ^v) + H(Ŵ(k), D̂_v(k)(V̂(k) + cΔ^v)) V̂(k),

where D̂_v(k) is defined earlier. By the approximation V̂(k) ≈ V̂(k−1) ≈ ... ≈ V̂(k−l), the recurrent terms H^T(k−1), ..., H^T(k−l) can be obtained iteratively from the previous steps to speed up the calculation of RRSPSA (normally, the input u(k) is independent of the weights, so the last entries of D̂_v(k) are zero vectors). Weight convergence and system stability can be guaranteed by using the three adaptive learning rates, which will be explained in detail later. Similarly, we can get

ŷ^{v−}(V̂(k)) = H(k)(V̂(k) − cΔ^v) + H(Ŵ(k), x(k) − Δx(k+1)) V̂(k)
            ≈ H(k)(V̂(k) − cΔ^v) + H(Ŵ(k), D̂_v(k)(V̂(k) − cΔ^v)) V̂(k),

so that

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + A(k),

where the extended recurrent gradient A(k) is as defined earlier. A(k) is the result of the recurrent connections. To guarantee weight convergence and system stability during training, we insert an adaptive recurrent hybrid parameter β_v(k+1) (see Fig. ) to control the recurrent training signal. The convergence condition is examined at each step, and the recurrent contribution is switched off whenever the condition is violated. Thus, the above relation can be rewritten as

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + β_v A(k).

Accordingly, we can get

ĝ_v(V̂(k)) = [L(Ŵ(k), V̂(k) + cΔ^v) − L(Ŵ(k), V̂(k) − cΔ^v) + δ_v(k)] r^v / (cρ_v)
          = [‖y(k) − ŷ^{v+}(V̂(k))‖² − ‖y(k) − ŷ^{v−}(V̂(k))‖² + δ_v(k)] r^v / (cρ_v)
          = {[y(k) − ŷ^{v+}(V̂(k)) + y(k) − ŷ^{v−}(V̂(k))]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] + δ_v(k)} r^v / (cρ_v)
          = [y(k) − ŷ(k) + ε(k)]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] r^v / (cρ_v)
          = e^T(k) [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] r^v / (cρ_v)
          ≈ −e^T(k)H(k)Δ^v r^v / ρ_v − β_v e^T(k)A(k) r^v / (cρ_v).

e(k) in the fifth equality comes from the definition of the training error. The fourth equality reveals the relationship between the gradient approximation error δ_v(k) and the system disturbance

ε(k), i.e.,

δ_v(k) = [ŷ^{v+}(V̂(k)) + ŷ^{v−}(V̂(k)) − 2ŷ(k) + ε(k)]^T [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))] ≈ ε^T(k) [ŷ^{v−}(V̂(k)) − ŷ^{v+}(V̂(k))].

In the fifth equality, the gradient approximation has exactly the same presentation as the original one in [ ]. Only two objective function measurements are used at each step, which maintains the efficiency of SPSA. The training error e(k) comes from the weight estimation error and the disturbance. For the stability analysis, e(k) will be linked to the weight estimation error of the output layer and the disturbance. Define the weight estimation errors of the output layer and the hidden layer as

Ṽ(k) = V* − V̂(k),    W̃(k) = W* − Ŵ(k),

where V* and W* are the ideal (optimal) weight vectors corresponding to V̂(k) and Ŵ(k). The training error can then be rewritten as

e(k) = y(k) − ŷ(k) + ε(k)
     = H(W*, x(k))V* − H(k)V̂(k) + ε(k)
     = H(W*, x(k))V* − H(k)V* + H(k)V* − H(k)V̂(k) + ε(k).

Because the term H(W*, x(k))V* − H(k)V* is temporarily bounded during output layer training (the hidden layer weights are fixed), we define

ẽ_v(k) = H(W*, x(k))V* − H(k)V* + ε(k) = H̃(Ŵ(k), W*, x(k))V* + ε(k),

with H̃(Ŵ(k), W*, x(k)) = H(W*, x(k)) − H(k). The training error caused by the weight estimation error of the output layer is defined as

φ_v(k) = H(k)Ṽ(k).

Then the training error can be rewritten as

e(k) = φ_v(k) + ẽ_v(k).

Hence, the training error is directly linked to the weight estimation error of the output layer, contained in φ_v(k), and to the disturbance ẽ_v(k). This is critical for the convergence proof.

Theorem  If the output layer of the RNN is trained by the RRSPSA algorithm, Ṽ(k) is guaranteed to be convergent in the sense of the Lyapunov function, ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ 0 for all k, with Ṽ(k) = V* − V̂(k), and the training error will be L-stable in the sense of ẽ_v(k) ∈ L and φ_v(k) ∈ L.

Proof See Section .

The conditions for the three adaptive parameters of the output layer are as follows. The detailed procedure for obtaining these conditions is shown in the proof of Theorem .

1. ρ_v(k+1) is the normalization factor, defined by

ρ_v(k+1) = τρ_v(k) + max{ ᾱ_v Z(k) / (p_v + 0.5β_v Y(k)) + 1, ρ̄ },

where 0 < τ < 1, ᾱ_v > 0, and ρ̄ > 0 are constants, and

Y(k) = (r^v)^T (ηI + H^T(k)H(k))^{−1} H^T(k) A(k) / c,

where η is a small perturbation parameter, and

Z(k) = ‖H(k)cΔ^v + β_v A(k)‖² ‖r^v‖² / c².

2. α_v(k+1) is an adaptive learning rate similar to the dead zone approach in classical neural network and adaptive control theory:

α_v(k+1) = ᾱ_v  if ‖e(k)‖ ≥ Λ(k),
α_v(k+1) = 0    if ‖e(k)‖ < Λ(k),

with Λ(k) = ẽ_v^m(k) / sqrt(1 − ᾱ_v Z(k) / (ρ_v(p_v + 0.5β_v Y(k)))) and ẽ_v^m(k) being the maximum value of ẽ_v(k).

3. β_v(k+1) is the recurrent hybrid adaptive parameter, determined by

β_v(k+1) = 1  if Y(k) ≥ 0,
β_v(k+1) = 0  if Y(k) < 0.

Remark  According to the theoretical analysis, the three parameters β_v(k+1), α_v(k+1), and ρ_v(k+1) play a key role in the proof of weight convergence and the stability analysis. They are designed to guarantee the negativeness of the Lyapunov function difference. The function of β_v(k+1) is to automatically switch off the recurrent training signal (in the lower left corner of Fig. ) when the convergence and stability requirements are violated because of the recurrent training signals. The normalization factor ρ_v(k+1) is used to bound the signals in the learning algorithm, as traditionally done in adaptive control systems [ ]. α_v(k+1) guarantees weight convergence during training, which means that the weights are only updated when the convergence and stability requirements are met. Moreover, α_v(k+1) does not need to be small as in the RTRL training algorithm, which makes training by RRSPSA faster than training by RTRL. From the definition of α_v(k+1), it can be seen that the weights are updated (α_v(k+1) = ᾱ_v) only if the training error is larger than the dead zone value; learning stops when the training error falls below the dead zone level. This provides good generalization rather than learning the system noise. In this paper, we choose a suitable value of ẽ_v^m(k) according to the noise level.
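At run time, the two switching rules above reduce to simple threshold tests. The short Python sketch below (our own illustration; the threshold Lambda_k and the scalar Y_k are assumed to have been computed from the expressions above) shows the intended behaviour.

```python
def dead_zone_learning_rate(e_norm, Lambda_k, alpha_bar):
    """alpha_v(k+1): update only when the output error exceeds the dead-zone level."""
    return alpha_bar if e_norm >= Lambda_k else 0.0

def recurrent_hybrid_parameter(Y_k):
    """beta_v(k+1): switch the recurrent training signal off when Y(k) is negative."""
    return 1.0 if Y_k >= 0.0 else 0.0
```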

Remark  The perturbation size c(k+1) is a critical variable, as discussed in the original SPSA papers [ ]. A natural question is whether c(k+1) affects the learning procedure of the recurrent learning algorithm. The answer is yes. This is not obvious from the updating formula of ĝ_v(V̂(k)), but it can easily be seen by expanding the second term A(k) so that it is explicit in c(k). According to the mean value theorem and the fact that the activation function is nondecreasing, it can readily be shown that there exist unique positive mean values µ_j(k) such that

h(Ŵ_{j,:}(k)x_1(k)) − h(Ŵ_{j,:}(k)x_2(k)) = µ_j(k)Ŵ_{j,:}(k)(x_1(k) − x_2(k)),

where Ŵ_{j,:}(k) is the estimated weight vector linked to the jth hidden layer neuron. From this relation and the block-diagonal structure of H, A(k) can be further expanded as

A(k) = cΩ̄(k),

where Ω̄(k) ∈ R^m is defined as

Ω̄(k) = [ Ω_1(k), ..., Ω_{n_h}(k)    0, ..., 0    ...    0, ..., 0
          ...
          0, ..., 0    ...    Ω_1(k), ..., Ω_{n_h}(k) ] V̂(k),

with Ω_j(k) = µ_j(k)Ŵ_{j,:}(k)D̂_v(k)Δ^v, j = 1, ..., n_h. With this expansion of A(k), c(k) is explicit. To reveal the effect of c(k), we further expand

ŷ^{v+}(V̂(k)) − ŷ^{v−}(V̂(k)) ≈ H(k)cΔ^v + β_v(k+1)cΩ̄(k).

Using this and the earlier relations, we obtain the presentation of the system disturbance in terms of the measurement noise δ_v(k) of the RRSPSA algorithm:

ε(k) = {[H(k)Δ^v + β_vΩ̄(k)][H(k)Δ^v + β_vΩ̄(k)]^T}^{−1} [H(k)Δ^v + β_vΩ̄(k)] δ_v(k) / c.

Furthermore, according to the definition of the equivalent disturbance, we have

ẽ_v(k) = H̃(Ŵ(k), W*, x(k))V* + ε(k)
       = H̃(Ŵ(k), W*, x(k))V* + {[H(k)Δ^v + β_vΩ̄(k)][H(k)Δ^v + β_vΩ̄(k)]^T}^{−1} [H(k)Δ^v + β_vΩ̄(k)] δ_v(k) / c.

The equivalent disturbance ẽ_v(k) is important, since its maximum value ẽ_v^m(k) decides the range of the dead zone for α_v. Note also that the measurement noise δ_v(k) is defined in the same way as in [ ]; it measures the difference between the true gradient and the gradient approximation. Because the system disturbance ε(k) is

determined only by external conditions, a relatively small gain parameter c implies a small measurement noise δ_v(k) according to the relation above. This allows us to choose the dead zone range mainly according to the external system disturbance ε(k). Furthermore, α_v is nonzero whenever the norm of e(k) is larger than the dead zone value, which is related to ẽ_v^m(k). If the external disturbance is also small, we can choose a small dead zone range, which ensures that the adaptive learning rate α_v keeps a nonzero value for a prolonged period of training time.

Hidden Layer Training and Stability Analysis

The algorithm for updating the estimated weight vector Ŵ(k) of the hidden layer is

Ŵ(k+1) = Ŵ(k) − α_w(k+1) ĝ_w(Ŵ(k)),

where ĝ_w(Ŵ(k)) ∈ R^{p_w} is the normalized gradient approximation that uses the simultaneous perturbation vectors Δ^w(k+1) ∈ R^{p_w} and r^w(k+1) ∈ R^{p_w} to stimulate the weights of the hidden layer. Δ^w(k+1) and r^w(k+1) are, respectively, the last p_w components of the perturbation vectors Δ(k+1) and r(k+1). The gradient approximation is

ĝ_w(Ŵ(k)) = −e^T(k)[B^+(k) − B^−(k) + β_w(k+1)D̂_w(k)] r^w(k+1) / (cρ_w(k+1)),

where

B^+(k) = H(Ŵ(k) + cΔ^w(k+1), x(k)) V̂(k) ∈ R^m,
B^−(k) = H(Ŵ(k) − cΔ^w(k+1), x(k)) V̂(k) ∈ R^m,
D̂_w(k) = {H(Ŵ(k), [(B^+(k−1))^T, ..., (B^+(k−l))^T, 0, ..., 0]^T) − H(Ŵ(k), [(B^−(k−1))^T, ..., (B^−(k−l))^T, 0, ..., 0]^T)} V̂(k) ∈ R^m.

For simplicity of notation, we ignore the time subscripts of Δ^w(k+1), r^w(k+1), β_w(k+1), ρ_w(k+1), and α_w(k+1) in the remainder of the paper.

Remark  As for the output layer, our purpose is to find the desired hidden layer weight vector W* of the gradient equation ĝ(Ŵ(k)) = ∂L(Ŵ(k))/∂Ŵ(k) = 0. The detailed mathematical deduction is given in the proof of Lemma . B^+(k) − B^−(k) is the result of the perturbations of the feedforward signals, and D̂_w(k) is the result of the perturbations of the recurrent signals in the RNN. The function of β_w is to automatically switch off the recurrent contribution when the convergence and stability requirements are violated. The conditions for the key parameters of the hidden layer are given after Theorem  below. A small code sketch of the perturbed terms B^+(k) and B^−(k) follows.
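As an illustration (ours, not the paper's code), the perturbed output terms B^+(k) and B^−(k) can be computed as below, assuming the hidden weight matrix Ŵ(k) is stored in matrix form and the perturbation Δ^w is reshaped accordingly.

```python
import numpy as np

def hidden_perturbations(W_hat, V_hat, x, delta_w, c, lam=1.0):
    """B+(k) and B-(k): RNN outputs with the hidden weights perturbed by +/- c*Delta^w.

    W_hat   : (n_h, n_I) hidden layer weight matrix
    V_hat   : (n_h, m)   output layer weight matrix
    x       : (n_I,)     current input vector x(k)
    delta_w : (p_w,)     hidden layer perturbation vector, p_w = n_h * n_I
    """
    D = delta_w.reshape(W_hat.shape)                              # perturbation in matrix form
    h_plus = 1.0 / (1.0 + np.exp(-lam * ((W_hat + c * D) @ x)))   # perturbed hidden activations
    h_minus = 1.0 / (1.0 + np.exp(-lam * ((W_hat - c * D) @ x)))
    B_plus = V_hat.T @ h_plus                                     # H(W + c*Delta^w, x(k)) V(k)
    B_minus = V_hat.T @ h_minus                                   # H(W - c*Delta^w, x(k)) V(k)
    return B_plus, B_minus
```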

Lemma  The normalized gradient approximation ĝ_w(Ŵ(k)) above is an equivalent presentation of the standard SPSA algorithm.

Proof Similar to the definitions of ŷ^{v+}(V̂(k)) and ŷ^{v−}(V̂(k)), we define two perturbation functions ŷ^{w+}(Ŵ(k)) and ŷ^{w−}(Ŵ(k)), i.e.,

ŷ^{w+}(Ŵ(k)) = H(Ŵ(k) + cΔ^w, x(k)) V̂(k) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k).

From the definition of B^+(k),

ŷ^{w+}(Ŵ(k)) = H(Ŵ(k) + cΔ^w, x(k)) V̂(k) + H(Ŵ(k), x(k) + Δx(k+1)) V̂(k)
            = B^+(k) + H(Ŵ(k), [ŷ^T(k−1), ..., ŷ^T(k−l), 0, ..., 0]^T + Δx(k+1)) V̂(k)
            = B^+(k) + H(Ŵ(k), [V̂^T(k−1)H^T(Ŵ(k−1) + cΔ^w, x(k−1)), ..., V̂^T(k−l)H^T(Ŵ(k−l) + cΔ^w, x(k−l)), 0, ..., 0]^T) V̂(k)
            ≈ B^+(k) + H(Ŵ(k), [(B^+(k−1))^T, ..., (B^+(k−l))^T, 0, ..., 0]^T) V̂(k).

From the definition of B^−(k), and similarly, we get

ŷ^{w−}(Ŵ(k)) ≈ B^−(k) + H(Ŵ(k), [(B^−(k−1))^T, ..., (B^−(k−l))^T, 0, ..., 0]^T) V̂(k).

Then we have

ŷ^{w+}(Ŵ(k)) − ŷ^{w−}(Ŵ(k)) ≈ B^+(k) − B^−(k) + D̂_w(k).

D̂_w(k) is the result of the recurrent training signal, as shown in Fig. . To guarantee weight convergence and system stability during training, we add an adaptive recurrent hybrid parameter β_w(k+1) to control the recurrent learning signal path. That is,

ŷ^{w+}(Ŵ(k)) − ŷ^{w−}(Ŵ(k)) ≈ B^+(k) − B^−(k) + β_w D̂_w(k).

Accordingly, we can get

ĝ_w(Ŵ(k)) = [L(Ŵ(k) + cΔ^w, V̂(k)) − L(Ŵ(k) − cΔ^w, V̂(k)) + δ_w(k)] r^w / (cρ_w)
          = [‖y(k) − ŷ^{w+}(k)‖² − ‖y(k) − ŷ^{w−}(k)‖² + δ_w(k)] r^w / (cρ_w)
          = {[y(k) − ŷ^{w+}(k) + y(k) − ŷ^{w−}(k)]^T [ŷ^{w−}(k) − ŷ^{w+}(k)] + δ_w(k)} r^w / (cρ_w)
          = [y(k) − ŷ(k) + ε(k)]^T [ŷ^{w−}(k) − ŷ^{w+}(k)] r^w / (cρ_w)
          = e^T(k) [ŷ^{w−}(k) − ŷ^{w+}(k)] r^w / (cρ_w)
          ≈ −e^T(k)[B^+(k) − B^−(k) + β_w D̂_w(k)] r^w / (cρ_w),

where the equivalent disturbance δ_w(k) is defined as

δ_w(k) = [ŷ^{w−}(Ŵ(k)) − ŷ^{w+}(Ŵ(k))]^T ε(k) + [ŷ^{w−}(Ŵ(k)) − ŷ^{w+}(Ŵ(k))]^T [ŷ^{w+}(Ŵ(k)) + ŷ^{w−}(Ŵ(k)) − 2ŷ(k)].

The fourth equality above reveals the relationship between δ_w(k) and ε(k). For the stability analysis, the training error will be linked to the parameter estimation error of the hidden layer and the disturbance. According to the mean value theorem and the fact that the activation function is nondecreasing, it can readily be shown that there exist unique positive mean values µ_j(k) such that

h(W*_{j,:}x(k)) − h(Ŵ_{j,:}(k)x(k)) = µ_j(k)(W*_{j,:} − Ŵ_{j,:}(k))x(k) = µ_j(k)W̃_{j,:}(k)x(k),

where Ŵ_{j,:}(k), W*_{j,:}, and W̃_{j,:}(k) are the estimated weight, ideal weight, and weight estimation error vectors linked to the jth hidden layer neuron, respectively. From this relation and the block-diagonal structure of H, it can be shown that

H̃(Ŵ(k), W*, x(k)) V̂(k) = [H(W*, x(k)) − H(k)] V̂(k) = Ω̃(k) W̃(k),

where Ω̃(k) ∈ R^{m×p_w} is defined as

Ω̃(k) = [ µ_1(k)V̂_1(k)x^T(k)             ...   µ_{n_h}(k)V̂_{n_h}(k)x^T(k)
          µ_1(k)V̂_{n_h+1}(k)x^T(k)       ...   µ_{n_h}(k)V̂_{2n_h}(k)x^T(k)
          ...
          µ_1(k)V̂_{(m−1)n_h+1}(k)x^T(k)  ...   µ_{n_h}(k)V̂_{p_v}(k)x^T(k) ]
       = [ µ_1(k)V̂^T_{1,:}(k)x^T(k), ..., µ_{n_h}(k)V̂^T_{n_h,:}(k)x^T(k) ],

where V̂^T_{j,:}(k) is the transpose of the jth row of the estimated output layer weight matrix V̂(k). From the activation function definition, the maximum value of the derivative ḣ_j(k) = ∂h(Ŵ_{j,:}(k)x(k))/∂(Ŵ_{j,:}(k)x(k)) is λ, and hence 0 ≤ µ_j(k) ≤ λ (1 ≤ j ≤ n_h). We can rewrite the estimation error as

e(k) = y(k) − ŷ(k) + ε(k)
     = H(W*, x(k))V* − H(W*, x(k))V̂(k) + H(W*, x(k))V̂(k) − H(k)V̂(k) + ε(k)
     = H̃(Ŵ(k), W*, x(k))V̂(k) + H(W*, x(k))Ṽ(k) + ε(k)
     = φ_w(k) + ẽ_w(k),

where

φ_w(k) = H̃(Ŵ(k), W*, x(k))V̂(k) = Ω̃(k)W̃(k),

and the bounded equivalent disturbance ẽ_w(k) of the hidden layer is defined as

ẽ_w(k) = H(W*, x(k))Ṽ(k) + ε(k).

Note that ẽ_w(k) is a bounded signal because Ṽ(k) has been proven to be bounded in Section III. From the definitions of B^+(k) and B^−(k), we can get

[B^+(k) − B^−(k)] / c = Ω̂(k)Δ^w,

where Ω̂(k) ∈ R^{m×p_w} is similar to Ω̃(k) except with different mean values.

Theorem  If the hidden layer of the RNN is trained by the adaptive RRSPSA algorithm, the weight Ŵ(k) is guaranteed to be convergent in the sense of the Lyapunov function, ‖W̃(k+1)‖² − ‖W̃(k)‖² ≤ 0 for all k, with W̃(k) = W* − Ŵ(k), and the training error will be L-stable in the sense of ẽ_w(k) ∈ L and φ_w(k) ∈ L.

Proof See Section .

The three adaptive parameters of the hidden layer are as follows.

1. The normalization factor is defined as

ρ_w(k+1) = τρ_w(k) + max{ 1 + ᾱ_w M(k)‖r^w‖² / ϒ(k), ρ̄ },

where

M(k) = ‖B^+(k) − B^−(k) + β_w D̂_w(k)‖² / c²,
ϒ(k) = (λ_min/λ)[ p_w + λc β_w (r^w)^T (ηI + Ω̄^T(k)Ω̄(k))^{−1} Ω̄^T(k) D̂_w(k) ],

with 0 < ᾱ_w a positive constant, λ the maximum value of the derivative ḣ_j(k) of the activation function, and λ_min ≥ 0 the minimum nonzero value of the mean value µ_j(k), which satisfies 0 ≤ µ_j(k) ≤ λ (1 ≤ j ≤ n_h), and

Ω̄(k) = [ V̂_1(k)x^T(k)              ...   V̂_{n_h}(k)x^T(k)
          V̂_{n_h+1}(k)x^T(k)        ...   V̂_{2n_h}(k)x^T(k)
          ...
          V̂_{(m−1)n_h+1}(k)x^T(k)   ...   V̂_{p_v}(k)x^T(k) ]
       = [ V̂^T_{1,:}(k)x^T(k), ..., V̂^T_{n_h,:}(k)x^T(k) ] ∈ R^{m×p_w},

where V̂_j(k) is the jth element of V̂(k), 1 ≤ j ≤ p_v.

2. The adaptive dead zone learning rate of the hidden layer is defined as

α_w(k+1) = ᾱ_w  if ‖e(k)‖ ≥ Γ(k),
α_w(k+1) = 0    if ‖e(k)‖ < Γ(k),

with Γ(k) = ẽ_w^m(k) / sqrt(1 − ᾱ_w ‖r^w‖² M(k) / (ρ_w ϒ(k))) and ẽ_w^m(k) being the maximum value of ẽ_w(k).

3. The recurrent hybrid adaptive parameter of the hidden layer, β_w, is defined as

β_w(k+1) = 1  if (r^w)^T χ(k) ≥ 0,
β_w(k+1) = 0  if (r^w)^T χ(k) < 0,

with χ(k) = (ηI + Ω̄^T(k)Ω̄(k))^{−1} Ω̄^T(k) D̂_w(k). β_w(k+1) automatically cuts off the recurrent training signal (in the lower right corner of Fig. ) whenever the weight convergence and stability conditions of the RNN are violated, as shown in Theorem  of Section IV.

Remark  Summary of the RRSPSA algorithm for RNNs (a schematic code sketch of this loop follows the list):

1. Set a maximum training step as the stop criterion;
2. Initialize the weights of the RNN;
3. Form the new input x(k) as defined earlier;
4. Perform the forward computation and calculate the output ŷ(k) and the error e(k) with the input and the current estimated weights;
5. Calculate ρ_v(k+1), α_v(k+1), and β_v(k+1);
6. Update the output layer weights V̂(k+1) via the output layer learning law;
7. Calculate ρ_w(k+1), α_w(k+1), and β_w(k+1);
8. Update the hidden layer weights Ŵ(k+1) via the hidden layer learning law;
9. Go back to step 3 to continue the iteration until the stop criterion is reached.
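The following heavily simplified, self-contained Python sketch (our own illustration) follows the nine steps above but replaces the adaptive parameters ρ, α, and β of Sections III and IV with a constant learning rate and a fixed dead-zone threshold, keeping only the two-measurement SPSA-style updates per layer; it sketches the structure of the loop, not the full RRSPSA algorithm.

```python
import numpy as np

def train_rrspsa_sketch(inputs, targets, n_h, l, c=0.01, alpha=0.1, dead_zone=1e-3, steps=1000):
    """Simplified RRSPSA-style training loop; step numbers refer to the summary above.

    inputs, targets : arrays of shape (N, m) with external inputs u(k) and desired outputs y(k)
    """
    m = targets.shape[1]
    n_I = (l + 1) * m
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((n_h, n_I))          # step 2: hidden layer weights
    V = 0.1 * rng.standard_normal((n_h, m))            # step 2: output layer weights
    y_hist = [np.zeros(m) for _ in range(l)]

    def forward(Wm, Vm, x):
        h = 1.0 / (1.0 + np.exp(-Wm @ x))              # sigmoid hidden activations
        return Vm.T @ h

    def loss(Wm, Vm, x, y):
        return float(np.sum((y - forward(Wm, Vm, x)) ** 2))

    for k in range(steps):                              # step 1: stop criterion
        u, y = inputs[k % len(inputs)], targets[k % len(targets)]
        x = np.concatenate(y_hist + [u])                # step 3: form the input x(k)
        e = y - forward(W, V, x)                        # step 4: output and error
        if np.linalg.norm(e) >= dead_zone:              # crude stand-in for the dead zone (steps 5, 7)
            dV = rng.choice([-1.0, 1.0], size=V.shape)  # steps 5-6: output layer SPSA-style update
            gV = (loss(W, V + c * dV, x, y) - loss(W, V - c * dV, x, y)) / (2.0 * c)
            V = V - alpha * gV * dV
            dW = rng.choice([-1.0, 1.0], size=W.shape)  # steps 7-8: hidden layer SPSA-style update
            gW = (loss(W + c * dW, V, x, y) - loss(W - c * dW, V, x, y)) / (2.0 * c)
            W = W - alpha * gW * dW
        y_hist = ([forward(W, V, x)] + y_hist)[:l]      # step 9: shift the output history
    return W, V
```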

Examples

Time series prediction

Sunspots are dark blotches on the sun caused by magnetic storms on its surface. The underlying mechanism of sunspot appearances is not exactly known. The sunspot data are a classical example of a combination of periodic and chaotic phenomena; they have served as a benchmark in the statistics literature on time series, and much work has been done in trying to analyze the sunspot data using neural networks [ ]. One set of the high-quality American sunspot numbers, the monthly sunspot numbers from December to October 00 in the database, is used as an example to minimize unnecessary human measurement errors. The total number of American sunspot data points in the monthly database is . The output sequence is scaled by dividing by 00 to get it into a range suitable for the network. The hidden neuron number of the RNN is ten. The selected learning rates for BP and RTRL are 0.0 and 0.0, respectively. The perturbation step size for RRSPSA is c = 0.0. Fig.  shows the outputs of the plant y*(k) and the estimated outputs ŷ(k) of the RNN trained by the three different methods. Table  gives the comparison results of the three methods in terms of the mean squared error (MSE) and standard deviation (STD).

Table  Comparison of the mean squared error and standard deviation of the training error.
        RRSPSA   RTRL   BP
MSE     .0e      .0e    .0e
STD     .e       .e     .e

Michaelis-Menten equation

Consider the continuous-time Michaelis-Menten equation [ ]

ẏ(t) = −0. y(t) / (1 + y(t)) + u(t).

Given a unit impulse as input u(t), a fourth-order Runge-Kutta algorithm is used to simulate this model with integral step size Δt = 0.0, and 000 equi-spaced samples are obtained from the input and output with a sampling interval of T = 0.0 time units. The measurement noise is white Gaussian noise with mean 0 and variance 0.0. An RNN is utilized to emulate the dynamics of the given system and to approximate the system output. The gradient function of this model has many local minima, and the global minimum has a very narrow domain of attraction [ ]. We compare the performance of RRSPSA with that of both the BP and RTRL training algorithms. The hidden neuron number of the RNN in this example is selected as ten. The fixed learning rates for BP and RTRL are 0.0 and 0.0, respectively. The perturbation size c = 0.0 is selected for RRSPSA. Fig.  shows the outputs of the plant y*(k) and the estimated outputs ŷ(k) of the RNN trained by the three different methods.
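Since the numerical constants of the Michaelis-Menten model above did not survive the digitisation, the following Python sketch (our own, with placeholder coefficients a and b and a crude impulse approximation) only illustrates how a fourth-order Runge-Kutta integration of such a scalar model and the subsequent equi-spaced sampling can be set up.

```python
import numpy as np

def michaelis_menten_rhs(y, u, a=0.5, b=1.0):
    # dy/dt = -a*y/(1+y) + b*u; a and b are placeholder values, not the paper's constants
    return -a * y / (1.0 + y) + b * u

def rk4_simulate(u_fn, y0=0.0, dt=0.01, n_steps=4000, sample_every=5):
    """Fourth-order Runge-Kutta integration, sampled every `sample_every` steps."""
    y, samples = y0, []
    for i in range(n_steps):
        t = i * dt
        u = u_fn(t)
        k1 = michaelis_menten_rhs(y, u)
        k2 = michaelis_menten_rhs(y + 0.5 * dt * k1, u)
        k3 = michaelis_menten_rhs(y + 0.5 * dt * k2, u)
        k4 = michaelis_menten_rhs(y + dt * k3, u)
        y += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        if i % sample_every == 0:
            samples.append(y)
    return np.array(samples)

# unit impulse approximated by a short rectangular pulse of unit area at t = 0
impulse = lambda t: 100.0 if t < 0.01 else 0.0
outputs = rk4_simulate(impulse)
```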

Table  Comparison of the mean squared error and standard deviation of the error.
        RRSPSA   RTRL   BP
MSE     .e       .e     .e
STD     .e       .e     .e

Table  shows the comparison results of the three methods in terms of MSE and STD. The simulation results show that the BP training algorithm performs worst compared with the other methods. The reason is that the second partial derivatives, such as those appearing in the recurrent terms above, are not considered in BP training. For RTRL, the slower convergence compared with RRSPSA is mainly caused by the small fixed learning rate and the absence of the recurrent hybrid adaptive parameter that guarantees weight convergence in the presence of noise. RRSPSA utilizes the three specifically designed adaptive parameters to maximize training speed for the recurrent training signal while exhibiting certain weight convergence properties with only two objective function measurements. The combination of the three adaptive parameters makes RRSPSA converge fast with the maximum effective learning rate and guaranteed weight convergence.

Conclusion

We have proposed a robust recurrent simultaneous perturbation stochastic approximation (RRSPSA) algorithm under the framework of deterministic systems with guaranteed weight convergence. Compared with FNNs, RNNs are dynamic systems and contain recurrent connections in their structure. Considering the time dependence of the signals of RNNs, not only the weights at the current time step but also those at previous time steps must be perturbed, which makes the learning prone to divergence and the convergence analysis complicated to obtain. The key characteristic of the RRSPSA training algorithm is the use of a recurrent hybrid adaptive parameter together with an adaptive learning rate and the dead zone concept. The recurrent hybrid adaptive parameter can automatically switch off the recurrent training signal whenever the weight convergence and stability conditions are violated (see Fig.  and the explanations in Sections III and IV). There are several advantages to training RNNs with the RRSPSA algorithm. First, RRSPSA is relatively easy to implement compared with other RNN training methods such as BPTT, due to its RTRL type of training and its simplicity. Second, RRSPSA provides an excellent convergence property based on simultaneous perturbation of the weights, similar to the standard SPSA. Finally, RRSPSA has the capability to improve RNN training speed with guaranteed weight convergence over the standard RTRL algorithm by using an adaptive learning method. Robust stability and weight convergence proofs of the RRSPSA algorithm are provided based on a Lyapunov function.

Proof of Theorem 

Proof According to the training algorithm for the output layer, we can get

Ṽ(k+1) = Ṽ(k) + α_v ĝ_v(V̂(k)),

where ĝ_v(V̂(k)) is defined in Section III. Then we have

‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² = 2α_v ĝ_v^T(V̂(k)) Ṽ(k) + α_v² ‖ĝ_v(V̂(k))‖²
  ≈ −2α_v e^T(k)[H(k)cΔ^v + β_v A(k)] (r^v)^T Ṽ(k) / (cρ_v) + α_v² ‖ĝ_v(V̂(k))‖².

Splitting the right-hand side into the feedforward part e^T(k)H(k)Δ^v(r^v)^TṼ(k) and the recurrent part β_v e^T(k)A(k)(r^v)^TṼ(k), replacing H(k)Ṽ(k) with φ_v(k) as defined in Section III, and expressing the recurrent part through the term (r^v)^T(ηI + H^T(k)H(k))^{−1}H^T(k)A(k), the difference of the Lyapunov function can be written in terms of e^T(k)φ_v(k), p_v, β_v, and Y(k). A small perturbation parameter η is added to approximate [H^T(k)H(k)]^{−1} and to guarantee full rank of the inverted matrix [ηI + H^T(k)H(k)], which is similar to a general

nonlinear iteration algorithm [ ]. Using e^T(k)ẽ_v(k) ≤ ½(‖ẽ_v(k)‖² + ‖e(k)‖²) together with e(k) = φ_v(k) + ẽ_v(k) and the bound ‖α_v ĝ_v(V̂(k))‖² ≤ α_v² Z(k)‖e(k)‖²/ρ_v², it follows that

‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ (α_v/ρ_v){p_v + 0.5β_v Y(k)} ‖ẽ_v(k)‖² − (α_v/ρ_v){p_v + 0.5β_v Y(k)} [1 − α_v Z(k)/(ρ_v(p_v + 0.5β_v Y(k)))] ‖e(k)‖²,

where Y(k) and Z(k) are defined in Section III. The dimension p_v of the output layer weight vector is positive; ᾱ_v is a fixed positive scalar; β_v Y(k) ≥ 0 by the definition of the recurrent hybrid adaptive parameter β_v(k+1); and ρ_v ≥ ρ̄ > 0 by the definition of the normalization factor. The adaptive learning rate α_v(k+1) implies that the recurrent training of RRSPSA uses the value ᾱ_v only if the output error of the RNN is larger than the predefined dead zone value and is zero otherwise, which guarantees that ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² is non-positive and that ‖Ṽ(k)‖² is not increasing. The limit of ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² therefore exists, and the term

{‖ẽ_v(k)‖² − [1 − α_v Z(k)/(ρ_v(p_v + 0.5β_v Y(k)))] ‖e(k)‖²}

goes to zero. Then we finally have ‖Ṽ(k+1)‖² − ‖Ṽ(k)‖² ≤ 0.

Proof of Theorem 

Proof Consider that SPSA is an approximation of the gradient algorithm [ ]. Using the property of local minimum points of the gradient,

(∂(e^T(k)e(k)) / ∂Ŵ(k))^T W̃(k) ≤ 0,

[ ], we have

0 ≤ −(∂(e^T(k)e(k)) / ∂Ŵ(k))^T W̃(k)
  = Σ_{i=1}^{n_h} { ḣ_i(k) e^T(k) V̂^T_{i,:}(k) x^T(k) W̃_{i,:}(k) }
  ≤ (λ/λ_min) Σ_{i=1}^{n_h} { µ_i(k) e^T(k) V̂^T_{i,:}(k) x^T(k) W̃_{i,:}(k) }
  = (λ/λ_min) e^T(k) Ω̃(k) W̃(k)
  = (λ/λ_min) e^T(k) φ_w(k),

where λ is the maximum value of the derivative ḣ_j(k) of the activation function and λ_min ≥ 0 is the minimum nonzero value of the mean value µ_j(k). (Note that the inequality is always true if λ_min = 0, which implies ḣ_j(k) = µ_j(k) = 0.) According to the training algorithm for the hidden layer, we get

W̃(k+1) = W̃(k) + α_w ĝ_w(Ŵ(k)),

where ĝ_w(Ŵ(k)) is defined in Section IV. Using the relations above and the definition of Ω̄(k), we get

‖W̃(k+1)‖² − ‖W̃(k)‖² = 2α_w ĝ_w^T(Ŵ(k)) W̃(k) + α_w² ‖ĝ_w(Ŵ(k))‖²
  ≈ −2α_w e^T(k)[B^+(k) − B^−(k) + β_w D̂_w(k)] (r^w)^T W̃(k) / (cρ_w) + α_w² ‖ĝ_w(Ŵ(k))‖².


More information

Lecture 5: Logistic Regression. Neural Networks

Lecture 5: Logistic Regression. Neural Networks Lecture 5: Logistic Regression. Neural Networks Logistic regression Comparison with generative models Feed-forward neural networks Backpropagation Tricks for training neural networks COMP-652, Lecture

More information

Introduction to Neural Networks

Introduction to Neural Networks Introduction to Neural Networks What are (Artificial) Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Multi-layer Neural Networks

Multi-layer Neural Networks Multi-layer Neural Networks Steve Renals Informatics 2B Learning and Data Lecture 13 8 March 2011 Informatics 2B: Learning and Data Lecture 13 Multi-layer Neural Networks 1 Overview Multi-layer neural

More information

Multilayer Neural Networks

Multilayer Neural Networks Multilayer Neural Networks Introduction Goal: Classify objects by learning nonlinearity There are many problems for which linear discriminants are insufficient for minimum error In previous methods, the

More information

Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling

Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling ISSN 746-7233, England, UK World Journal of Modelling and Simulation Vol. 3 (2007) No. 4, pp. 289-298 Acceleration of Levenberg-Marquardt method training of chaotic systems fuzzy modeling Yuhui Wang, Qingxian

More information

A Strict Stability Limit for Adaptive Gradient Type Algorithms

A Strict Stability Limit for Adaptive Gradient Type Algorithms c 009 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional A Strict Stability Limit for Adaptive Gradient Type Algorithms

More information

Advanced Methods for Recurrent Neural Networks Design

Advanced Methods for Recurrent Neural Networks Design Universidad Autónoma de Madrid Escuela Politécnica Superior Departamento de Ingeniería Informática Advanced Methods for Recurrent Neural Networks Design Master s thesis presented to apply for the Master

More information

inear Adaptive Inverse Control

inear Adaptive Inverse Control Proceedings of the 36th Conference on Decision & Control San Diego, California USA December 1997 inear Adaptive nverse Control WM15 1:50 Bernard Widrow and Gregory L. Plett Department of Electrical Engineering,

More information

6.4 Kalman Filter Equations

6.4 Kalman Filter Equations 6.4 Kalman Filter Equations 6.4.1 Recap: Auxiliary variables Recall the definition of the auxiliary random variables x p k) and x m k): Init: x m 0) := x0) S1: x p k) := Ak 1)x m k 1) +uk 1) +vk 1) S2:

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Adaptive Dual Control

Adaptive Dual Control Adaptive Dual Control Björn Wittenmark Department of Automatic Control, Lund Institute of Technology Box 118, S-221 00 Lund, Sweden email: bjorn@control.lth.se Keywords: Dual control, stochastic control,

More information

IN neural-network training, the most well-known online

IN neural-network training, the most well-known online IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 1, JANUARY 1999 161 On the Kalman Filtering Method in Neural-Network Training and Pruning John Sum, Chi-sing Leung, Gilbert H. Young, and Wing-kay Kan

More information

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units Connectionist Models Consider humans: Neuron switching time ~ :001 second Number of neurons ~ 10 10 Connections per neuron ~ 10 4 5 Scene recognition time ~ :1 second 100 inference steps doesn't seem like

More information

Back-propagation as reinforcement in prediction tasks

Back-propagation as reinforcement in prediction tasks Back-propagation as reinforcement in prediction tasks André Grüning Cognitive Neuroscience Sector S.I.S.S.A. via Beirut 4 34014 Trieste Italy gruening@sissa.it Abstract. The back-propagation (BP) training

More information

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing

A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 5, SEPTEMBER 2001 1215 A Cross-Associative Neural Network for SVD of Nonsquared Data Matrix in Signal Processing Da-Zheng Feng, Zheng Bao, Xian-Da Zhang

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward

ECE 471/571 - Lecture 17. Types of NN. History. Back Propagation. Recurrent (feedback during operation) Feedforward ECE 47/57 - Lecture 7 Back Propagation Types of NN Recurrent (feedback during operation) n Hopfield n Kohonen n Associative memory Feedforward n No feedback during operation or testing (only during determination

More information

A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK WITH INTERVAL TIME-VARYING DELAYS

A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK WITH INTERVAL TIME-VARYING DELAYS ICIC Express Letters ICIC International c 2009 ISSN 1881-80X Volume, Number (A), September 2009 pp. 5 70 A DELAY-DEPENDENT APPROACH TO DESIGN STATE ESTIMATOR FOR DISCRETE STOCHASTIC RECURRENT NEURAL NETWORK

More information

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016

Neural Networks. Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Neural Networks Yan Shao Department of Linguistics and Philology, Uppsala University 7 December 2016 Outline Part 1 Introduction Feedforward Neural Networks Stochastic Gradient Descent Computational Graph

More information

Single layer NN. Neuron Model

Single layer NN. Neuron Model Single layer NN We consider the simple architecture consisting of just one neuron. Generalization to a single layer with more neurons as illustrated below is easy because: M M The output units are independent

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial () if its output depends on (is a nonincreasing function of) the distance of the input from a given stored vector. s represent local receptors, as illustrated

More information

Model Reference Adaptive Control for Multi-Input Multi-Output Nonlinear Systems Using Neural Networks

Model Reference Adaptive Control for Multi-Input Multi-Output Nonlinear Systems Using Neural Networks Model Reference Adaptive Control for MultiInput MultiOutput Nonlinear Systems Using Neural Networks Jiunshian Phuah, Jianming Lu, and Takashi Yahagi Graduate School of Science and Technology, Chiba University,

More information

Analyzing the weight dynamics of recurrent learning algorithms

Analyzing the weight dynamics of recurrent learning algorithms Analyzing the weight dynamics of recurrent learning algorithms Ulf D. Schiller and Jochen J. Steil Neuroinformatics Group, Faculty of Technology, Bielefeld University Abstract We provide insights into

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks 鮑興國 Ph.D. National Taiwan University of Science and Technology Outline Perceptrons Gradient descent Multi-layer networks Backpropagation Hidden layer representations Examples

More information

A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS

A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS A STATE-SPACE NEURAL NETWORK FOR MODELING DYNAMICAL NONLINEAR SYSTEMS Karima Amoura Patrice Wira and Said Djennoune Laboratoire CCSP Université Mouloud Mammeri Tizi Ouzou Algeria Laboratoire MIPS Université

More information

Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network

Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network LETTER Communicated by Geoffrey Hinton Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network Xiaohui Xie xhx@ai.mit.edu Department of Brain and Cognitive Sciences, Massachusetts

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:

More information

Reservoir Computing and Echo State Networks

Reservoir Computing and Echo State Networks An Introduction to: Reservoir Computing and Echo State Networks Claudio Gallicchio gallicch@di.unipi.it Outline Focus: Supervised learning in domain of sequences Recurrent Neural networks for supervised

More information

Course 395: Machine Learning - Lectures

Course 395: Machine Learning - Lectures Course 395: Machine Learning - Lectures Lecture 1-2: Concept Learning (M. Pantic) Lecture 3-4: Decision Trees & CBC Intro (M. Pantic & S. Petridis) Lecture 5-6: Evaluating Hypotheses (S. Petridis) Lecture

More information

EM-algorithm for Training of State-space Models with Application to Time Series Prediction

EM-algorithm for Training of State-space Models with Application to Time Series Prediction EM-algorithm for Training of State-space Models with Application to Time Series Prediction Elia Liitiäinen, Nima Reyhani and Amaury Lendasse Helsinki University of Technology - Neural Networks Research

More information

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box ImageNet Classification with Deep Convolutional Neural Networks Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton Motivation Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption

Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption Application of Artificial Neural Networks in Evaluation and Identification of Electrical Loss in Transformers According to the Energy Consumption ANDRÉ NUNES DE SOUZA, JOSÉ ALFREDO C. ULSON, IVAN NUNES

More information

Artificial Neural Networks. Edward Gatt

Artificial Neural Networks. Edward Gatt Artificial Neural Networks Edward Gatt What are Neural Networks? Models of the brain and nervous system Highly parallel Process information much more like the brain than a serial computer Learning Very

More information

Artificial Neural Network : Training

Artificial Neural Network : Training Artificial Neural Networ : Training Debasis Samanta IIT Kharagpur debasis.samanta.iitgp@gmail.com 06.04.2018 Debasis Samanta (IIT Kharagpur) Soft Computing Applications 06.04.2018 1 / 49 Learning of neural

More information

Neural Network Identification of Non Linear Systems Using State Space Techniques.

Neural Network Identification of Non Linear Systems Using State Space Techniques. Neural Network Identification of Non Linear Systems Using State Space Techniques. Joan Codina, J. Carlos Aguado, Josep M. Fuertes. Automatic Control and Computer Engineering Department Universitat Politècnica

More information

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE

LECTURE NOTE #NEW 6 PROF. ALAN YUILLE LECTURE NOTE #NEW 6 PROF. ALAN YUILLE 1. Introduction to Regression Now consider learning the conditional distribution p(y x). This is often easier than learning the likelihood function p(x y) and the

More information

Introduction to Machine Learning Spring 2018 Note Neural Networks

Introduction to Machine Learning Spring 2018 Note Neural Networks CS 189 Introduction to Machine Learning Spring 2018 Note 14 1 Neural Networks Neural networks are a class of compositional function approximators. They come in a variety of shapes and sizes. In this class,

More information

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization

Global stabilization of feedforward systems with exponentially unstable Jacobian linearization Global stabilization of feedforward systems with exponentially unstable Jacobian linearization F Grognard, R Sepulchre, G Bastin Center for Systems Engineering and Applied Mechanics Université catholique

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A Hybrid Time-delay Prediction Method for Networked Control System

A Hybrid Time-delay Prediction Method for Networked Control System International Journal of Automation and Computing 11(1), February 2014, 19-24 DOI: 10.1007/s11633-014-0761-1 A Hybrid Time-delay Prediction Method for Networked Control System Zhong-Da Tian Xian-Wen Gao

More information

Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network

Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network Convergence of Hybrid Algorithm with Adaptive Learning Parameter for Multilayer Neural Network Fadwa DAMAK, Mounir BEN NASR, Mohamed CHTOUROU Department of Electrical Engineering ENIS Sfax, Tunisia {fadwa_damak,

More information

Radial-Basis Function Networks

Radial-Basis Function Networks Radial-Basis Function etworks A function is radial basis () if its output depends on (is a non-increasing function of) the distance of the input from a given stored vector. s represent local receptors,

More information

Dual Estimation and the Unscented Transformation

Dual Estimation and the Unscented Transformation Dual Estimation and the Unscented Transformation Eric A. Wan ericwan@ece.ogi.edu Rudolph van der Merwe rudmerwe@ece.ogi.edu Alex T. Nelson atnelson@ece.ogi.edu Oregon Graduate Institute of Science & Technology

More information

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C.

Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Condensed Table of Contents for Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control by J. C. Spall John Wiley and Sons, Inc., 2003 Preface... xiii 1. Stochastic Search

More information

Further Results on Model Structure Validation for Closed Loop System Identification

Further Results on Model Structure Validation for Closed Loop System Identification Advances in Wireless Communications and etworks 7; 3(5: 57-66 http://www.sciencepublishinggroup.com/j/awcn doi:.648/j.awcn.735. Further esults on Model Structure Validation for Closed Loop System Identification

More information

y(x n, w) t n 2. (1)

y(x n, w) t n 2. (1) Network training: Training a neural network involves determining the weight parameter vector w that minimizes a cost function. Given a training set comprising a set of input vector {x n }, n = 1,...N,

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

Modeling and Analysis of Dynamic Systems

Modeling and Analysis of Dynamic Systems Modeling and Analysis of Dynamic Systems by Dr. Guillaume Ducard c Fall 2016 Institute for Dynamic Systems and Control ETH Zurich, Switzerland G. Ducard c 1 Outline 1 Lecture 9: Model Parametrization 2

More information

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data

Revisiting linear and non-linear methodologies for time series prediction - application to ESTSP 08 competition data Revisiting linear and non-linear methodologies for time series - application to ESTSP 08 competition data Madalina Olteanu Universite Paris 1 - SAMOS CES 90 Rue de Tolbiac, 75013 Paris - France Abstract.

More information

Kalman Filters with Uncompensated Biases

Kalman Filters with Uncompensated Biases Kalman Filters with Uncompensated Biases Renato Zanetti he Charles Stark Draper Laboratory, Houston, exas, 77058 Robert H. Bishop Marquette University, Milwaukee, WI 53201 I. INRODUCION An underlying assumption

More information

Linear Regression (9/11/13)

Linear Regression (9/11/13) STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter

More information

Lecture 4: Perceptrons and Multilayer Perceptrons

Lecture 4: Perceptrons and Multilayer Perceptrons Lecture 4: Perceptrons and Multilayer Perceptrons Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning Perceptrons, Artificial Neuronal Networks Lecture 4: Perceptrons

More information

Multilayer Perceptron

Multilayer Perceptron Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Single Perceptron 3 Boolean Function Learning 4

More information

Long-Short Term Memory

Long-Short Term Memory Long-Short Term Memory Sepp Hochreiter, Jürgen Schmidhuber Presented by Derek Jones Table of Contents 1. Introduction 2. Previous Work 3. Issues in Learning Long-Term Dependencies 4. Constant Error Flow

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks

Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Statistical Machine Learning (BE4M33SSU) Lecture 5: Artificial Neural Networks Jan Drchal Czech Technical University in Prague Faculty of Electrical Engineering Department of Computer Science Topics covered

More information

NONLINEAR PLANT IDENTIFICATION BY WAVELETS

NONLINEAR PLANT IDENTIFICATION BY WAVELETS NONLINEAR PLANT IDENTIFICATION BY WAVELETS Edison Righeto UNESP Ilha Solteira, Department of Mathematics, Av. Brasil 56, 5385000, Ilha Solteira, SP, Brazil righeto@fqm.feis.unesp.br Luiz Henrique M. Grassi

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global

Deep Linear Networks with Arbitrary Loss: All Local Minima Are Global homas Laurent * 1 James H. von Brecht * 2 Abstract We consider deep linear networks with arbitrary convex differentiable loss. We provide a short and elementary proof of the fact that all local minima

More information

4.0 Update Algorithms For Linear Closed-Loop Systems

4.0 Update Algorithms For Linear Closed-Loop Systems 4. Update Algorithms For Linear Closed-Loop Systems A controller design methodology has been developed that combines an adaptive finite impulse response (FIR) filter with feedback. FIR filters are used

More information

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition Last updated: Oct 22, 2012 LINEAR CLASSIFIERS Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class. Classification: Problem Statement 3 In regression, we are modeling the relationship

More information

Learning and Memory in Neural Networks

Learning and Memory in Neural Networks Learning and Memory in Neural Networks Guy Billings, Neuroinformatics Doctoral Training Centre, The School of Informatics, The University of Edinburgh, UK. Neural networks consist of computational units

More information

Learning Neural Networks

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex decision boundaries Variable size. Any boolean function can be represented. Hidden units can be interpreted as new features Deterministic

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

CHALMERS, GÖTEBORGS UNIVERSITET. EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD

CHALMERS, GÖTEBORGS UNIVERSITET. EXAM for ARTIFICIAL NEURAL NETWORKS. COURSE CODES: FFR 135, FIM 720 GU, PhD CHALMERS, GÖTEBORGS UNIVERSITET EXAM for ARTIFICIAL NEURAL NETWORKS COURSE CODES: FFR 135, FIM 72 GU, PhD Time: Place: Teachers: Allowed material: Not allowed: October 23, 217, at 8 3 12 3 Lindholmen-salar

More information

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang

C4 Phenomenological Modeling - Regression & Neural Networks : Computational Modeling and Simulation Instructor: Linwei Wang C4 Phenomenological Modeling - Regression & Neural Networks 4040-849-03: Computational Modeling and Simulation Instructor: Linwei Wang Recall.. The simple, multiple linear regression function ŷ(x) = a

More information

Computer Intensive Methods in Mathematical Statistics

Computer Intensive Methods in Mathematical Statistics Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of

More information

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions

Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Direct Method for Training Feed-forward Neural Networks using Batch Extended Kalman Filter for Multi- Step-Ahead Predictions Artem Chernodub, Institute of Mathematical Machines and Systems NASU, Neurotechnologies

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Least Mean Squares Regression

Least Mean Squares Regression Least Mean Squares Regression Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Lecture Overview Linear classifiers What functions do linear classifiers express? Least Squares Method

More information