Nonlinear system modeling with deep neural networks and autoencoders algorithm

Erick De la Rosa, Wen Yu
Departamento de Control Automatico, CINVESTAV-IPN, Mexico City, Mexico

Xiaoou Li
Departamento de Computacion, CINVESTAV-IPN, Mexico City, Mexico

Abstract: Deep learning techniques have been successfully used for pattern classification. These advanced methods have not yet been applied to nonlinear system identification. In this paper, the neural model has a deep architecture, which is obtained by a random search method. The initial weights of this deep neural model are obtained from the denoising autoencoders model. We propose special unsupervised learning methods for this deep learning model with input data. Normal supervised learning is used to train the weights with the output data. The deep learning identification algorithms are validated with three benchmark examples.

I. INTRODUCTION

System identification with neural networks involves two tasks: structure identification and parameter identification. Structure identification often uses trial-and-error approaches [1]. However, these algorithms do not significantly improve the identification accuracy, because they only search for the number of hidden neurons and do not address the number of hidden layers. In this paper, we use a deep structure for the neural model: we use fewer hidden neurons in each hidden layer while increasing the number of hidden layers. This strategy does not increase the complexity of the neural model but improves its generalization capacity.

Parameter identification is usually addressed by gradient descent variants. They may converge very slowly, and they usually suffer from the local minima problem. Since the identification error surface is unknown, the neural model can easily settle in a local minimum if its initial weights are not suitable.
There are some techniques to escape local minima in the error surface and bring the neural model near the global minimum, such as noise-shaping modification [2] and nonlinear clustering [3]. They do not solve the basic cause of the local minima problem: poor initial weights. In [4], the initial weights of a recurrent neural network are calculated by sensitivity ratio analysis; in [5], the initial weights are obtained by finding the support vectors of the input data. [6] has shown that deep learning has some capability to avoid local minima. In this paper, we use deep learning to find suitable initial weights for the neural model.

The deep structure of a neural network usually requires three or more hidden layers [7], and the depth of a neural model is its number of hidden layers. A deep neural network has the same structure as an MLP. The function approximation and the learning procedure also do not change. However, a deep neural network usually needs fewer parameters (weights) than an MLP [8]. On the other hand, increasing the number of hidden nodes causes an exponential increase in the number of model parameters, and also requires more training examples [9][10].

There is no method that finds the optimal structure of a deep model. [11] proposes some effective algorithms to find a suitable number of hidden layers and number of neurons in each layer. These methods can be classified into two categories: 1) grid search, which applies the learning algorithm over all possible combinations of the structure hyperparameters; 2) random search, which only uses a few combinations chosen by some sampling rules. [12] has shown that random search can obtain a structure complexity similar to that of grid search for deep neural models, while its computation cost is much lower. In this paper we use random search to find the structure of the deep neural model.

The denoising autoencoders method [13] and the restricted Boltzmann machines [14] are the main deep learning methods.
The autoencoders method encodes the input and undoes the effect of an input corruption process; it requires adding a stochastic corruption to the input. Restricted Boltzmann machines use energy-based learning models. Both are unsupervised learning methods. The results of [15] show that unsupervised pretraining can drive the neural model away from local minima in classification problems.

Time series forecasting can also use deep learning techniques [16]. The output of the predictive model is the current value, while the input consists of previous outputs. In [17], the denoising autoencoder is used as a pretraining stage, and the prediction results are better than those of methods without deep learning pretraining. [18] uses the RBM approach for pretraining. However, the hidden and visible units of the RBM model are binary, so the prediction results for continuous values are not as good. [19] points out that the denoising autoencoder method may not improve prediction results if the input from the time series is not sufficiently large. All of the above RBM-based time series forecasting methods use binary probability estimations, and their prediction results are not satisfactory for continuous time series.

Deep learning methods cannot be applied to system identification directly, because the input/output values are not binary as in classification problems. For example, the conditional probability transformation in the original restricted Boltzmann machines needs binary values [6]. In order to handle gray level pixels, [7] uses an integral instead of a sum to calculate the conditional probability. To the best of our knowledge, there are no results that apply deep learning to nonlinear system identification.

/16/$31.00 ©2016 IEEE

In this paper, we apply these two deep learning methods to the input data sets to obtain the initial weights of deep neural models. We extend the idea of [7] to non-positive values and $[0, \infty)$, so that it can be used for system identification. A deep neural model is first constructed for the unknown nonlinear system. Then we use the autoencoders model to design unsupervised learning with input data. The structure of the autoencoders model is that of the deep neural model. We use the weights trained by the deep learning methods as the initial weights of the deep neural model. Gradient descent supervised learning is then applied to train the weights of the deep neural model. Finally, two benchmark examples are used to show the effectiveness of our deep learning methods for nonlinear system identification.

II. NONLINEAR SYSTEM MODELING WITH DEEP NEURAL NETWORKS

Consider the following unknown discrete-time nonlinear system

$$x(k+1) = f[x(k), u(k)], \qquad y(k) = g[x(k)] \tag{1}$$

where $u(k) \in \mathbb{R}^{u}$ is the input vector, $x(k) \in \mathbb{R}^{x}$ is an internal state vector, and $y(k) \in \mathbb{R}^{m}$ is the output vector. $f$ and $g$ are general nonlinear smooth functions, $f, g \in C^{\infty}$. Let us now recall the following definitions. Denote

$$Y(k) = [y^T(k), y^T(k+1), \ldots, y^T(k+n-1)]^T, \qquad U(k) = [u^T(k), u^T(k+1), \ldots, u^T(k+n-2)]^T.$$

If $\partial Y / \partial x$ is non-singular at $x = 0$, $U = 0$, this leads to the NARMA model

$$y(k) = \Phi[x(k)] \tag{2}$$

where

$$x(k) = [y^T(k-1), y^T(k-2), \ldots, u^T(k), u^T(k-1), \ldots]^T,$$

$\Phi(\cdot)$ is an unknown nonlinear difference equation representing the plant dynamics, $u(k)$ and $y(k)$ are the measurable scalar input and output, and $d$ is the time delay. The nonlinear system (2) is a NARMA model.
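As an illustration of how the regressor $x(k)$ in (2) can be assembled from measured data, the following minimal sketch stacks past outputs with current and past inputs for scalar $u(k)$ and $y(k)$. The function name `build_regressors` and the lag counts `n_y` and `n_u` are assumptions introduced here for illustration; the text does not fix the lag orders at this point.

```python
import numpy as np

def build_regressors(u, y, n_y, n_u):
    """Return (X, T): each row of X is [y(k-1..k-n_y), u(k..k-n_u)], T holds y(k)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    start = max(n_y, n_u)  # first k for which every lagged sample exists
    rows = []
    for k in range(start, len(y)):
        past_y = y[k - n_y:k][::-1]      # y(k-1), ..., y(k-n_y)
        past_u = u[k - n_u:k + 1][::-1]  # u(k), u(k-1), ..., u(k-n_u)
        rows.append(np.concatenate((past_y, past_u)))
    return np.array(rows), y[start:]
```

Each row then has `n_y + n_u + 1` entries, and the first `max(n_y, n_u)` samples are consumed as initial lags.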
We can also regard the input of the nonlinear system as $x(k) = [x_1 \cdots x_n]^T \in \mathbb{R}^n$ and the output as $y(k) \in \mathbb{R}^m$. We now use the following multilayer neural network to identify the unknown nonlinear system (2):

$$\hat{y}(k) = \phi_p\left(W_p \phi_{p-1} \cdots W_3 \phi_2\{W_2 \phi_1[W_1 x(k) + b_1] + b_2\} \cdots + b_p\right) \tag{3}$$

where $\hat{y}(k) \in \mathbb{R}^m$ is the output of the neural model, $W_1 \in \mathbb{R}^{l_1 \times n}$, $b_1 \in \mathbb{R}^{l_1}$, $W_2 \in \mathbb{R}^{l_2 \times l_1}$, $b_2 \in \mathbb{R}^{l_2}$, $W_p \in \mathbb{R}^{m \times l_{p-1}}$, $b_p \in \mathbb{R}^m$, $p$ is the number of layers of the network, $l_i$ ($i = 1, \ldots, p-1$) are the node numbers in each layer, and $\phi_i \in \mathbb{R}^{l_i}$ ($i = 1, \ldots, p$) are vector activation functions. We use a linear function for the output layer, $\phi_p : \mathbb{R}^m \to \mathbb{R}^m$, i.e., $\phi_p = [\omega_1 \cdots \omega_m]^T$. The other layers use sigmoid functions

$$\phi_i(\omega_j) = \frac{\alpha_i}{1 + e^{-\beta_i^T \omega_j}} - \gamma_i$$

where $i = 1, \ldots, p-1$, $j = 1, \ldots, l_i$, $\alpha_i$, $\beta_i$, and $\gamma_i$ are positive constants defined beforehand, and $\omega_j$ are the input variables to the sigmoid functions.

From the Stone-Weierstrass theorem, we know that if the node number of a neural network with one hidden layer is large enough, the neural model can approximate the nonlinear function $\Phi$ to any degree of accuracy for all $x(k)$. Instead of increasing the node number $l_i$, in this paper we increase the layer number $p$. We use a deep structure, i.e., $p \geq 3$, for the multilayer neural model (3), so that we can use existing deep learning techniques for system identification.

The goal of the neural identification is to find a suitable structure (layer number $p$, node number in each layer $l_i$) and the weights ($W_1, \ldots, W_p$) such that the identification error

$$e(k) = \hat{y}(k) - y(k) \tag{4}$$

is minimized. [12] has shown that random search can obtain a model structure similar to that of grid search for the deep neural model ($p \geq 3$). In this paper, random search is used to find the hidden layer number $p$ and the neuron number in each layer $l_i$ ($i = 1, \ldots, p$).

The next job is to find suitable weights ($W_1, \ldots, W_p$). Many supervised learning techniques, such as gradient descent and Hessian-based methods, can be applied to train these weights. Gradient descent and its variants always reach a local minimum, whether quickly or slowly. Which local minimum is reached depends entirely on the initial conditions of the weights $W_1, \ldots, W_p$. The identification model structure using the deep learning techniques is shown in Fig. 1. The following sections show how to use deep learning techniques to find the initial weights $W_1(0), \ldots, W_p(0)$, and how to identify the nonlinear systems.

Fig. 1. The identification structure using deep learning techniques.

III. AUTOENCODERS METHOD FOR SYSTEM IDENTIFICATION

Although the autoencoders technique is designed for classification and de-noising, in this paper we modify it for system identification. Consider the input $x(k)$ of the system (2). It is first mapped to a hidden representation $h_1(k)$ by an encoder $\phi_1$. In this paper, we use the same weights and nonlinear activation function as in the identification model (3):

$$h_1(k) = \phi_1[W_1 x(k) + b_1] \tag{5}$$

Then the hidden representation, or code, $h_1(k)$ is mapped back to a reconstruction $z_1(k)$ by the decoder

$$z_1(k) = \phi_1[V_1 h_1(k) + c_1] \tag{6}$$

Here we use the same activation function as in (5). The size of $z_1(k)$ should be the same as that of $x(k)$. $z_1(k)$ can be interpreted as a prediction of $x(k)$ given the code $h_1(k)$, or as an input reconstruction. The weight matrix $V_1$ associated with the regressive mapping is then simplified as a transposed matrix: in the coding stage it is selected as $V_1 = W_1^T$.

For deep neural models, an autoencoder model described by (5) and (6) has to be implemented for each hidden layer in (3). The parameters of the autoencoders model $W_1$, $b_1$, and $c_1$ associated with layer 1 are trained to minimize the error between $z_1(k)$ and $x(k)$. For classification tasks with binary or normalized inputs, the following cross-entropy index is applied [7]:

$$J_1 = -\sum_{k=1}^{q} \left\{ x(k) \log[z_1(k)] + [1 - x(k)] \log[1 - z_1(k)] \right\} \tag{7}$$

where $q$ is the total number of training examples. For nonlinear system identification we use the following squared error cost function:

$$J_1(k) = \|x(k) - z_1(k)\|^2 \tag{8}$$

This change transforms the autoencoder model into a single-layer neural network. The weights $W_1$ are updated by the usual gradient descent method (backpropagation)

$$W_1(k+1) = W_1(k) - \eta_1 \frac{\partial J_1(k)}{\partial W_1(k)}$$

where $\eta_1 > 0$ is the learning rate, $k = 1, 2, \ldots, q$, and $q$ is the total number of training data. The thresholds $b_1$ and $c_1$ can be trained together with the weights $W_1$ by adding to the original input $x(k)$ an extra entry of value 1.

Fig. 2. The autoencoders model.

In order to make the learning process of the autoencoders model robust, a small noise is added to the input. In this paper, we add zero-mean white Gaussian noise:

$$\tilde{x}_1(k) = x_1(k) + \xi_1(k) \tag{9}$$

If the dimension of the input is very large ($n > 20$), the above method does not work well. In that case, as stated in [7], some entries of the input vector are randomly forced to zero, so that the input reconstruction process is robust.
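To make the gradient in the update of $W_1$ concrete, the following worked derivation is one way to write it out for the tied-weight case $V_1 = W_1^T$. It is a sketch under stated assumptions, not taken from the source: the encoder is assumed to receive the noisy input $\tilde{x} = x + \xi$ while the cost (8) compares the reconstruction with the clean $x$, $\beta$ is treated as a scalar, and the intermediate symbols $a_1$, $a_2$, $\delta_1$, $\delta_2$ are introduced here only for the derivation.

```latex
% Forward pass (time index k omitted), tied weights V_1 = W_1^T
a_1 = W_1 \tilde{x} + b_1, \qquad h_1 = \phi_1(a_1), \qquad
a_2 = W_1^T h_1 + c_1, \qquad z_1 = \phi_1(a_2), \qquad
J_1 = \| x - z_1 \|^2

% Backward pass (\odot is the elementwise product)
\delta_2 = -2\,(x - z_1) \odot \phi_1'(a_2), \qquad
\delta_1 = (W_1 \delta_2) \odot \phi_1'(a_1)

% W_1 appears in both the encoder and the decoder, so two terms add up
\frac{\partial J_1}{\partial W_1} = \delta_1 \tilde{x}^T + h_1 \delta_2^T, \qquad
\frac{\partial J_1}{\partial b_1} = \delta_1, \qquad
\frac{\partial J_1}{\partial c_1} = \delta_2

% Derivative of the sigmoid \phi(\omega) = \alpha / (1 + e^{-\beta\omega}) - \gamma
\phi'(\omega) = \frac{\beta}{\alpha}\,\bigl(\phi(\omega) + \gamma\bigr)\bigl(\alpha - \phi(\omega) - \gamma\bigr)
```

Both terms of $\partial J_1 / \partial W_1$ have size $l_1 \times n$, matching $W_1$. With $\alpha = \beta = 1$ and $\gamma = 0$ the last line reduces to the familiar $\phi' = \phi(1 - \phi)$.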
The unsupervised training for the deep autoencoders model is as follows:

1) The input, the output and the hidden representation of the first model are $x(k) \in \mathbb{R}^n$, $z_1(k) \in \mathbb{R}^n$ and $h_1(k) \in \mathbb{R}^{l_1}$. We use $q$ input data to train the weights of the first model, $W_1 \in \mathbb{R}^{l_1 \times n}$, $b_1 \in \mathbb{R}^{l_1}$ and $c_1 \in \mathbb{R}^n$.
2) After the first model is trained, its weights are fixed. As stated in (5) and (6), $W_1$ and $b_1$, once fixed by the autoencoder training, become the initial weights of the first layer in (3). The code, or hidden representation, of the first model is computed with the fixed weights, and $h_1(k)$ is taken as the input of the second autoencoder model, associated with layer 2.
3) The second model is then pretrained with the input $h_1(k)$, with reconstruction $z_2(k) \in \mathbb{R}^{l_1}$ and code $h_2(k) \in \mathbb{R}^{l_2}$. Noise should also be added to this input in order to use the greedy layer-wise training [7], which treats each layer as an independent entity.
4) Then we train the third model, and so on until all $p-1$ models (one for each hidden layer) are pretrained.

This training process is shown in Fig. 2. The autoencoders model in Fig. 2 has a structure similar to that of the identification model (3).

The nonlinear system identification via deep learning includes two stages: 1) unsupervised learning with input data, where we use the pretrained weights of the autoencoders model $W_1(q), \ldots, W_p(q)$ as the initial weights of the identification model (3); 2) supervised learning with output data, where the weights are trained by a classical supervised learning method. In this paper we use the gradient descent method.
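The two stages can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that are not in the source: the sigmoid is the standard logistic function ($\alpha = \beta = 1$, $\gamma = 0$), so inputs are assumed scaled to $[0, 1]$; the decoder weights are tied as $V = W^T$; the linear output layer, which has no autoencoder of its own, starts from small random values; and the names `train_dae`, `pretrain_stack`, `forward` and `finetune` as well as the default learning rates and noise level are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(X, n_hidden, lr=0.1, noise_std=0.05, epochs=1, rng=None):
    """Train one tied-weight denoising autoencoder by per-sample gradient descent."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_in = X.shape[1]
    W = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    b, c = np.zeros(n_hidden), np.zeros(n_in)
    for _ in range(epochs):
        for x in X:
            x_noisy = x + rng.normal(0.0, noise_std, size=n_in)  # corrupted input
            h = sigmoid(W @ x_noisy + b)                         # code
            z = sigmoid(W.T @ h + c)                             # reconstruction
            dz = 2.0 * (z - x) * z * (1.0 - z)   # gradient at decoder pre-activation
            dh = (W @ dz) * h * (1.0 - h)        # gradient at encoder pre-activation
            W -= lr * (np.outer(dh, x_noisy) + np.outer(h, dz))
            b -= lr * dh
            c -= lr * dz
    return W, b

def pretrain_stack(X, hidden_sizes, **kw):
    """Stage 1: greedy layer-wise pretraining, one autoencoder per hidden layer."""
    layers, H = [], X
    for n_hidden in hidden_sizes:
        W, b = train_dae(H, n_hidden, **kw)
        layers.append((W, b))
        H = sigmoid(H @ W.T + b)  # codes from the fixed layer feed the next model
    return layers

def forward(x, layers, W_out, b_out):
    """Return all layer activations and the linear output y_hat."""
    acts = [x]
    for W, b in layers:
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts, W_out @ acts[-1] + b_out

def finetune(X, Y, layers, lr=0.05, epochs=1, rng=None):
    """Stage 2: supervised gradient descent on ||y - y_hat||^2, updating all layers."""
    rng = np.random.default_rng(0) if rng is None else rng
    W_out = rng.normal(0.0, 0.1, size=(Y.shape[1], layers[-1][0].shape[0]))
    b_out = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            acts, y_hat = forward(x, layers, W_out, b_out)
            delta = 2.0 * (y_hat - y)            # linear output layer
            grad_h = W_out.T @ delta             # backpropagate before updating W_out
            W_out -= lr * np.outer(delta, acts[-1])
            b_out -= lr * delta
            for i in range(len(layers) - 1, -1, -1):
                W, b = layers[i]
                d = grad_h * acts[i + 1] * (1.0 - acts[i + 1])
                grad_h = W.T @ d                 # uses W before its in-place update
                W -= lr * np.outer(d, acts[i])
                b -= lr * d
    return layers, W_out, b_out
```

A call such as `finetune(X, Y, pretrain_stack(X, [50, 50, 50, 50]))` would chain the two stages, with `X` of shape `(q, n)` and `Y` of shape `(q, m)`.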

For system identification, we use the following squared error:

$$J_2(k) = \|y(k) - \hat{y}(k)\|^2 \tag{10}$$

where $y(k)$ is the output of the unknown plant (2) and $\hat{y}(k)$ is the output of the neural model (3). The weights $W_i$ and biases $b_i$ are updated by

$$W_i(k+1) = W_i(k) - \eta_2 \frac{\partial J_2(k)}{\partial W_i(k)}, \quad i = 1, \ldots, p \tag{11}$$

where $\eta_2 > 0$ is the learning rate of the supervised learning, $k = 1, 2, \ldots, q$, and $q$ is the total number of training data. The biases $b_i$ can also be trained by expanding the corresponding matrix $W_i$ to contain them.

According to the deep learning literature, the supervised learning stage may be applied only to the output layer of the identification model, i.e., $W_p(k+1) = W_p(k) - \eta_2 \frac{\partial J_2(k)}{\partial W_p(k)}$, while in the other layers $W_i$, $i = 1, \ldots, p-1$, the weights and biases are kept fixed. From several simulations we found that this simple method is not effective for system identification: generalization performance is better when all layers are updated.

We use the following algorithm to identify nonlinear systems via deep learning methods:

Algorithm 1:
1) Construct a deep neural network model (3) with $p \geq 3$. The layer number $p$ and the node numbers $l_i$ ($i = 1, \ldots, p-1$) are chosen by the random search method.
2) Reconstruct the input with unsupervised learning. The final weights of the autoencoders model are the initial weights of the deep neural network model. This is a batch process with data size $q$.
3) Use the output data to train the weights of the deep neural model with supervised learning. This supervised learning can be on-line. To avoid overfitting, we use a stop criterion.

Noise (or disturbance) is an important issue in system identification. There are two types of disturbances: external and internal. An internal disturbance can be regarded as unmodeled dynamics. An external disturbance can be regarded as measurement noise, input noise, etc. From the viewpoint of deep learning, input noise is propagated forward through each layer.
Measurement noise is amplified by the backpropagation of the identification error, so the weights of the neural identification model are affected by output noise. On the other hand, a small external disturbance can accelerate convergence according to persistent excitation theory. From the control theory viewpoint, small disturbances in the control signal $u(k)$ or in the output $y(k)$ can enrich the information present in the signal $x(k)$, which is good for parameter convergence.

IV. SIMULATIONS

Gas furnace system

The gas furnace dataset is a commonly used benchmark [22]. The input $u(k)$ is the flow rate of the methane gas, while the output $y(k)$ is the concentration of CO$_2$ in the gas mixture under a steady air supply. The dataset has 296 samples at a fixed interval of 9 seconds. [22] used a time-series based approach to develop a linear model. In this paper, we use the same data structure as [22]: the recursive input data for the model is $x(k) = [y(k-1), \ldots, y(k-4), u(k), \ldots, u(k-5)]^T$, and the model output is $\hat{y}(k)$. 200 samples are used for training. In order to use the restricted Boltzmann machine (RBM), the training values of $x(k)$ and $y$ are normalized. The gas furnace dataset has the form of (2) with $n = 10$, $m = 1$.

We use 3 types of restricted Boltzmann machine (RBM) to train the hidden weights: binary input (DN_BI), interval $[0, 1]$ (DN_RB), and interval $[-1, 1]$ (DN_NE). For the interval $[0, 1]$, $x(k)$ is normalized as

$$x(k) = \frac{x(k) - \min_k\{x(k)\}}{\max_k\{x(k)\} - \min_k\{x(k)\}}.$$

We use 200 data samples to train the deep learning model. The structure parameters of the neural model, the layer number $p$ and the node number of each layer $l_i$ ($i = 1, \ldots, p$), are obtained by the random search method [12]. The result is 4 hidden layers ($p = 5$) and $l_i = 50$ ($i = 1, 2, 3, 4$), which yields the minimum test error when tested with a (DN_NO) model, as seen in Fig. 3.

Fig. 3. The structure parameters of the gas furnace (test error versus number of layers and nodes per layer).
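The structure selection step can be sketched as follows. This is only an illustration of random search in general, under assumptions not stated in the source: the function name `random_search`, the candidate lists, the number of trials and the callback `evaluate` (which is assumed to train a model with the given structure and return its test error) are all hypothetical.

```python
import random

def random_search(evaluate, n_trials, layer_choices, node_choices, seed=0):
    """Sample (hidden layers, nodes per layer) at random; keep the lowest test error."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        n_layers = rng.choice(layer_choices)  # candidate number of hidden layers
        n_nodes = rng.choice(node_choices)    # candidate nodes per hidden layer
        err = evaluate(n_layers, n_nodes)     # assumed: train and return test error
        if best is None or err < best[0]:
            best = (err, n_layers, n_nodes)
    return best
```

A grid search over the same candidates would call `evaluate` once per combination, i.e. `len(layer_choices) * len(node_choices)` times, whereas here the cost is fixed by `n_trials`.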
Our neural model (3) has four hidden layers and one linear output layer, and each hidden layer has 50 nodes. The training rate for the restricted Boltzmann machine is $\eta_3 = 0.1$, and 1-step Gibbs sampling is used. In the supervised stage the learning rate was $\eta_4 =$ [value missing]. Only one learning epoch was applied for both the pretraining and supervised phases. We compare the RBM models with a deep neural model based on denoising autoencoders (DN_AM; $\eta_1 = \eta_2 = 0.15$, 1 epoch for supervised and unsupervised learning) and a neural model without a pretraining stage (DN_BP) with learning rate $\eta_2 =$ [value missing]. Both DN_AM and DN_BP have the same structure as DN_NO (4 layers with 50 nodes per layer). In the testing phase, we define the average error as

$$\frac{1}{N} \sum_{k=1}^{N} \|\hat{y}(k) - y(k)\|^2, \quad N = 91.$$

The testing results of the three models are shown in Fig. 4. The average error ($\times 10^{-5}$) of DN_AM is 4.239, while that of DN_BP is [value missing]. For the Boltzmann machine (DN_RB), the binary case is 5.103, the $[0, 1]$ case is 4.567, and the $[-1, 1]$ case is [value missing]. We can see that DN_RB performs best when the visible units are normalized into $[0, 1]$. DN_AM is better than DN_RB, because RBMs need more examples, while this dataset only has 200. In the pretraining stage, DN_BP obtains better initial weights.

Fig. 4. The testing results of the gas furnace (output versus time; curves: System, DN_RB, DN_AM, MLP, DN_BP).

Wiener-Hammerstein system

A Wiener-Hammerstein (W-H) system is a series connection of three parts: a linear system, a static nonlinearity and another linear system. The data of the Wiener-Hammerstein benchmark is generated from an electrical circuit which consists of three cascaded blocks [23]. There is no direct measurement of the static nonlinearity, because it is located between two unknown linear dynamic systems. The benchmark dataset consists of 188,000 input/output pairs. This dataset is divided into two parts [23]: 100,000 sample pairs are for training and 88,000 samples are for testing.

Let $u(k)$ be the input and $y(k)$ be the output. We define the recursive input vector to the model as $x(k) = [y(k-1), \ldots, y(k-4), u(k), \ldots, u(k-5)]^T$. The W-H dataset also has the mathematical structure of (2) with $n = 10$, $m = 1$. We use three types of RBMs: DN_BI, DN_RB, and the case where $x(k)$ is in the interval $[-3, 3]$ (DN_NE). For the interval $[0, 1]$, $x(k)$ is also normalized.

As suggested by [23], the first 100,000 examples are used to implement the pretraining phase of the deep learning model. The hyperparameters are sampled using the random search method [12]. The best training results were found with a structure of 4 hidden layers ($p = 5$) and $l_i = 80$ ($i = 1, 2, 3, 4$), which yields the minimum test error when tested with a (DN_NO) model, as seen in Fig. 5.

Fig. 5. The testing results of the Wiener-Hammerstein system (outputs versus time; curves: System, DN_RB, DN_AM, MLP, DN_BP).
The average error ($\times 10^{-3}$) of DN_AM is 3.573, and that of DN_BP is [value missing]. For DN_RB, the binary case is [value missing], the $[0, 1]$ case is 2.534, and the $[-3, 3]$ case is [value missing]. Finally, we compare these methods with the support vector machine (SVM) [23] and multilayer perceptrons with a gradient learning algorithm (MLP) [21]. The testing squared error ($\times 10^{-3}$) is 56.03 for MLP, 43.01 for the linear kernel SVM, 6.01 for the polynomial kernel SVM, 4.71 for the RBF kernel SVM, and [value missing] for DN_BI.

The RBM model is better when the input data are positive and bounded. The autoencoders model is better when the input data are not restricted. The computational cost of the RBM model is almost twice that of the autoencoders model.

V. CONCLUSIONS

In this paper, we use input data to obtain the initial weights, and output data to train the weights. The denoising autoencoders deep learning algorithm is modified so that it is suitable for nonlinear system identification. As an alternative model for nonlinear system identification, deep neural networks have more hidden layers and fewer hidden nodes than MLPs. The computational complexity does not increase.

REFERENCES

[1] I. Rivals and L. Personnaz, Neural-network construction and selection in nonlinear modeling, IEEE Transactions on Neural Networks, Vol. 14, No. 4, 2003.
[2] S. Chakrabartty, R. K. Shaga, K. Aono, Noise-Shaping Gradient Descent-Based Online Adaptation Algorithms for Digital Calibration of Analog Circuits, IEEE Transactions on Neural Networks and Learning Systems, Vol. 24, No. 4, 2013.
[3] Y. Liu, Y. Liu, K. Chan, K. A. Hua, Hybrid Manifold Embedding, IEEE Transactions on Neural Networks and Learning Systems, Vol. 25, No. 12, 2014.
[4] Q. Song, Robust Initialization of a Jordan Network With Recurrent Constrained Learning, IEEE Transactions on Neural Networks, Vol. 22, No. 12, 2011.
[5] W. Yu, X. Li, Automated Nonlinear System Modeling with Multiple Fuzzy Neural Networks and Kernel Smoothing, International Journal of Neural Systems, Vol. 20, No. 5, 2010.
[6] G. E. Hinton, S. Osindero, and Y. Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol. 18.
[7] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layerwise training of deep networks, Advances in Neural Information Processing Systems (NIPS 06), MIT Press, 2007.
[8] Y. Bengio and O. Delalleau, Justifying and generalizing contrastive divergence, Neural Computation, vol. 21, no. 6.
[9] R. Collobert and J. Weston, A unified architecture for natural language processing: Deep neural networks with multitask learning, 25th International Conference on Machine Learning (ICML 08), ACM.
[10] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, The difficulty of training deep architectures and the effect of unsupervised pretraining, 12th International Conference on Artificial Intelligence and Statistics (AISTATS 09).
[11] J. Bergstra and Y. Bengio, Algorithms for Hyper-Parameter Optimization, Journal of Machine Learning Research.
[12] J. Bergstra and Y. Bengio, Random Search for Hyper-Parameter Optimization, Journal of Machine Learning Research, 2011.
[13] P. Vincent, H. Larochelle, Y. Bengio and P. A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders, 25th International Conference on Machine Learning (ICML 08), ACM.
[14] G. E. Hinton and T. J. Sejnowski, Learning and relearning in Boltzmann machines, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, Cambridge, MA: MIT Press.
[15] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol and P. Vincent, Why Does Unsupervised Pre-training Help Deep Learning?, Journal of Machine Learning Research, vol. 11, 2010.
[16] M. Längkvist, L. Karlsson, and A. Loutfi, A review of unsupervised feature learning and deep learning for time-series modeling, Pattern Recognition Letters 42.
[17] P. Romeu, et al., Time-Series Forecasting of Indoor Temperature Using Pre-trained Deep Neural Networks, Artificial Neural Networks and Machine Learning ICANN, Springer Berlin Heidelberg.
[18] L. Qiu, L. Zhang, Y. Ren, P. N. Suganthan, G. Amaratunga, Ensemble deep learning for regression and time series forecasting, 2014 IEEE Symposium on Computational Intelligence in Ensemble Learning (CIEL), pp. 1-6, Orlando, FL, USA, 2014.
[19] E. Busseti, I. Osband, and S. Wong, Deep learning for time series modeling, Technical report, Stanford University.
[20] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science, vol. 9.
[21] K. S. Narendra and K. Parthasarathy, Gradient methods for optimization of dynamical systems containing neural networks, IEEE Transactions on Neural Networks, March.
[22] G. Box, G. Jenkins, G. Reinsel, Time Series Analysis: Forecasting and Control, 4th Ed, Wiley.
[23] K. De Brabanter, P. Dreesen, P. Karsmakers, K. Pelckmans, J. De Brabanter, J. A. K. Suykens and B. De Moor, Fixed-size LS-SVM applied to the Wiener-Hammerstein benchmark, in Proceedings of the 15th IFAC Symposium on System Identification, Saint-Malo, France, 2009.
