Discussion of Some Problems About Nonlinear Time Series Prediction Using ν-support Vector Machine


Commun. Theor. Phys. (Beijing, China) 48 (2007), Vol. 48, No. 1, July 15, 2007. © International Academic Publishers

GAO Cheng-Feng, CHEN Tian-Lun, and NAN Tian-Shi
Department of Physics, Nankai University, Tianjin, China
(Received July 28, 2006)

Abstract Some problems in using the ν-support vector machine (ν-SVM) for the prediction of nonlinear time series are discussed. They include the selection of the net parameters that affect prediction performance, the mixture of kernels, and a decomposition cooperation linear programming ν-SVM regression, all of which improve the algorithm. Computer simulations of the prediction of nonlinear time series produced by the Mackey–Glass equation and the Lorenz equation give some improved results.

PACS numbers: …
Key words: ν-SVM, nonlinear time series, prediction

1 Introduction

Nonlinear time series prediction appears in many application areas such as natural sciences, social sciences, economics, national defence, and so on. [1] The goal is to choose an appropriate method that uncovers the intrinsic rules of a nonlinear time series and gives a high-quality prediction of its evolution. The support vector machine (SVM) was first introduced by Vapnik et al. in 1995. [2] It obeys structural risk minimization instead of the empirical risk minimization used in traditional machine learning theory. It can solve many problems that are often encountered in applications of artificial neural networks, including over-fitting, the curse of dimensionality, local minima of energy, and so on. It is a good learning method for small data sets and has a wide variety of successful applications in areas such as pattern recognition and linear regression. How to choose appropriate parameters for different models and how to reduce the training time complexity remain the main questions of interest.
In this paper we use ν-SVM [3] to predict the nonlinear time series of the Mackey–Glass equation and the Lorenz equation. First, the relations between the parameters and the prediction errors are studied; through simulation, appropriate parameters can be found quickly. Second, the mixture of kernels is investigated as a way to get better results. Finally, we propose a decomposition cooperation linear programming ν-SVM regression method that reduces the training time while keeping the errors small.

2 ν-SVM

ν-SVM is a new type of support vector machine. In this paper, we mainly study ν-SVM regression and use it to predict nonlinear time series produced by the Mackey–Glass equation and the Lorenz equation. Suppose that we have a training set of N data points $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ is the input pattern of the i-th sample (d is the space dimension) and $y_i \in \mathbb{R}$ is the corresponding desired output. In the feature space, ν-SVM can be described as

$$y = w^T \varphi(x) + b, \tag{1}$$

where the nonlinear mapping $\varphi(x)$ maps the input data into a higher-dimensional feature space. In ν-SVM, the objective function is

$$\min \; \frac{1}{2} w^T w + C\left(\nu\varepsilon + \frac{1}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*)\right), \tag{2}$$

subject to the constraints

$$(w^T \varphi(x_i) + b) - y_i \le \varepsilon + \xi_i, \quad y_i - (w^T \varphi(x_i) + b) \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0, \quad \varepsilon \ge 0, \quad i = 1, \ldots, N. \tag{3}$$

By using Lagrange multiplier techniques, according to the Karush–Kuhn–Tucker (KKT) conditions and rewriting the constraints, we get the dual optimization problem:

$$\min \; \frac{1}{2}(\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + y^T(\alpha - \alpha^*), \quad \text{s.t.} \quad e^T(\alpha - \alpha^*) = 0, \quad e^T(\alpha + \alpha^*) \le C\nu, \quad 0 \le \alpha_i, \alpha_i^* \le C/N, \quad i = 1, \ldots, N, \tag{4}$$

where $Q_{ij} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ is the kernel and e is the vector of all ones. α and α* are the introduced Lagrange multipliers. Thus, the regression estimative function (1) can be taken in the following form:

$$y = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x_i, x) + b, \tag{5}$$

where $K(x_i, x_j)$ is the kernel function. Any function that satisfies Mercer's condition [4] can be used as a kernel function. A common kernel function is the Gaussian kernel,

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right), \tag{6}$$

where σ is the width of the Gaussian kernel. The new parameter ν is used to control the training errors and the number of support vectors (SVs). To be more precise, Schölkopf et al. proved that ν is an upper bound on the fraction of error samples in N and a lower bound on the fraction of SVs in N. [3]

Footnote: The project supported by National Natural Science Foundation of China under Grant No. … and the Doctoral Foundation of Ministry of Education of China under Grant No. …. E-mail: gaochengfeng@mail.nankai.edu.cn

3 Simulation Results

In this paper, the quadratic programming problem (4) is solved with the quadprog function of the Matlab 6.5 toolbox on a Pentium IV 3.0 GHz personal computer.

3.1 Mackey–Glass Equation and Lorenz Equation

Usually, prediction of the nonlinear time series produced by the Mackey–Glass equation is regarded as a benchmark for comparing the ability of different prediction methods.

Mackey–Glass Equation The Mackey–Glass equation is a time-delayed differential equation first proposed by Mackey and Glass as a model of white blood cell production. [5] It is a high-dimensional nonlinear dynamics equation in phase space. The Mackey–Glass data are generated from the following formula,

$$x_t = \frac{\alpha x_{t-s}}{1 + (x_{t-s})^{10}} + (1 - \beta) x_{t-s}, \tag{7}$$

where α = 0.2, β = 0.1, and s is the delay time. When s ≥ 17, the system exhibits chaotic behavior with a fractal dimension. The larger s is, the higher the dimension. In this work, we choose s = 17.

Lorenz Equation The Lorenz equations were discovered by Ed Lorenz in 1963 as a very simplified model of convection rolls in the upper atmosphere. [6] Lorenz obtained three simple ordinary differential equations:

$$\dot{x} = -\sigma x + \sigma y, \quad \dot{y} = -xz + \gamma x - y, \quad \dot{z} = xy - bz, \tag{8}$$

where x is proportional to the intensity of the convective rolls, y is proportional to the temperature difference between the ascending and descending currents, and z is proportional to the distortion of the vertical temperature profile from linearity. σ, γ, and b are positive and are called the Prandtl number, the Rayleigh number, and a volume parameter. The equations exhibit chaotic behavior with σ = 10, γ = 28, b = 8/3. In our simulation experiment, we select x = 10, y = 10, z = 10. The time series must be preprocessed before prediction, and here the data are first normalized by the following formula:

$$\tilde{x}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}, \quad i = 1, 2, \ldots, \tag{9}$$

where max(x) and min(x) are the maximum and minimum values of the time series $\{x_i\}$, respectively.

3.2 Parameters C and ν

Selecting proper parameters is important in nonlinear time series prediction with a support vector machine, because they influence the prediction performance. ν-SVM has three parameters to be adjusted: the regularization constant C, which determines the trade-off between the empirical error and the regularization term; the kernel width σ; and the parameter ν, which controls the number of SVs and errors. Selecting σ is complex, and when σ is near the empirical value the training errors vary only within a small range, so we adopt the empirical value σ = 1.7. [7] We then observe the relationship between training errors and time.

Fig. 1 RMSE for different ν and various C.

The prediction error is measured with the root-mean-square error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}, \tag{10}$$

where $\hat{x}_i$ stands for the predicted value, $x_i$ is the desired value, and n denotes the number of predicted points.
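The min-max normalization of Eq. (9) and the RMSE of Eq. (10) are simple enough to state as code. The following is a minimal sketch in plain Python, not the authors' Matlab implementation; the function names and the toy numbers are illustrative assumptions.

```python
import math
from typing import List, Sequence


def min_max_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to [0, 1] as in Eq. (9): (x_i - min) / (max - min)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return [(x - lo) / (hi - lo) for x in series]


def rmse(predicted: Sequence[float], desired: Sequence[float]) -> float:
    """Root-mean-square error of Eq. (10) over the n predicted points."""
    if len(predicted) != len(desired) or len(desired) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((p - d) ** 2 for p, d in zip(predicted, desired))
    return math.sqrt(total / len(desired))


# Hypothetical toy values, only to show the call pattern.
normalized = min_max_normalize([2.0, 4.0, 6.0])
error = rmse([0.0, 0.5, 0.9], normalized)
```

Because the series is normalized to [0, 1] before training, RMSE values computed this way are on the normalized scale.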

The length of the nonlinear time series is set to 1000. The first 50 data points are used for training to construct the model, and the following 950 data points are used to test the prediction capability of the model. The embedding dimension is set to 6. We select the parameters as follows: $C \in (10^4, 10^7)$, ν = 0.2, 0.5, 0.8, step of variation of C: …. The simulation results are shown in Fig. 1.

Fig. 2 RMSE of various C and ν varying in a small interval.

Fig. 3 RMSE of various C and ν varying in a large interval.

Fig. 4 Comparing the RMSE of the time series produced by the Mackey–Glass equation and the Lorenz equation. The bracketed range in each panel title gives the data points used, for example the time series produced by the dynamic system from the 500th data point to the 1500th data point. (a) RMSE of Mackey–Glass equation (…); (b) RMSE of Mackey–Glass equation (…); (c) RMSE of Lorenz equation (…); (d) RMSE of Lorenz equation (…).
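The data preparation described above (a series of length 1000, the first 50 points for training, the next 950 for testing, embedding dimension 6) can be sketched as follows. This is an illustration under stated assumptions: the text does not give the embedding delay or the prediction horizon, so the sketch assumes unit delay and one-step-ahead prediction, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Pairs = Tuple[List[List[float]], List[float]]


def delay_embed(series: Sequence[float], dim: int = 6) -> Pairs:
    """Build (input, target) pairs: the `dim` previous values predict the next one.

    Assumption: unit delay and one-step-ahead targets.
    """
    inputs = [list(series[t - dim:t]) for t in range(dim, len(series))]
    targets = [series[t] for t in range(dim, len(series))]
    return inputs, targets


def train_test_split(series: Sequence[float], n_train: int = 50,
                     dim: int = 6) -> Tuple[Pairs, Pairs]:
    """Use the first n_train points for training and the rest for testing."""
    train = delay_embed(series[:n_train], dim)
    # Start `dim` points early so the first test target is series[n_train].
    test = delay_embed(series[n_train - dim:], dim)
    return train, test


# Placeholder series; a normalized Mackey-Glass or Lorenz series would go here.
series = [float(t) for t in range(1000)]
(train_x, train_y), (test_x, test_y) = train_test_split(series)
```

Under these assumptions the 50 training points yield 44 training pairs, and each of the 950 remaining points is a test target.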

From Fig. 1 we can see that the prediction performance depends strongly on the choice of C and ν, so a set of parameters should be fixed before predicting. Since time series produced by different models have different characteristics, it is difficult to propose a common set of parameters.

Fig. 5 Comparing the training time of the time series produced by the Mackey–Glass equation and the Lorenz equation. (a) Training time of Mackey–Glass equation (…); (b) Training time of Mackey–Glass equation (…); (c) Training time of Lorenz equation (…); (d) Training time of Lorenz equation (…).

In order to get good results, we should try each parameter from small to large. For parameter C, we first use a small interval from $10^4$ to $10^7$; the simulation result is shown in Fig. 2. Since this costs too much time, we then use a large interval, where the set of C values is given as a sequence $10^4$, …. The simulation results are shown in Fig. 3. Comparing Fig. 2 and Fig. 3, we find that the shapes of the curves are nearly the same. So in later work, we can fix the range of parameter C to $[10^4, 10^7]$ with the exponential interval instead of the small interval.

Now, let us study parameter selection by comparing the time series of the Mackey–Glass equation with the time series of the Lorenz equation when different data sets are used in the same model. The simulation results are shown in the following figures. Figure 4 shows that the training errors approximately decrease first and then increase with increasing C. Figure 5 shows that all of the training time curves are nearly the same, that is, RMSE approximately decreases with increasing C and decreasing ν. Taking both into account, we select ν = 0.3.
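A coarse search over C on an exponential grid, combined with a few values of ν, can be sketched as follows. This is only an illustration: the grid spacing (two points per decade) is an assumption, since the step used in the paper is not recoverable here, and `toy_error` is a made-up placeholder for "train a ν-SVM and return the test RMSE", not the error surface reported in the figures.

```python
import math
from typing import Callable, List, Sequence, Tuple


def log_grid(exp_lo: float, exp_hi: float, steps_per_decade: int) -> List[float]:
    """Exponentially spaced values from 10**exp_lo to 10**exp_hi inclusive."""
    n = round((exp_hi - exp_lo) * steps_per_decade)
    return [10 ** (exp_lo + k / steps_per_decade) for k in range(n + 1)]


def grid_search(c_values: Sequence[float], nu_values: Sequence[float],
                evaluate: Callable[[float, float], float]
                ) -> Tuple[float, float, float]:
    """Return (best_error, best_C, best_nu) over the full grid."""
    return min((evaluate(c, nu), c, nu) for c in c_values for nu in nu_values)


def toy_error(c: float, nu: float) -> float:
    """Hypothetical stand-in for a real train-and-score routine."""
    return abs(math.log10(c) - 5.5) + abs(nu - 0.5)


c_grid = log_grid(4, 7, steps_per_decade=2)   # assumed spacing
best_error, best_c, best_nu = grid_search(c_grid, [0.2, 0.5, 0.8], toy_error)
```

With two points per decade the grid has only 7 values of C between $10^4$ and $10^7$, which is why the exponential interval is so much cheaper than a fine linear sweep over the same range.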
Figure 6 shows that all the data sets perform well in the neighborhood of a common value of C. From the above analysis, we obtain the common parameters $C = 10^5$ and ν = 0.3, which give good performance for the time series of both the Mackey–Glass equation and the Lorenz equation. The above method of selecting parameters over a larger interval is efficient and fast, and it can also be adopted for parameter selection in other models.

Fig. 6 Section of Fig. 4 at ν = 0.3 (RMSE of various C). (a) RMSE of Mackey–Glass equation (…); (b) RMSE of Mackey–Glass equation (…); (c) RMSE of Lorenz equation (…); (d) RMSE of Lorenz equation (…).

3.3 Kernel Function

It is also important to select the kernel function, because many of the characteristics of an SVM are determined by the type of kernel function used. [8] There are many types of kernels, such as the polynomial kernel, the radial basis function (RBF), the sigmoid kernel, and so on. All of them can be classified into two classes, the local kernel and the global kernel. The RBF function of formula (6) is a typical local kernel. Figure 7 shows the curve of the RBF function for σ = 0.1, 0.2, 0.3, 0.4, 0.5, where 0.2 is the test input ($x_i$ of the RBF is the test input). From Fig. 7, we can see that a local kernel only has an effect on the data points in the neighborhood of the test point. The polynomial kernel

$$K(x, x_i) = [(x \cdot x_i) + 1]^q \tag{11}$$

is a typical global kernel. Figure 8 shows the curve of the polynomial kernel for q = 1, 2, 3, 4, 5, where 0.2 is the test input. From Fig. 8, we can see that all data points in the input domain have non-zero kernel values, so the test data point has a global effect on the other data points.

A local kernel has strong learning capacity but weak generalization ability, while a global kernel has strong generalization ability but weak learning capacity. The quality of a method is determined not only by its learning capacity but also by its generalization ability. Recently, G.F. Smits considered the mixture of kernels, [8] which can result in both good interpolation and extrapolation abilities, as follows:

$$K_{\mathrm{mix}} = \lambda K_{\mathrm{poly}} + (1 - \lambda) K_{\mathrm{rbf}}. \tag{12}$$

It also satisfies Mercer's condition. Here, we need to determine the optimal mixing coefficient $\lambda \in (0, 1)$. In the simulation, we compare the RMSE of the different nonlinear time series produced by formula (7) and formula (8) to show the effect of the mixture of kernels.

Fig. 7 Example of a local kernel: RBF.

Fig. 8 Example of a global kernel: polynomial.

In order to get both good learning capacity and good generalization ability, we select σ = 1.7 for the RBF kernel, q = 2 for the polynomial kernel, and the best set of parameters for the Mackey–Glass equation and the Lorenz equation, $C = 10^5$, ν = 0.3. The simulation results are shown in Tables 1 and 2.

Table 1 RMSE of the Lorenz equation time series with mixture of kernels (columns: value of λ, RMSE).

Table 2 RMSE of the Mackey–Glass equation time series with mixture of kernels (columns: value of λ, RMSE).

Table 1 shows that the best performance is obtained when λ = …. Adopting the mixture of kernels gives better performance, and the RBF kernel still plays an important role. Table 2 shows that the best performance is obtained when λ = …. Here, too, the mixture of kernels gives the best performance, and using the polynomial kernel alone can sometimes perform better than using the RBF kernel alone. In both cases, adopting the mixture of kernels decreases the training errors while keeping the training time unchanged.

3.4 Large Data Set Prediction

Although SVM was proposed for small data sets, massive amounts of data are now available in many applications. Training an SVM on a large data set is a very slow process, and the training time and memory become a practical bottleneck. The main reason is that the SVM algorithm is a matrix computation, so the training time increases quickly as N becomes larger.
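To make the scaling concrete: the dual problem (4) needs the full N × N kernel matrix Q. The sketch below builds Q with the mixed kernel of Eq. (12), using σ = 1.7 and q = 2 as in the text. It is a plain-Python illustration, not the authors' code. The Gaussian kernel is written as exp(−‖x − z‖²/σ²), which is an assumed scaling convention, and the value λ = 0.5 and the sample points are arbitrary placeholders.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], sigma: float = 1.7) -> float:
    """Local kernel: exp(-||x - z||^2 / sigma^2) (assumed scaling convention)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)


def poly_kernel(x: Sequence[float], z: Sequence[float], q: int = 2) -> float:
    """Global kernel of Eq. (11): ((x . z) + 1)^q."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** q


def mixed_kernel(x: Sequence[float], z: Sequence[float], lam: float,
                 sigma: float = 1.7, q: int = 2) -> float:
    """Eq. (12): lam * K_poly + (1 - lam) * K_rbf, with 0 <= lam <= 1."""
    return lam * poly_kernel(x, z, q) + (1.0 - lam) * rbf_kernel(x, z, sigma)


def gram_matrix(points: Sequence[Sequence[float]], lam: float) -> List[List[float]]:
    """Full N x N kernel matrix Q; cost and memory grow as N^2."""
    return [[mixed_kernel(p, r, lam) for r in points] for p in points]


# Arbitrary placeholder inputs and mixing coefficient.
points = [(0.1 * i, 0.2 * i) for i in range(5)]
Q = gram_matrix(points, lam=0.5)
```

Q has N² entries, so with 8-byte floats it takes about 8 MB for N = 1000 and about 72 MB for N = 3000, before the solver allocates any working memory of its own.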
How to reduce the training time and the memory therefore becomes another important problem in SVM study. Several methods have recently been used to train SVMs on large data sets, such as Chunking, [9] Sequential Minimal Optimization (SMO), [10] and SVM light, [11] but all of them iterate repeatedly in search of the optimum, which makes convergence slow. This paper improves the method of Ref. [12] and proposes a decomposition cooperation linear ν-support vector machine regression (DCL-ν-SVMR) algorithm. First, we divide the training set into several work subsets, on each of which we train a linear ν-SVM. By using a mature linear programming algorithm, the computing speed can be raised. The algorithm includes three stages: training the SVM to get the SVs, combining these SVs into a new training set to train again, and finally predicting. The simulation results show that the training speed of this algorithm is clearly improved.

DCL-ν-SVMR Model We know that the SVs describe the properties of the whole training set, and the SVs of a linear ν-SVM are only a part of the training data. We can therefore do regression learning with the SVs instead of the whole training data. This decreases the training time and the memory cost while keeping good prediction precision. The formulation of the linear ν-SVM is as follows:

$$\min \; \frac{1}{N}\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) + \frac{C}{N}\sum_{i=1}^{N}(\xi_i + \xi_i^*) + C\nu\varepsilon, \tag{13}$$

subject to the constraints

$$\sum_{i=1}^{N}(\alpha_i - \alpha_i^*) K(x_i, x_j) + b - y_j \le \varepsilon + \xi_j, \quad y_j - \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) K(x_i, x_j) - b \le \varepsilon + \xi_j^*, \quad \xi_i, \xi_i^*, \alpha_i, \alpha_i^*, \varepsilon \ge 0, \quad i = 1, \ldots, N. \tag{14}$$

Here we adopt the linprog function of the Matlab 6.5 toolbox to solve it. Based on the foregoing analysis, we propose an improved algorithm, decomposition cooperation linear regression based on ν-SVM, which mainly contains 4 steps.

(i) Divide the data set G into m work subsets $G_1, G_2, \ldots, G_m$;
(ii) Train on each work subset to pick out its set of SVs, $SV_{G_i}$;
(iii) Combine the SV sets of all subsets to get $G_{\mathrm{new}}$;
(iv) Do support vector regression on $G_{\mathrm{new}}$ to produce the regression estimative function.

We train on large data sets of nonlinear time series produced by the Mackey–Glass equation by using DCL-ν-SVMR. The calculation results are shown in Table 3, where N = 1000, 2000, and 3000 data points are used to train the network and the following 1000 data points are predicted. The final number of support vectors for pruning is half of N.

Table 3 Prediction for large data sets produced by the Mackey–Glass equation (columns: training time T (s) and RMSE, each for pruning LS-SVM and for DCL-ν-SVMR; one row per training-set size N).

Table 3 indicates that DCL-ν-SVMR trains faster, while keeping small training errors, than the method of Ref. [13], in which the least important data are pruned for LS-SVM (least squares support vector machine).

4 Conclusion

In this paper, we predict nonlinear time series by applying ν-SVM and analyze the net parameter settings that affect the precision of prediction. Adjusting the parameters C and ν by a traversal method, the simulation results indicate that proper parameters can be found quickly. We adopt the mixture of kernels instead of traditional kernels and get good prediction performance. In addition, we propose the DCL-ν-SVMR algorithm, which overcomes the slow training speed on large data sets: compared with pruning LS-SVM, it decreases the training time greatly while keeping the training errors small. In further study, we are going to focus on the mixture of kernels and on predicting noisy nonlinear time series by DCL-ν-SVMR, which is believed to have a wide range of application in the prediction of practical data.
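The four steps (i)–(iv) of the DCL-ν-SVMR procedure in Section 3.4 can be sketched as a small pipeline. This is a structural sketch only, under loud assumptions: the trainer is passed in as a callable with a hypothetical interface (it returns a predictor and the indices of its support vectors), and `toy_tube_trainer` is a deliberately trivial stand-in, not a linear programming ν-SVM. A real use would plug in a solver for Eqs. (13) and (14).

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]
Predictor = Callable[[Sequence[float]], float]
# Hypothetical trainer interface: samples -> (predictor, support-vector indices).
Trainer = Callable[[List[Sample]], Tuple[Predictor, List[int]]]


def split_into_subsets(data: List[Sample], m: int) -> List[List[Sample]]:
    """Step (i): divide G into m contiguous work subsets G_1..G_m."""
    if m < 1 or m > len(data):
        raise ValueError("m must be between 1 and the number of samples")
    n = len(data)
    return [data[n * k // m: n * (k + 1) // m] for k in range(m)]


def dcl_train(data: List[Sample], m: int,
              trainer: Trainer) -> Tuple[Predictor, List[Sample]]:
    """Steps (ii)-(iv): train per subset, merge the SVs, train again."""
    g_new: List[Sample] = []
    for subset in split_into_subsets(data, m):
        _, sv_indices = trainer(subset)               # step (ii)
        g_new.extend(subset[i] for i in sv_indices)   # step (iii)
    if not g_new:
        raise ValueError("no support vectors were selected")
    predictor, _ = trainer(g_new)                     # step (iv)
    return predictor, g_new


def toy_tube_trainer(samples: List[Sample],
                     eps: float = 1.5) -> Tuple[Predictor, List[int]]:
    """Stand-in trainer (NOT a nu-SVM): constant mean predictor; the points
    on or outside the eps-tube around the mean play the role of SVs."""
    mean = sum(y for _, y in samples) / len(samples)
    sv_indices = [i for i, (_, y) in enumerate(samples) if abs(y - mean) >= eps]
    return (lambda x: mean), sv_indices


# Hypothetical toy data: 100 samples whose targets cycle through 0..4.
data = [((float(t),), float(t % 5)) for t in range(100)]
predictor, g_new = dcl_train(data, m=4, trainer=toy_tube_trainer)
```

The point of the design is that the second training pass sees only the merged support vectors, so its problem size is set by the number of SVs rather than by N.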

References

[1] Neural Networks in Financial Engineering, eds. A.N. Refenes, Y. Abu-Mostafa, J. Moody, and A. Weigend, World Scientific, Singapore (1996).
[2] C. Cortes and V. Vapnik, Machine Learning 20 (1995) 273.
[3] B. Schölkopf, et al., Neural Computation 12 (2000).
[4] V.N. Vapnik, Statistical Learning Theory, Wiley, New York (2001).
[5] M.C. Mackey and L. Glass, Science 197 (1977) 287.
[6] E.N. Lorenz, J. Atmos. Sci. 20 (1963) 130.
[7] Xu Rui-Rui, Chen Tian-Lun, and Gao Cheng-Feng, Commun. Theor. Phys. (Beijing, China) 45 (2006) 641.
[8] G.F. Smits and E.M. Jordaan, Proc. of IJCNN 02 on Neural Networks 3 (2002).
[9] P.S. Bradley and O.L. Mangasarian, Optimization Methods and Software 13 (2000) 1.
[10] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, in Advances in Kernel Methods - Support Vector Learning, eds. Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola, MIT Press, Cambridge (1999).
[11] T. Joachims, Making Large-Scale Support Vector Machine Learning Practical, in Advances in Kernel Methods - Support Vector Learning, eds. Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola, MIT Press, Cambridge (1999) 169.
[12] Feng Guo-He and Zhu Si-Ming, Journal of South China University of Technology 33 (2005) 19.
[13] Xu Rui-Rui, Bian Guo-Xing, Gao Cheng-Feng, and Chen Tian-Lun, Commun. Theor. Phys. (Beijing, China) 43 (2005) 1056.
