Linear Dependency Between ε and the Input Noise in ε-Support Vector Regression

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003, pp. 544-553

Linear Dependency Between ε and the Input Noise in ε-Support Vector Regression

James T. Kwok and Ivor W. Tsang

Manuscript received February 13, 2002; revised November 13, 2002. The authors are with the Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong (e-mail: jamesk@cs.ust.hk; ivor@cs.ust.hk). Digital Object Identifier 10.1109/TNN.2003.810604

Abstract: In using the ε-support vector regression (ε-SVR) algorithm, one has to decide a suitable value for the insensitivity parameter ε. Smola et al. considered its optimal choice by studying the statistical efficiency in a location parameter estimation problem. While they successfully predicted a linear scaling between the optimal ε and the noise in the data, their theoretically optimal value does not have a close match with its experimentally observed counterpart in the case of Gaussian noise. In this paper, we attempt to better explain their experimental results by studying the regression problem itself. Our resultant predicted choice of ε is much closer to the experimentally observed optimal value, while again demonstrating a linear trend with the input noise.

Index Terms: Support vector machines (SVMs), support vector regression.

I. INTRODUCTION

In recent years, the use of support vector machines (SVMs) on various classification and regression problems has become increasingly popular. SVMs are motivated by results from statistical learning theory and, unlike other machine learning methods, their generalization performance does not depend on the dimensionality of the problem [3], [24]. In this paper, we focus on regression problems and consider the ε-support vector regression (ε-SVR) algorithm [4], [20] in particular. The ε-SVR has produced the best result on a time-series prediction benchmark [11], as well as showing promising results in a number of different applications [7], [12], [22].

One issue with ε-SVR is how to set the insensitivity parameter ε. Data-resampling techniques such as cross-validation can be used [11], though they are usually very expensive in terms of computation and/or data. A more efficient approach is to use a variant of the SVR algorithm called ν-support vector regression (ν-SVR) [17]. By using another parameter ν to trade off ε with model complexity and training accuracy, ν-SVR allows the value of ε to be automatically determined. Moreover, it can be shown that ν represents an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Thus, in situations where some prior knowledge on the value of ν is available, using ν-SVR may be more convenient than ε-SVR.

Another approach is to consider the theoretically optimal choice of ε. Smola et al. [19] tackled this by studying the simpler location parameter estimation problem and derived the asymptotically optimal choice of ε by maximizing statistical efficiency. They also showed that this optimal value scales linearly with the noise in the data, which is confirmed in their experiment. However, in the case of Gaussian noise, their predicted value of this optimal ε does not have a close match with their experimentally observed value. In this paper, we attempt to better explain their experimental results. Instead of working on the location parameter estimation problem as in [19], our analysis will be based on the original ε-SVR formulation.

The rest of this paper is organized as follows. A brief introduction to ε-SVR is given in Section II. The analysis of the linear dependency between ε and the input noise level is given in Section III, while the last section gives some concluding remarks.
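The ν-SVR property mentioned above, that ν upper-bounds the fraction of training errors and lower-bounds the fraction of support vectors, is easy to observe with an off-the-shelf implementation. The sketch below uses scikit-learn's NuSVR on synthetic data; the data, kernel, and parameter values are my own illustrative choices, not taken from the paper.

```python
# Observing the nu-SVR property: nu lower-bounds the fraction of support vectors
# (and upper-bounds the fraction of margin errors). Synthetic data for illustration.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 1))
y = np.sinc(X).ravel() + rng.normal(0.0, 0.1, 300)

for nu in [0.1, 0.3, 0.6]:
    model = NuSVR(nu=nu, C=10.0, kernel="rbf").fit(X, y)
    frac_sv = len(model.support_) / len(X)
    print(f"nu = {nu:.1f}  fraction of support vectors = {frac_sv:.2f}")
```

As ν is increased, the reported fraction of support vectors stays at or above ν, which is the bound quoted above.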
II. ε-SVR

In this section, we introduce some basic notation for ε-SVR. Interested readers are referred to [3], [20], [24] for more complete reviews.

Let the training set be {(x_i, y_i)}_{i=1}^N, with input x_i and output y_i. In ε-SVR, x is first mapped to φ(x) in a Hilbert space F (with inner product ⟨·,·⟩) via a nonlinear map φ. This space is often called the feature space and its dimensionality is usually very high (sometimes infinite). Then, a linear function f(x) = ⟨w, φ(x)⟩ + b is constructed in F such that it deviates least from the training data according to Vapnik's ε-insensitive loss function

    |y - f(x)|_ε = 0                  if |y - f(x)| ≤ ε,
                   |y - f(x)| - ε     otherwise,

while at the same time it is as flat as possible (i.e., ‖w‖² is as small as possible). Mathematically, this means

    minimize    (1/2)‖w‖² + C Σ_{i=1}^N (ξ_i + ξ_i*)
    subject to  y_i - ⟨w, φ(x_i)⟩ - b ≤ ε + ξ_i,
                ⟨w, φ(x_i)⟩ + b - y_i ≤ ε + ξ_i*,
                ξ_i, ξ_i* ≥ 0,   i = 1, ..., N,                          (1)

where C is a user-defined constant. It is well known that (1) can be transformed to the following quadratic programming (QP) problem:

    maximize    -(1/2) Σ_{i,j} (α_i - α_i*)(α_j - α_j*) ⟨φ(x_i), φ(x_j)⟩
                - ε Σ_i (α_i + α_i*) + Σ_i y_i (α_i - α_i*)              (2)
    subject to  Σ_i (α_i - α_i*) = 0,   α_i, α_i* ∈ [0, C].
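In practice this QP need not be coded by hand. As a rough illustration of the formulation above (not the authors' own implementation), the sketch below fits an ε-SVR with scikit-learn's SVR, which solves the dual (2) internally; the RBF kernel, the data, and the parameter values are arbitrary choices for the example.

```python
# Minimal epsilon-SVR example using scikit-learn (illustrative only; the kernel,
# C, and epsilon values below are arbitrary, not the paper's settings).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))           # inputs drawn uniformly on [-1, 1]
y = np.sinc(X).ravel() + rng.normal(0.0, 0.1, 200)  # sinc target plus Gaussian noise

# SVR solves the dual QP in (2); epsilon is the width of the insensitive tube,
# C the regularization constant in (1).
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)

print("number of support vectors:", len(model.support_))
print("predictions on first 3 training points:", model.predict(X[:3]))
```

Larger ε typically yields fewer support vectors, since points lying inside the tube do not contribute to the dual solution.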

However, recall that the dimensionality of F, and thus also of w, is usually very high. Hence, in order for this approach to be practical, a key characteristic of ε-SVR, and of kernel methods in general, is that one can obtain the inner products ⟨φ(x_i), φ(x_j)⟩ in (2) without having to explicitly obtain φ first. This is achieved by using a kernel function k such that

    k(x, x') = ⟨φ(x), φ(x')⟩.                                            (3)

For example, the p-order polynomial kernel corresponds to a map into the space spanned by all products of exactly p coordinates of x [24]. More generally, it can be shown that any function satisfying Mercer's theorem can be used as a kernel, and each will have an associated map φ such that (3) holds [24]. Computationally, ε-SVR (and kernel methods in general) also has the important advantage that only quadratic programming, not nonlinear optimization, is involved (the SVM problem can also be formulated as a linear programming problem [9], [18] instead of a quadratic programming problem). Thus, the use of kernels provides an elegant nonlinear generalization of many existing linear algorithms [2], [3], [10], [16].

III. LINEAR DEPENDENCY BETWEEN ε AND THE INPUT NOISE PARAMETER

In order to derive a linear dependency between ε and the scale parameter of the input noise model, Smola et al. [19] considered the simpler problem of estimating the univariate location parameter θ from a set of data points x_i = θ + ξ_i, where the ξ_i's are i.i.d. noise belonging to some distribution. Using the Cramer-Rao information inequality for unbiased estimators, the maximum likelihood estimator of θ with an optimal value of ε was then obtained by maximizing its statistical efficiency.

In this paper, instead of working on the location parameter estimation problem, we study the regression problem of estimating the (possibly multivariate) weight parameter w given a set of {(x_i, y_i)}, with

    y_i = ⟨w, x_i⟩ + ξ_i.                                                (4)

Here, x follows a distribution p(x) and the noise ξ follows a distribution p(ξ), with the corresponding density on the outputs denoted accordingly. Notice that the bias term b has been dropped here for simplicity; this is equivalent to assuming that the y_i's have zero mean. Moreover, using the notation in Section II, in general one can replace x in (4) by φ(x) in feature space and thus recover the original ε-SVR setting. Besides, while the work in [19] is based on maximum likelihood estimation, it is now well known that ε-SVR is related instead to maximum a posteriori (MAP) estimation. As discussed in [14], [20], [21], the ε-insensitive loss function leads to a corresponding probability density function on the noise, given in (5). Notice that [14], [20], [21] do not have the scale factor in (5); it is introduced here to play the role of controlling the noise level [6] (to be more precise, we will show in Section III-B that its optimal value is inversely proportional to the noise level of the input noise model). With a Gaussian prior on w, on applying the Bayes rule we obtain (6), up to an additive constant. On a suitable setting of the constants, the optimization problem in (1) can then be interpreted as finding the MAP estimate of w at given values of the hyperparameters. (In this paper, ε and the scale factor in (5) are regarded as hyperparameters, while the prior parameter is treated as a constant; thus, dependence on it will not be explicitly mentioned in the sequel.)

A. Estimating the Optimal Values of the Hyperparameters

In general, the MAP estimate in (6) depends on the particular training set and a closed-form solution is difficult to obtain. To simplify the analysis, we replace the training-set-dependent term in (6) by its expectation. Equation (6) thus becomes (7), up to an additive constant. On setting its partial derivative with respect to w to zero, it can be shown that the estimate ŵ has to satisfy (8).
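To make the MAP reading of (1) concrete, the sketch below minimizes a regularized ε-insensitive objective for a linear model without bias, i.e., the negative log-posterior obtained from an ε-insensitive likelihood and a Gaussian prior on w. The symbols used here (C for the likelihood scale, sigma_w for the prior scale) and the optimizer are illustrative choices and need not match the paper's notation.

```python
# MAP estimate of w under an epsilon-insensitive likelihood and a Gaussian prior.
# Notation (C, sigma_w) and the optimizer are illustrative assumptions, not the paper's.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.1, N)     # linear model, no bias term (as in (4))

eps, C, sigma_w = 0.1, 10.0, 1.0

def eps_insensitive(r, eps):
    """Vapnik's epsilon-insensitive loss applied elementwise to residuals r."""
    return np.maximum(np.abs(r) - eps, 0.0)

def neg_log_posterior(w):
    prior = 0.5 * np.dot(w, w) / sigma_w**2              # Gaussian prior on w
    likelihood = C * np.sum(eps_insensitive(y - X @ w, eps))
    return prior + likelihood

# The objective is convex but non-smooth, so a derivative-free method is used.
res = minimize(neg_log_posterior, x0=np.zeros(d), method="Powell")
print("MAP estimate of w:", np.round(res.x, 3))
```

Up to constants, minimizing this objective is essentially solving (1) with a linear kernel and no bias, with the effective regularization constant depending on both C and sigma_w.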

To find suitable values for the hyperparameters, we use the method of the type II maximum likelihood prior (ML-II) in Bayesian statistics [1]. (Notice that MacKay's evidence framework [8], which has been popularly used in the neural networks community, is computationally equivalent to the ML-II method.) The ML-II solutions are those hyperparameter values that maximize the evidence, which can be obtained by integrating out w; the width of the resulting integrand measures the posterior uncertainty of w. Assuming that the contributions due to this width are comparable at different hyperparameter values, maximizing the evidence with respect to the hyperparameters becomes, approximately, maximizing the value of the objective at the MAP estimate. Differentiating (7) with respect to the hyperparameters and then using (8), we have (9) and (10). Setting (9) and (10) to zero, we obtain (11) and (12). These can then be used to solve for the optimal hyperparameter values.

B. Applications to Some Common Noise Models

Recall that the ε-insensitive loss function implicitly corresponds to the noise model in (5). Of course, in cases where the underlying noise model of a particular data set is known, the loss function should be chosen such that it has a close match with this known noise model, while at the same time ensuring that the resultant optimization problem can still be solved efficiently. It is thus possible that the ε-insensitive loss function may not be best. For example, when the noise model is known to be Gaussian, the corresponding loss function is the squared loss function and the resultant model becomes a regularization network [5], [13] (sometimes also called the least-squares SVM [23] in the SVM literature). However, an important advantage of ε-SVR with the ε-insensitive loss function, just like SVMs with the hinge loss in classification problems, is that sparseness of the dual variables can be ensured. On the contrary, sparseness will be lost if the squared loss function is used instead. Thus, even for Gaussian noise, the ε-insensitive loss function is sometimes still desirable and the resultant performance is often very competitive.

In this section, by solving (11) and (12), we will show that there is always a linear relationship between the optimal value of ε and the noise level, even when the true noise model is different from (5). As in [19], three commonly used noise models, namely the Gaussian, Laplacian, and uniform models, will be studied.

1) Gaussian Noise: For the Gaussian noise model, where σ denotes the standard deviation, it can be shown that (11) and (12) reduce to (13) and (14), respectively, where erfc(·) is the complementary error function [15]. Substituting (13) into (14) and performing the integration (here, the integration is performed by using the symbolic math toolbox of Matlab), it can be shown that σ always enters the solution through the factor ε/σ, and thus there is always a linear relationship between the optimal values of ε and σ.
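The integration step mentioned above, done in the paper with Matlab's symbolic toolbox, can be reproduced in spirit with an open-source computer algebra system. As a hedged illustration only, the sketch below uses SymPy to integrate the expected ε-insensitive loss under zero-mean Gaussian noise, a quantity of the same flavor as the expectations entering (13) and (14) but not claimed to be the paper's exact expression. The result comes out in terms of the (complementary) error function and, apart from an overall factor of σ, depends on ε and σ only through the ratio ε/σ.

```python
# Symbolic computation of E[|xi|_eps] for xi ~ N(0, sigma^2) with SymPy.
# This mirrors the kind of integration the paper performs with Matlab's
# symbolic toolbox; the exact expressions (13)-(14) in the paper may differ.
import sympy as sp

xi = sp.symbols('xi', real=True)
eps, sigma = sp.symbols('epsilon sigma', positive=True)

gauss = sp.exp(-xi**2 / (2 * sigma**2)) / (sp.sqrt(2 * sp.pi) * sigma)

# Expected epsilon-insensitive loss: by symmetry, twice the integral over [eps, oo).
expected_loss = 2 * sp.integrate((xi - eps) * gauss, (xi, eps, sp.oo))
expected_loss = sp.simplify(expected_loss)
print(expected_loss)

# Rewriting in terms of the ratio rho = eps/sigma makes the scaling explicit.
rho = sp.symbols('rho', positive=True)
print(sp.simplify(expected_loss.subs(eps, rho * sigma) / sigma))
```

In the second expression the σ factors cancel, which is the same ε/σ scaling argument used in the text.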

Fig. 1. Objective function h(ε/σ) to be minimized in the case of Gaussian noise.

If we assume (15), and also that the variation of ŵ with ε is small (from (8), the derivative of ŵ with respect to ε involves the difference of two very similar integral terms and will therefore be small), then maximizing the objective in (7) is effectively the same as minimizing the function h(ε/σ) in (16); more detailed derivations are given in the Appendix. A plot of h(ε/σ) is shown in Fig. 1. Its minimum can be obtained numerically, as given in (17). By substituting (15) and (17) into (29), we can also obtain the optimal value of the scale factor in (5), which confirms its role in controlling the noise variance, as mentioned in Section III.

In the following, we repeat an experiment in [19] to verify this ratio of ε to σ. The target function to be learned is sinc(x). The training set consists of 200 x's drawn independently from a uniform distribution on [-1, 1], and the corresponding outputs have Gaussian noise added. Testing error is obtained by directly integrating the squared difference between the true target function and the estimated function over the range [-1, 1]. As in [19], the model selection problem for the regularization parameter C in (1) is side-stepped by always choosing the value of C that yields the smallest testing error. The whole experiment is repeated 40 times and the spline kernel with an infinite number of knots is used. Fig. 2 shows the linear relationship between the optimal ε and σ obtained from the experiment. The observed ratio is close to the value obtained in the experiment in [19] and is also in good agreement with our predicted ratio, whereas the theoretical results in [19] suggest a different optimal choice of ε.
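A rough, self-contained version of this experiment can be scripted as follows. It follows the setup described above (sinc target, 200 uniform inputs on [-1, 1], Gaussian noise, test error by integration over [-1, 1], C chosen by test error), but substitutes an RBF kernel for the paper's infinite-knot spline kernel and uses coarse grids and few repetitions, so the fitted slope should be read only as a qualitative check of the linear ε-σ trend.

```python
# Empirical check of the linear trend between the best epsilon and the Gaussian
# noise level sigma, loosely following the experiment described above.
# Differences from the paper: RBF kernel instead of the infinite-knot spline
# kernel, a coarse grid search, and fewer repetitions.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_test = np.linspace(-1.0, 1.0, 1001).reshape(-1, 1)   # dense grid approximating the integral
f_true = np.sinc(x_test).ravel()                        # np.sinc is sin(pi x)/(pi x)

def best_epsilon(sigma, n_train=200, reps=5):
    """Return the epsilon (averaged over reps) giving the smallest test error."""
    eps_grid = sigma * np.linspace(0.1, 2.0, 20)
    C_grid = [1.0, 10.0, 100.0]
    best = []
    for _ in range(reps):
        X = rng.uniform(-1.0, 1.0, size=(n_train, 1))
        y = np.sinc(X).ravel() + rng.normal(0.0, sigma, n_train)
        errors = []
        for eps in eps_grid:
            # Side-step model selection for C by taking the best test error over C.
            err = min(np.mean((SVR(kernel="rbf", C=C, epsilon=eps).fit(X, y)
                               .predict(x_test) - f_true) ** 2) for C in C_grid)
            errors.append(err)
        best.append(eps_grid[int(np.argmin(errors))])
    return float(np.mean(best))

sigmas = np.array([0.05, 0.1, 0.2, 0.4])
best_eps = np.array([best_epsilon(s) for s in sigmas])
slope = np.polyfit(sigmas, best_eps, 1)[0]             # slope of best-epsilon vs sigma
for s, e in zip(sigmas, best_eps):
    print(f"sigma = {s:.2f}  best epsilon ~ {e:.3f}")
print(f"fitted slope (best epsilon / sigma): {slope:.2f}")
```

The fitted slope gives a rough empirical counterpart to the ratio plotted in Fig. 2.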

Fig. 2. Relationship between the optimal value of ε and the Gaussian noise level.

Fig. 3. Objective function h(ε/σ) to be minimized in the case of Laplacian noise.

2) Laplacian Noise: Next, we consider the Laplacian noise model. Here, we only consider the simpler case when x is one-dimensional, with uniform density over a given range. After tedious computation, it can be shown that (11) and (12) reduce to (18) and (19), respectively.

Fig. 4. Relationship between the optimal value of ε and the Laplacian noise level. (a) One-dimensional x. (b) Two-dimensional x.

Substituting (19) back into (18), we notice again that the noise level always enters through the factor ε/σ, and thus there is always a linear relationship between the optimal value of ε and σ under the Laplacian noise. Recall that the variance of a uniform distribution over an interval is one-twelfth of the squared interval length, and that the variance of the exponential distribution (with density p(ξ) = λ exp(-λξ) for λ > 0 and ξ ≥ 0) is 1/λ².
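These two variance facts are easy to sanity-check numerically; the sketch below does so with NumPy (the parameter values are arbitrary).

```python
# Quick numerical check of the two variance facts recalled above.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Uniform distribution on [-d, d]: variance (2d)^2 / 12 = d^2 / 3.
d = 0.5
u = rng.uniform(-d, d, n)
print(u.var(), d**2 / 3)

# Exponential distribution with rate lam (density lam * exp(-lam * x), x >= 0):
# variance 1 / lam^2.  NumPy's parameter is the scale 1 / lam.
lam = 2.0
e = rng.exponential(1.0 / lam, n)
print(e.var(), 1.0 / lam**2)
```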

As in Section III-B1, we consider the same setting and make the corresponding assumption; consequently, we obtain (20) and hence (21). Analogous to Section III-B1, by plugging in (18), (19), and (21), it can be shown that the problem again reduces to minimizing an objective function h(ε/σ), plotted in Fig. 3. Again, the minimum can be obtained numerically; it is ε = 0, which was also obtained in [19]. Notice that this is intuitively correct, as the density function in (5) degenerates to the Laplacian density when ε = 0. Moreover, we can also obtain the optimal value of the scale factor in (5), again showing that it is inversely proportional to the noise level of the Laplacian noise model.

To verify this, we repeat the experiment in Section III-B1 but with Laplacian noise instead. Moreover, as our analysis applies only when x is one-dimensional, we also investigate experimentally the optimal value of ε in the two-dimensional case. The ratios obtained for the one- and two-dimensional cases (Fig. 4) are close to our prediction.
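The degeneracy argument above suggests that, under Laplacian noise, small ε values should do best. A rough empirical check, under the same illustrative setup as the earlier Gaussian sketch (RBF kernel, coarse grids, single training set, not the paper's spline kernel), might look like the following.

```python
# Rough check that small epsilon is preferred under Laplacian noise.
# Same illustrative setup as the earlier Gaussian sketch (RBF kernel, coarse grids).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x_test = np.linspace(-1.0, 1.0, 1001).reshape(-1, 1)
f_true = np.sinc(x_test).ravel()

scale = 0.1                                  # Laplace scale b; the std is sqrt(2) * b
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sinc(X).ravel() + rng.laplace(0.0, scale, 200)

for eps in [0.0, 0.05, 0.1, 0.2, 0.4]:
    err = min(np.mean((SVR(kernel="rbf", C=C, epsilon=eps).fit(X, y)
                       .predict(x_test) - f_true) ** 2) for C in [1.0, 10.0, 100.0])
    print(f"epsilon = {eps:.2f}  test error ~ {err:.5f}")
```

With a single training set this is only indicative; the paper averages over 40 repetitions.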

Fig. 5. Scenario, in the case of uniform noise, for which the corresponding SVR solution is very poor (see text).

Fig. 6. Relationship between the optimal value of ε and the uniform noise level. (a) One-dimensional x. (b) Two-dimensional x.

3) Uniform Noise: Finally, we consider the uniform noise model in (22), whose density is constant inside a bounded interval and zero otherwise (written with an indicator function). Again, we only consider the simpler case when x is one-dimensional, with uniform density over a given range. Moreover, only one of the possible cases will be discussed here; the derivation for a second case is similar, while in the remaining case the corresponding SVR solution will be very poor (Fig. 5) and is thus usually not of interest. It can be shown that (11) and (12) reduce to (23). Combining and simplifying, and substituting back into (23), we obtain (24). Again, we notice that the noise level always enters through the factor ε/σ, and thus there is also a linear relationship between the optimal value of ε and σ in the case of uniform noise. As in Section III-B1, using (20) again, substituting back into (24), and proceeding as before, we obtain the optimal ε after tedious computation. This is intuitively correct, since the density function in (5) then becomes effectively the same as (22): noise with magnitude larger than the size of the ε-tube becomes impossible. Moreover, as a side-product, we can also obtain the optimal value of the scale factor in (5), showing that it is again inversely proportional to the noise level. Fig. 6 shows the linear relationships obtained from the experiment for the one- and two-dimensional cases; these are again in close match with our prediction.
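The "noise never leaves the tube" intuition is easy to see numerically: if the noise is uniform on a bounded interval and ε is at least the half-width of that interval, the ε-insensitive loss of the noise itself is exactly zero. The half-width delta below is my own notation for the sketch.

```python
# For uniform noise on [-delta, delta], an epsilon-tube with epsilon >= delta
# incurs zero epsilon-insensitive loss on the noise itself.
import numpy as np

rng = np.random.default_rng(0)
delta = 0.3
noise = rng.uniform(-delta, delta, 100_000)

def eps_insensitive(r, eps):
    return np.maximum(np.abs(r) - eps, 0.0)

for eps in [0.5 * delta, 0.9 * delta, 1.0 * delta, 1.5 * delta]:
    print(f"epsilon = {eps:.2f}  mean eps-insensitive loss of the noise = "
          f"{eps_insensitive(noise, eps).mean():.5f}")
```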

IV. CONCLUSION

In this paper, we study the ε-SVR problem and derive the optimal choice of ε at a given value of the input noise parameter. While the results in [19] considered only the problem of location parameter estimation using maximum likelihood, our analysis is based on the original ε-SVR formulation and corresponds to MAP estimation. Consequently, our predicted ratio of ε to the noise level in the case of Gaussian noise is much closer to the experimentally observed optimal value. Besides, we accord with [19] in that ε scales linearly with the scale parameter under the Gaussian, Laplacian, and uniform noise models.

In order to apply these linear dependency results in practical applications, one has to first arrive at an estimate of the noise level. One way to obtain this is by using Bayesian methods (e.g., [6]). In the future, we will investigate the integration of these two and also its applications to some real-world problems. Besides, the work here is also useful in designing simulation experiments. Typically, a researcher or practitioner may be experimenting with a new technique on ε-SVR while not directly addressing the issue of finding the optimal value of ε. The results here can then be used to ascertain that a suitable value or range of ε has been chosen in the simulation. Finally, our analyses under the Laplacian and uniform noise models are restricted to the case in which the input is one-dimensional with uniform density over a certain range. Extensions to the multivariate case and to other noise models will also be investigated in the future.
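As a practical illustration of the workflow suggested above (first estimate the noise level, then set ε proportionally to it), the snippet below estimates the residual noise scale with a simple pilot fit and sets ε as a multiple of that estimate. The pilot estimator and the constant c are placeholders of my own, not the paper's choices; the appropriate constant depends on the assumed noise model, as derived in Section III, and the paper itself points to Bayesian noise-level estimation [6] instead of a pilot fit.

```python
# Illustrative workflow: estimate the noise level, then set epsilon = c * sigma_hat.
# The pilot estimator and the constant c are placeholders, not the paper's values;
# the right c depends on the assumed noise model (Section III).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sinc(X).ravel() + rng.normal(0.0, 0.2, 200)

# Pilot fit (here a simple k-NN smoother) to get residuals and a noise-scale estimate.
pilot = KNeighborsRegressor(n_neighbors=10).fit(X, y)
residuals = y - pilot.predict(X)
sigma_hat = np.std(residuals)

c = 0.7          # placeholder proportionality constant (depends on the noise model)
model = SVR(kernel="rbf", C=10.0, epsilon=c * sigma_hat).fit(X, y)
print(f"estimated noise level: {sigma_hat:.3f}, chosen epsilon: {c * sigma_hat:.3f}")
```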

APPENDIX

Recall the notation introduced in Section III-B1. In the following, we will use the second-order Taylor series expansions in (25) and (26). Using (25), (13) then becomes (27). Similarly, using (26) together with (13) and (25), (14) simplifies to (28), from which we obtain (29). Substituting (27) into (28) and simplifying, we have (30). Now, maximizing the objective in (7) can equivalently be written as a minimization; on substituting (15), (27), (29), and (30), this becomes the problem of minimizing the function h(ε/σ) in (16).

REFERENCES

[1] J. O. Berger, Statistical Decision Theory and Bayesian Analysis, 2nd ed., ser. Springer Series in Statistics. New York: Springer-Verlag, 1985.
[2] V. Cherkassky and F. Mulier, Learning From Data: Concepts, Theory and Methods. New York: Wiley, 1998.
[3] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines. New York: Cambridge Univ. Press, 2000.
[4] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik, "Support vector regression machines," in Advances in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. San Mateo, CA: Morgan Kaufmann, 1997, pp. 155-161.
[5] T. Evgeniou, M. Pontil, and T. Poggio, "Regularization networks and support vector machines," Advances in Computational Mathematics, 1999.
[6] M. H. Law and J. T. Kwok, "Bayesian support vector regression," in Proc. 8th Int. Workshop on Artificial Intelligence and Statistics, Key West, FL, 2001, pp. 239-244.
[7] Y. Li, S. Gong, and H. Liddell, "Support vector regression and classification based multi-view face detection and recognition," in Proc. 4th IEEE Int. Conf. Face and Gesture Recognition, Grenoble, France, 2000, pp. 300-305.
[8] D. J. C. MacKay, "Bayesian interpolation," Neural Comput., vol. 4, no. 3, pp. 415-447, May 1992.
[9] O. L. Mangasarian and D. R. Musicant, "Robust linear and support vector regression," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, Sept. 2000.
[10] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, "An introduction to kernel-based learning algorithms," IEEE Trans. Neural Networks, vol. 12, pp. 181-202, Mar. 2001.
[11] K.-R. Müller, A. J. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen, and V. Vapnik, "Predicting time series with support vector machines," in Proc. Int. Conf. Artificial Neural Networks, 1997, pp. 999-1004.
[12] C. Papageorgiou, F. Girosi, and T. Poggio, "Sparse correlation kernel reconstruction," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, Phoenix, AZ, 1999, pp. 1633-1636.
[13] T. Poggio and F. Girosi, A Theory of Networks for Approximation and Learning. Cambridge, MA: MIT Press, 1989.
[14] M. Pontil, S. Mukherjee, and F. Girosi, On the Noise Model of Support Vector Machine Regression. Cambridge, MA: MIT Press, 1998.
[15] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C, 2nd ed. New York: Cambridge Univ. Press, 1992.
[16] B. Schölkopf, C. Burges, and A. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1998.
[17] B. Schölkopf, A. Smola, and R. C. Williamson, "New support vector algorithms," Neural Comput., vol. 12, no. 4, pp. 1207-1245, 2000.
[18] A. Smola, B. Schölkopf, and G. Rätsch, "Linear programs for automatic accuracy control in regression," in Proc. Int. Conf. Artificial Neural Networks, 1999, pp. 575-580.
[19] A. J. Smola, N. Murata, B. Schölkopf, and K.-R. Müller, "Asymptotically optimal choice of ε-loss for support vector machines," in Proc. Int. Conf. Artificial Neural Networks, 1998, pp. 105-110.
[20] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Royal Holloway College, London, U.K., NeuroCOLT2 Tech. Rep. NC2-TR-1998-030, 1998.
[21] A. J. Smola, B. Schölkopf, and K.-R. Müller, "General cost functions for support vector regression," in Proc. Australian Congr. Neural Networks, 1998, pp. 78-83.
[22] M. Stitson, A. Gammerman, V. N. Vapnik, V. Vovk, C. Watkins, and J. Weston, "Support vector regression with ANOVA decomposition kernels," in Advances in Kernel Methods: Support Vector Learning, B. Schölkopf, C. Burges, and A. Smola, Eds. Cambridge, MA: MIT Press, 1999.
[23] J. A. K. Suykens, "Least squares support vector machines for classification and nonlinear modeling," in Proc. 7th Int. Workshop on Parallel Applications in Statistics and Economics, 2000, pp. 29-48.
[24] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.

James T. Kwok received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology in 1996. He then joined the Department of Computer Science, Hong Kong Baptist University, as an Assistant Professor. He returned to the Hong Kong University of Science and Technology in 2000 and is now an Assistant Professor in the Department of Computer Science. His research interests include kernel methods, artificial neural networks, pattern recognition, and machine learning.

Ivor W. Tsang received the B.Eng. degree in computer science from the Hong Kong University of Science and Technology (HKUST) in 2001. He is currently a Master's student at HKUST. He was the Honor Outstanding Student in 2001. His research interests include machine learning and kernel methods.