A Modified ν-SV Method For Simplified Regression

A. Shilton, M. Palaniswami
A. Shilton (apsh@ee.mu.oz.au), M. Palaniswami (swami@ee.mu.oz.au)
Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria 3010, Australia

Abstract: In the present paper we describe a new algorithm for Support Vector regression (SVR). Like standard ν-SVR algorithms, this algorithm automatically adjusts the radius of insensitivity (tube width ε) to fit the data. However, this is achieved without additional complexity in the optimisation problem. Moreover, by careful modification of the kernel function, we are able to significantly simplify the form of the dual SVR optimisation problem.

INTRODUCTION

Support Vector Machines are a relatively new class of learning machines that have evolved from the concepts of structural risk minimisation (SRM) (in the pattern recognition case) and regularisation theory (in the regression case). The major difference between Support Vector Machines (SVMs) and many other Neural Network (NN) approaches is that instead of tackling problems using the traditional empirical risk minimisation (ERM) method, SVMs use the concept of regularised ERM. This has enabled people to use SVMs with potentially huge capacities on smaller datasets without running into the usual difficulties of overfitting and poor generalisation performance.

Geometrically, the basis of SVM theory is to nonlinearly map the input data into some (possibly infinite dimensional) feature space, where the problem may be treated as a linear one. In particular, when tackling regression problems using SVMs, the output is a linear function of position in feature space. However, the complexities of this feature space (and the non-linear map associated with it) are hidden using a kernel function. It is this ability to hide complexity (resulting in a simple, linearly constrained quadratic programming problem with no non-global minima), along with the ability to use complex models while avoiding overfitting, that has made SVM methods so popular over recent years. In the
present paper, we give a new formulation for the SVM regression problem that, while incorporating many of the features of existing methods (in particular, the ν-Support Vector Regression (ν-SVR) methods), dispenses with much of the complexity that comes with these formulations. Specifically, by making the primal cost function entirely quadratic in nature, we are able to reduce the SVM training problem to a particularly simple quadratic programming problem with a unique global minimum, whose only constraints are the positivity of all variables.

ε-SV REGRESSION

The standard SV regression problem is formulated as follows. Suppose we are given a training set:

    Θ = { (x_1,z_1), (x_2,z_2), ..., (x_N,z_N) },   x_i ∈ R^dL, z_i ∈ R

which is assumed to have been generated based on some unknown but well defined map:

    ĝ : R^dL → R

so that:

    z_i = ĝ(x̂_i) + noise
    x_i = x̂_i + noise

We implicitly (as will be seen later) define a set of functions φ_j : R^dL → R, 1 ≤ j ≤ dH, which collectively make up a map from input space to feature space, namely φ : R^dL → R^dH,

    φ(x) = [ φ_1(x)  φ_2(x)  ...  φ_dH(x) ]^T

Using the functions φ_j, we aim to find a non-linear approximation to ĝ:

    g(x) = w^T φ(x) + b

which is a linear function of position in feature space. The usual ε-SVR method of selecting w and b is to minimise the regularised risk functional:

    min_{w,b,ξ,ξ*}  R(w,b,ξ,ξ*) = (1/2) w^T w + C 1^T ξ + C 1^T ξ*
    such that:  z_i - (w^T φ(x_i) + b) ≤ ε + ξ_i
                (w^T φ(x_i) + b) - z_i ≤ ε + ξ*_i
                ξ, ξ* ≥ 0                                             (1)

where (1/2) w^T w characterises the complexity of the model (the larger w^T w, the larger the gradient of g(x) in feature space, and hence the more g(x) may vary for a given variation in input x); and 1^T ξ + 1^T ξ* the empirical risk associated with it. The constant C > 0 controls the trade-off between empirical risk minimisation (and potential overfitting) if C is large, and complexity minimisation (and potential underfitting) if C is small. The constant ε > 0 term in (1) is included to give the model a degree of noise insensitivity. Essentially, so long as z_i - ε ≤ g(x_i) ≤ z_i + ε, ξ_i = ξ*_i = 0, and so there will be no empirical risk associated with small perturbations
(specifically, those perturbations which leave the point lying inside the ε tube), giving the risk functional a degree of noise immunity (assuming that ε is well matched to the noise present in the training data).

Using Lagrange multiplier techniques, it is straightforward to show that the dual form of (1) is:

    min_{α,α*}  L(α,α*) = (1/2) [α; α*]^T H [α; α*] - [α; α*]^T s
    such that:  0 ≤ α, α* ≤ C 1
                1^T (α - α*) = 0                                      (2)

where:

    s = [ z - ε1; -z - ε1 ],   H = [ G  -G; -G  G ],   G_{i,j} = K(x_i,x_j)

The function K(x_i,x_j) = φ(x_i)^T φ(x_j) is called the kernel function. It is not difficult to show that our approximation function g(x) may be written in terms of the kernel function:

    g(y) = Σ_i (α_i - α*_i) K(x_i,y) + b                              (3)

Note that the feature map functions φ_j : R^dL → R are hidden by the kernel function. It is well known that for any symmetric function K : R^dL × R^dL → R satisfying Mercer's condition [1], [4] there exists an associated set of feature maps φ_j : R^dL → R (although calculating what these maps are may not be a trivial exercise). Indeed, we may start with such a kernel function and, with no knowledge of φ at all (except that it exists), optimise and use an SV regressor. Feature space remains entirely hidden throughout the process.

Noting that each training vector corresponds to one pair (α_i, α*_i), and that one of these will be zero for all i, we may divide our training vectors into three distinct classes, namely:

- Non-support vectors: α_i = α*_i = 0, ξ_i = ξ*_i = 0.
- Boundary vectors: 0 < α_i < C or 0 < α*_i < C, ξ_i = ξ*_i = 0.
- Error vectors: α_i = C, ξ_i > 0 or α*_i = C, ξ*_i > 0.

Support vectors are any vectors which contribute to (3), i.e. both boundary and error vectors. We define N_S to be the number of support vectors in the training set, and N_E to be the number of error vectors. Non-support vectors are said to lie inside the ε-tube, boundary vectors on the edge of the ε-tube, and error vectors outside the ε-tube.

ν-SV REGRESSION

One difficulty with the ε-SVR is the selection of ε. Without some additional information about the noise contained in our training set (which is in general not available), the selection of ε becomes a matter of guesswork. To
overcome this problem, Schölkopf et al. [5] introduced the ν-SVR formulation of the problem, which includes an additional term in the primal problem to trade off the tube size (ε, no longer a constant) against model complexity and empirical risk. From [5], the primal formulation is:

    min_{w,b,ξ,ξ*,ε}  R(w,b,ξ,ξ*,ε) = (1/2) w^T w + Cνε + (C/N) (1^T ξ + 1^T ξ*)
    such that:  z_i - (w^T φ(x_i) + b) ≤ ε + ξ_i
                (w^T φ(x_i) + b) - z_i ≤ ε + ξ*_i
                ξ, ξ* ≥ 0, ε ≥ 0                                      (4)

where ν > 0 is a constant. The associated dual is:

    min_{α,α*}  L(α,α*) = (1/2) [α; α*]^T H [α; α*] - [α; α*]^T s
    such that:  0 ≤ α, α* ≤ (C/N) 1
                1^T (α - α*) = 0
                1^T (α + α*) ≤ Cν                                     (5)

where s = [ z; -z ], and H and G are as before.

The advantage of this form lies in the properties of the constant ν. It can be shown [5] that:

- ν is an upper bound on the fraction of error vectors in the training set.
- ν is a lower bound on the fraction of support vectors in the training set.

Hence ν may be selected without prior knowledge of the characteristics of the noise present in the training data, and the regressor will select ε to best fit this noise, based on the requirements ν makes of the fractions of error and support vectors.

MODIFIED ν-SV REGRESSION

One practical difficulty with (5) is the complexity of the constraint set, and in particular the presence of the upper bound constraint 1^T (α + α*) ≤ Cν. We would like to remove this constraint without losing the ability to automatically select ε based on another, more useful parameter, ν.

Consider the primal form of the standard ν-SV regression, (4). The term Cνε is effectively a linear regularisation term for the variable ε (in much the same way that (1/2) w^T w is a regularisation term for the variable w). We replace this linear regularisation term with a quadratic regularisation term, (1/2) Cν ε^2, to get the new regularised risk functional:

    min_{w,b,ξ,ξ*,ε}  R(w,b,ξ,ξ*,ε) = (1/2) w^T w + (1/2) Cν ε^2 + (C/N) (1^T ξ + 1^T ξ*)
    such that:  z_i - (w^T φ(x_i) + b) ≤ ε + ξ_i
                (w^T φ(x_i) + b) - z_i ≤ ε + ξ*_i
                ξ, ξ* ≥ 0                                             (6)

where, once again, ν > 0 is a constant. Note that this problem does not contain an inequality constraint ε ≥ 0. However, the optimal solution to (6) will satisfy this constraint. To see
why this must be the case, suppose that the optimal solution does not satisfy ε ≥ 0. Given this solution, if we replace ε < 0 with ε = 0 then clearly the inequality constraints in (6) will still be satisfied. Furthermore, R(w,b,ξ,ξ*,0) < R(w,b,ξ,ξ*,ε) for any ε < 0. Therefore any solution not satisfying ε ≥ 0 is not optimal, so a constraint of the form ε ≥ 0 would be superfluous.

Following the same approach as used for automatically biased SVMs [2], we may define an augmented feature space map as follows:

    φ_A(x,d) = [ φ(x); d ]

We also define the augmented weight vector:

    w_A = [ w; ε ]

Using these definitions, (6) may be re-written thus:

    min_{w_A,b,ξ,ξ*}  R(w_A,b,ξ,ξ*) = (1/2) w_A^T Q w_A + (C/N) (1^T ξ + 1^T ξ*)
    such that:  z_i - (w_A^T φ_A(x_i,+1) + b) ≤ ξ_i
                (w_A^T φ_A(x_i,-1) + b) - z_i ≤ ξ*_i
                ξ, ξ* ≥ 0
    where:      Q = [ I  0; 0^T  Cν ]                                 (7)

To form the dual problem, we introduce the non-negative Lagrange multipliers α_i and α*_i associated with the first two inequality constraints of (7), respectively. Following the usual methods, it is not difficult to show that:

    w = Σ_i (α_i - α*_i) φ(x_i)
    ε = (1/(Cν)) 1^T (α + α*)                                         (8)

which leads us to the dual formulation:

    min_{α,α*}  L(α,α*) = (1/2) [α; α*]^T H [α; α*] - [α; α*]^T s
    such that:  0 ≤ α, α* ≤ (C/N) 1
                1^T (α - α*) = 0                                      (9)

where:

    s = [ z; -z ],   H = [ G+  G-; G-  G+ ]
    G+ = G + (1/(Cν)) 1 1^T
    G- = -G + (1/(Cν)) 1 1^T
    G_{i,j} = K(x_i,x_j)

Considering (9), it should be noted that:

- The Hessian matrix H is positive semi-definite and the constraints are linear. Hence there will be no non-global minima.
- While the modified ν-SV regression method incorporates tube-shrinking into its design, the constraint set of (9) is no more complex than that of the standard ε-SVR dual, (2).

We now investigate the effect of the constant ν. Consider the primal form of the modified ν-SVR problem, namely (6). Suppose we have a solution (and in particular ε, ξ and ξ*) which we claim is optimal. If this claim is true then:

    ξ_i  = (z_i - g(x_i)) - ε   if α_i = C/N,  and 0 otherwise
    ξ*_i = (g(x_i) - z_i) - ε   if α*_i = C/N, and 0 otherwise        (10)

Furthermore, any change in ε (and the requisite modification of ξ and ξ* to satisfy (10)) should lead to an increase in R(w,b,ξ,ξ*,ε). Suppose we increase ε by some very small amount Δε > 0, while keeping w and b constant, and changing ξ and ξ* appropriately to satisfy (10).
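The augmented-kernel construction in the dual (9) is cheap to verify numerically. The sketch below (a minimal illustration in Python/NumPy; the Gaussian kernel, the toy inputs and the constants C and ν are assumptions for demonstration, not values taken from this paper) builds the blocks G+ and G- by adding the rank-one correction (1/(Cν)) 1 1^T to the ordinary kernel matrix, assembles the Hessian H, and confirms that H is symmetric positive semi-definite, so the dual has no non-global minima:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(20, 1))   # toy inputs (assumed, for illustration)

# Ordinary kernel matrix G_ij = K(x_i, x_j), here a Gaussian kernel (an assumption).
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-d2)

C, nu = 10.0, 0.5                          # illustrative constants
N = len(X)
ones = np.ones((N, N))

# Tube-shrinking enters the dual purely as a kernel correction: the primal
# term (1/2) C nu eps^2 becomes a rank-one (1/(C nu)) 1 1^T added to each block.
Gp = G + ones / (C * nu)
Gm = -G + ones / (C * nu)
H = np.block([[Gp, Gm], [Gm, Gp]])

# H is symmetric positive semi-definite: its quadratic form equals
# (a - a*)^T G (a - a*) + (1/(C nu)) (1^T (a + a*))^2 >= 0.
print(H.shape, bool(np.linalg.eigvalsh(H).min() > -1e-8))
```

Because the tube term appears only as this kernel correction, a solver written for the ordinary box- and equality-constrained ε-SVR dual can be reused unchanged.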
The change in R(w,b,ξ,ξ*,ε) will be, to first order:

    ΔR = ( Cνε - (C/N) N_E ) Δε

since the slack of each of the N_E error vectors decreases by Δε, while all other slacks remain 0. If our original solution was optimal, then ΔR ≥ 0 or, equivalently:

    ε ≥ (1/ν) (N_E / N)

Applying the same argument to a small decrease Δε < 0 in ε (which now increases the slack of every support vector, boundary vectors included), it is not difficult to see that:

    ΔR = ( Cνε - (C/N) N_S ) Δε

and thus, if the original solution were optimal:

    ε ≤ (1/ν) (N_S / N)

To summarise, for the modified ν-SV regressor described in this section, the following bounds hold:

    (1/ν) (N_E / N) ≤ ε ≤ (1/ν) (N_S / N)

That is, the constant ν controls a bounding relation between the tube width ε and the fractions of error and support vectors in the training set.

FURTHER SIMPLIFICATIONS

Automatic Biasing

We can remove the equality constraint in (9) by using the automatic biasing method of [3], [2]. We extend (6) by adding an additional term, (1/2) β b^2, to the regularised risk functional to get:

    min_{w,b,ξ,ξ*,ε}  R(w,b,ξ,ξ*,ε) = (1/2) w^T w + (1/2) Cν ε^2 + (1/2) β b^2 + (C/N) (1^T ξ + 1^T ξ*)
    such that:  z_i - (w^T φ(x_i) + b) ≤ ε + ξ_i
                (w^T φ(x_i) + b) - z_i ≤ ε + ξ*_i
                ξ, ξ* ≥ 0                                             (11)
where β > 0 is a constant. Our augmented feature map becomes:

    φ_A(x,d) = [ φ(x); d; 1 ]

and likewise the augmented weight vector is:

    w_A = [ w; ε; b ]

Hence we can re-write (11) thus:

    min_{w_A,ξ,ξ*}  R(w_A,ξ,ξ*) = (1/2) w_A^T Q w_A + (C/N) (1^T ξ + 1^T ξ*)
    such that:  z_i - w_A^T φ_A(x_i,+1) ≤ ξ_i
                w_A^T φ_A(x_i,-1) - z_i ≤ ξ*_i
                ξ, ξ* ≥ 0
    where:      Q = diag( I, Cν, β )

Following the usual method, we find that:

    w = Σ_i (α_i - α*_i) φ(x_i)
    ε = (1/(Cν)) 1^T (α + α*)
    b = (1/β) 1^T (α - α*)                                            (12)

Hence the dual problem becomes:

    min_{α,α*}  L(α,α*) = (1/2) [α; α*]^T H [α; α*] - [α; α*]^T s
    such that:  0 ≤ α, α* ≤ (C/N) 1                                   (13)

where s is unchanged from the previous section and:

    H = [ G+  G-; G-  G+ ]
    G+ = G + (1/(Cν) + 1/β) 1 1^T
    G- = -G + (1/(Cν) - 1/β) 1 1^T
    G_{i,j} = K(x_i,x_j)

which is essentially the same as (9), except that there is no equality constraint present.

Quadratic Empirical Risk

Finally, we may remove the upper bound constraints on α and α* (while at the same time making H positive definite) by modifying the form of the empirical risk component of our regularised risk functional as follows:

    min_{w,b,ξ,ξ*,ε}  R(w,b,ξ,ξ*,ε) = (1/2) w^T w + (1/2) β b^2 + (1/2) Cν ε^2 + (C/(2N)) ξ^T ξ + (C/(2N)) ξ*^T ξ*
    such that:  z_i - (w^T φ(x_i) + b) ≤ ε + ξ_i
                (w^T φ(x_i) + b) - z_i ≤ ε + ξ*_i                     (14)

Note that (14) contains no lower bound on ξ, ξ*. This is unnecessary for the same reason that a lower bound on ε is unnecessary (i.e. such a constraint would be superfluous). In this case, our augmented feature map is:

    φ_A(x,d,i,j) = [ φ(x); d; 1; d e_i; d e_j ]

where the vector e_i ∈ R^N consists of all zero elements except for the i-th element, which is 1 (and hence e_0 = 0). We also define the augmented weight vector:

    w_A = [ w; ε; b; ξ; ξ* ]

Using these definitions, (14) may be re-written as:

    min_{w_A}  R(w_A) = (1/2) w_A^T Q w_A
    such that:  w_A^T φ_A(x_i,+1,i,0) ≥ z_i
                w_A^T φ_A(x_i,-1,0,i) ≤ z_i                           (15)
    where:      Q = diag( I, Cν, β, (C/N) I, (C/N) I )

Following the usual method, we find that the dual problem becomes:

    min_{α,α*}  L(α,α*) = (1/2) [α; α*]^T H [α; α*] - [α; α*]^T s
    such that:  α, α* ≥ 0                                             (16)

where s is unchanged from the previous section,

    H = [ G+  G-; G-  G+ ]
    G+ = G + (1/(Cν) + 1/β) 1 1^T + (N/C) I
    G- = -G + (1/(Cν) - 1/β) 1 1^T
    G_{i,j} = K(x_i,x_j)

and where:

    w = Σ_i (α_i - α*_i) φ(x_i)
    ε = (1/(Cν)) 1^T (α + α*)
    b = (1/β) 1^T (α - α*)
    ξ = (N/C) α
    ξ* = (N/C) α*                                                     (17)

This is essentially the same as (13), except that there are no upper bounds present and, furthermore, H is positive
definite, implying that there is a unique global solution. One possible disadvantage of this form is a slight decrease in the sparsity of the solution if the training set is degenerate.

The bounds ε ≥ (1/ν)(N_E/N) and ε ≤ (1/ν)(N_S/N) no longer apply to this formulation. Both are superseded by a stricter, equality relation. Before proceeding to derive this relation, we must first define the following quantity:

    ē = (1/N) Σ_{i=1}^{N} (ξ_i + ξ*_i)

ē is the mean error distance (outside the ε tube) of all training points (including non-error points, which are defined as lying at a distance of 0 from the ε-tube). Using (17), it is easy to see that the following relation will hold for the form of modified ν-SV regressor described in the present section:

    ε = ē / ν

So, for modified ν-SV regressors using quadratic empirical risk, as described in the present section, there is a direct proportional relationship between the mean error of all training points and the width of the ε-tube. Furthermore, the constant of proportionality may be set exactly using the training constant ν.

EXPERIMENTAL RESULTS

All code for this experiment was written in C++ and compiled using DJGPP, running on a Pentium III PC under Windows. When considering modified ν-SVRs we have exclusively used form (16) (auto-biasing, quadratic empirical risk) for reasons of implementational simplicity.

To evaluate the effectiveness of our algorithm, following [5] we have chosen the task of using regression to estimate a noisy sinc function from N samples (x_i, z_i), where x_i is drawn uniformly from [-3,3], z_i = sin(πx_i)/(πx_i) + v_i, and v_i is drawn from a Gaussian with zero mean and variance σ^2. Unless otherwise stated, the same values of N, C and σ and the same Gaussian kernel K(x,y) are used throughout.

Figure 1. Modified ν-SVR, first value of ν (ε selected automatically).
Figure 2. Modified ν-SVR, second value of ν (ε selected automatically).
Figure 3. Modified ν-SVR, first noise level σ (ε selected automatically).

Figures 1 and 2 show the results achieved using a modified ν-SVR for two different values of ν. In each case the algorithm automatically selects a corresponding tube width ε (the values obtained are given in the figure captions). This demonstrates the similarities between
the role of ν in standard and modified ν-SVR methods. Figures 3 and 4 show the results achieved using a modified ν-SVR with the same value of ν when dealing with training data at two different noise levels σ (figure 3 and figure 4, respectively). In both cases, the tube width ε is able to adjust automatically to fit the noise present in the problem, just as occurs in the standard ν-SVR approach. For comparison, figures 5 and 6 show the results achieved using a standard ε-SVR method with a fixed value of ε, which is not optimal in either case. In figure 5, the estimate is biased, and in figure 6, ε does not match the noise (cf. [5]).

Discussion
Figure 4. Modified ν-SVR, second noise level σ (ε selected automatically).

From these results, and [5], it would appear that ν-SVR and modified ν-SVR methods exhibit many similar characteristics, and produce results that are essentially identical to one another. It is natural then to ask why we feel it is worthwhile to modify ν-SVR methods in the first place. The answer, we believe, lies in the relative simplicity of the dual forms of the modified ν-SVR methods ((9), (13) and (16)) when compared with the standard ν-SVR dual (5). It is our experience that the presence of the inequality constraint 1^T (α + α*) ≤ Cν significantly complicates the task of solving the dual optimisation problem. This difficulty is particularly noticeable when attempting to develop an incremental approach to the problem of ν-SV regression. Our results suggest that it is possible to achieve results comparable with those achieved using ν-SVR methods without the added computational complexity typically associated with such methods.

Figure 5. Standard ε-SVR with fixed ε, first noise level σ.

CONCLUSION

We have presented a modified ν-SVR algorithm which, like standard ν-SVR methods, is able to automatically select the tube width ε. However, we have achieved this without the additional computational complexity usually associated with ν-SVR methods. We have demonstrated on a sample problem that our method is able to achieve results which are essentially identical to those of the standard ν-SVR method, demonstrating that this reduction in computational complexity has not been accompanied by any significant degradation in regressor performance.

REFERENCES

[1] C. Cortes and V. Vapnik, "Support Vector Networks", Machine Learning, vol. 20, pp. 273-297, 1995.
[2] D. Lai, M. Palaniswami, N. Mani, "Fast Linear Stationarity Methods For Automatically Biased Support Vector Machines", in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2003.
[3] O. Mangasarian, D. Musicant, "Successive Overrelaxation for Support Vector Machines", IEEE Transactions on Neural Networks, vol. 10, issue 5, Sept. 1999.
[4] J. Mercer, "Functions of Positive and Negative Type, and their Connection with the Theory of Integral
Equations", Transactions of the Royal Society of London, Series A, vol. 209, October 1909.
[5] B. Schölkopf, P. Bartlett, A. Smola, R. Williamson, "Shrinking the Tube: A New Support Vector Regression Algorithm", Advances in Neural Information Processing Systems, vol. 11, 1999.

Figure 6. Standard ε-SVR with fixed ε, second noise level σ.
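To make the simplicity claims above concrete: because the quadratic-risk dual (16) has a positive definite Hessian and only positivity constraints, it can be solved with plain projected gradient descent. The following sketch (Python/NumPy rather than the authors' C++ implementation; the Gaussian kernel, the noisy-sinc data and all constants are illustrative assumptions) fits the auto-biased, quadratic-risk modified ν-SVR and checks the equality relation ε = ē/ν:

```python
import numpy as np

rng = np.random.default_rng(1)
N, C, nu, beta, sigma = 50, 10.0, 0.5, 1.0, 0.1   # illustrative constants
x = rng.uniform(-3.0, 3.0, N)
z = np.sinc(x) + rng.normal(0.0, sigma, N)        # np.sinc(x) = sin(pi x)/(pi x)

# Kernel matrix G and the augmented blocks of the dual (16).
G = np.exp(-(x[:, None] - x[None, :]) ** 2)       # Gaussian kernel (an assumption)
ones = np.ones((N, N))
Gp = G + (1.0 / (C * nu) + 1.0 / beta) * ones + (N / C) * np.eye(N)
Gm = -G + (1.0 / (C * nu) - 1.0 / beta) * ones
H = np.block([[Gp, Gm], [Gm, Gp]])                # strictly positive definite
s = np.concatenate([z, -z])

# Projected gradient descent: minimise 0.5 a^T H a - s^T a subject to a >= 0.
a = np.zeros(2 * N)
eta = 1.0 / np.linalg.eigvalsh(H).max()
for _ in range(20000):
    a = np.maximum(0.0, a - eta * (H @ a - s))

alpha, alpha_star = a[:N], a[N:]
b = (alpha - alpha_star).sum() / beta             # b   = (1/beta) 1^T (alpha - alpha*)
eps = (alpha + alpha_star).sum() / (C * nu)       # eps = (1/(C nu)) 1^T (alpha + alpha*)
xi, xi_star = (N / C) * alpha, (N / C) * alpha_star
e_bar = (xi + xi_star).mean()                     # mean error distance over all points

g = G @ (alpha - alpha_star) + b                  # fitted values at the training points
print(eps > 0.0, np.isclose(eps, e_bar / nu))
```

The relation ε = ē/ν follows directly from the stationarity conditions (17), so the final check holds up to floating-point error.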
Minerva Access is the Institutional Repository of The University of Melbourne.

Author/s: Shilton, A; Palaniswami, M
Title: Modified nu-SV Method for Simplified Regression
Date: 2004
Citation: Shilton, A., & Palaniswami, M. (2004). Modified nu-SV Method for Simplified Regression, in Proceedings, International Conference on Intelligent Sensing and Information Processing, Chennai, India.
Publication Status: Published
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More informationNeural Networks. Prof. Dr. Rudolf Kruse. Computational Intelligence Group Faculty for Computer Science
Neural Networks Prof. Dr. Rudolf Kruse Computational Intelligence Group Faculty for Computer Science kruse@iws.cs.uni-magdeburg.de Rudolf Kruse Neural Networks 1 Supervised Learning / Support Vector Machines
More informationAn Exact Solution to Support Vector Mixture
An Exact Solution to Support Vector Mixture Monjed Ezzeddinne, icolas Lefebvre, and Régis Lengellé Abstract This paper presents a new version of the SVM mixture algorithm initially proposed by Kwo for
More informationSupport Vector Machines
Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.
More informationCOS 424: Interacting with Data. Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007
COS 424: Interacting ith Data Lecturer: Rob Schapire Lecture #15 Scribe: Haipeng Zheng April 5, 2007 Recapitulation of Last Lecture In linear regression, e need to avoid adding too much richness to the
More informationIntroduction to SVM and RVM
Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB Overview Support vector machine SVM First introduced by Vapnik, et al. 1992 Several literature and wide applications Relevance
More informationBivariate Uniqueness in the Logistic Recursive Distributional Equation
Bivariate Uniqueness in the Logistic Recursive Distributional Equation Antar Bandyopadhyay Technical Report # 629 University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860
More informationConsider this problem. A person s utility function depends on consumption and leisure. Of his market time (the complement to leisure), h t
VI. INEQUALITY CONSTRAINED OPTIMIZATION Application of the Kuhn-Tucker conditions to inequality constrained optimization problems is another very, very important skill to your career as an economist. If
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationGRADIENT DESCENT. CSE 559A: Computer Vision GRADIENT DESCENT GRADIENT DESCENT [0, 1] Pr(y = 1) w T x. 1 f (x; θ) = 1 f (x; θ) = exp( w T x)
0 x x x CSE 559A: Computer Vision For Binary Classification: [0, ] f (x; ) = σ( x) = exp( x) + exp( x) Output is interpreted as probability Pr(y = ) x are the log-odds. Fall 207: -R: :30-pm @ Lopata 0
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationKernel Methods. Machine Learning A W VO
Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance
More informationRegression coefficients may even have a different sign from the expected.
Multicolinearity Diagnostics : Some of the diagnostics e have just discussed are sensitive to multicolinearity. For example, e kno that ith multicolinearity, additions and deletions of data cause shifts
More informationLinear Dependency Between and the Input Noise in -Support Vector Regression
544 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 14, NO. 3, MAY 2003 Linear Dependency Between the Input Noise in -Support Vector Regression James T. Kwok Ivor W. Tsang Abstract In using the -support vector
More informationSupport'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan
Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination
More informationThe Probability of Pathogenicity in Clinical Genetic Testing: A Solution for the Variant of Uncertain Significance
International Journal of Statistics and Probability; Vol. 5, No. 4; July 2016 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education The Probability of Pathogenicity in Clinical
More informationLong run input use-input price relations and the cost function Hessian. Ian Steedman Manchester Metropolitan University
Long run input use-input price relations and the cost function Hessian Ian Steedman Manchester Metropolitan University Abstract By definition, to compare alternative long run equilibria is to compare alternative
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More information3-D FINITE ELEMENT MODELING OF THE REMOTE FIELD EDDY CURRENT EFFECT
3-D FINITE ELEMENT MODELING OF THE REMOTE FIELD EDDY CURRENT EFFECT Y. Sun and H. Lin Department of Automatic Control Nanjing Aeronautical Institute Nanjing 2116 People's Republic of China Y. K. Shin,
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationLinear Regression Linear Regression with Shrinkage
Linear Regression Linear Regression ith Shrinkage Introduction Regression means predicting a continuous (usually scalar) output y from a vector of continuous inputs (features) x. Example: Predicting vehicle
More informationHomework Set 2 Solutions
MATH 667-010 Introduction to Mathematical Finance Prof. D. A. Edards Due: Feb. 28, 2018 Homeork Set 2 Solutions 1. Consider the ruin problem. Suppose that a gambler starts ith ealth, and plays a game here
More informationSupport Vector Machine (continued)
Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More informationEcon 201: Problem Set 3 Answers
Econ 20: Problem Set 3 Ansers Instructor: Alexandre Sollaci T.A.: Ryan Hughes Winter 208 Question a) The firm s fixed cost is F C = a and variable costs are T V Cq) = 2 bq2. b) As seen in class, the optimal
More informationCS798: Selected topics in Machine Learning
CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning
More informationAn Improved Driving Scheme in an Electrophoretic Display
International Journal of Engineering and Technology Volume 3 No. 4, April, 2013 An Improved Driving Scheme in an Electrophoretic Display Pengfei Bai 1, Zichuan Yi 1, Guofu Zhou 1,2 1 Electronic Paper Displays
More informationNON-FIXED AND ASYMMETRICAL MARGIN APPROACH TO STOCK MARKET PREDICTION USING SUPPORT VECTOR REGRESSION. Haiqin Yang, Irwin King and Laiwan Chan
In The Proceedings of ICONIP 2002, Singapore, 2002. NON-FIXED AND ASYMMETRICAL MARGIN APPROACH TO STOCK MARKET PREDICTION USING SUPPORT VECTOR REGRESSION Haiqin Yang, Irwin King and Laiwan Chan Department
More informationTANGENT SPACE INTRINSIC MANIFOLD REGULARIZATION FOR DATA REPRESENTATION. Shiliang Sun
TANGENT SPACE INTRINSIC MANIFOLD REGULARIZATION FOR DATA REPRESENTATION Shiliang Sun Department o Computer Science and Technology, East China Normal University 500 Dongchuan Road, Shanghai 200241, P. R.
More information3 - Vector Spaces Definition vector space linear space u, v,
3 - Vector Spaces Vectors in R and R 3 are essentially matrices. They can be vieed either as column vectors (matrices of size and 3, respectively) or ro vectors ( and 3 matrices). The addition and scalar
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationEnergy Minimization via a Primal-Dual Algorithm for a Convex Program
Energy Minimization via a Primal-Dual Algorithm for a Convex Program Evripidis Bampis 1,,, Vincent Chau 2,, Dimitrios Letsios 1,2,, Giorgio Lucarelli 1,2,,, and Ioannis Milis 3, 1 LIP6, Université Pierre
More informationData Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396
Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction
More informationKernels and the Kernel Trick. Machine Learning Fall 2017
Kernels and the Kernel Trick Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem Support vectors, duals and kernels
More informationMachine Learning and Data Mining. Support Vector Machines. Kalev Kask
Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems
More informationbe a deterministic function that satisfies x( t) dt. Then its Fourier
Lecture Fourier ransforms and Applications Definition Let ( t) ; t (, ) be a deterministic function that satisfies ( t) dt hen its Fourier it ransform is defined as X ( ) ( t) e dt ( )( ) heorem he inverse
More informationA NEW MULTI-CLASS SUPPORT VECTOR ALGORITHM
Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 18 A NEW MULTI-CLASS SUPPORT VECTOR ALGORITHM PING ZHONG a and MASAO FUKUSHIMA b, a Faculty of Science, China Agricultural University, Beijing,
More informationSMO Algorithms for Support Vector Machines without Bias Term
Institute of Automatic Control Laboratory for Control Systems and Process Automation Prof. Dr.-Ing. Dr. h. c. Rolf Isermann SMO Algorithms for Support Vector Machines without Bias Term Michael Vogt, 18-Jul-2002
More informationROBUST LINEAR DISCRIMINANT ANALYSIS WITH A LAPLACIAN ASSUMPTION ON PROJECTION DISTRIBUTION
ROBUST LINEAR DISCRIMINANT ANALYSIS WITH A LAPLACIAN ASSUMPTION ON PROJECTION DISTRIBUTION Shujian Yu, Zheng Cao, Xiubao Jiang Department of Electrical and Computer Engineering, University of Florida 0
More informationScience Degree in Statistics at Iowa State College, Ames, Iowa in INTRODUCTION
SEQUENTIAL SAMPLING OF INSECT POPULATIONSl By. G. H. IVES2 Mr.. G. H. Ives obtained his B.S.A. degree at the University of Manitoba in 1951 majoring in Entomology. He thm proceeded to a Masters of Science
More informationMachine Learning & Data Mining CS/CNS/EE 155. Lecture 4: Regularization, Sparsity & Lasso
Machine Learning Data Mining CS/CS/EE 155 Lecture 4: Regularization, Sparsity Lasso 1 Recap: Complete Pipeline S = {(x i, y i )} Training Data f (x, b) = T x b Model Class(es) L(a, b) = (a b) 2 Loss Function,b
More informationSUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS
SUPPORT VECTOR REGRESSION WITH A GENERALIZED QUADRATIC LOSS Filippo Portera and Alessandro Sperduti Dipartimento di Matematica Pura ed Applicata Universit a di Padova, Padova, Italy {portera,sperduti}@math.unipd.it
More informationArtificial Neural Networks. Part 2
Artificial Neural Netorks Part Artificial Neuron Model Folloing simplified model of real neurons is also knon as a Threshold Logic Unit x McCullouch-Pitts neuron (943) x x n n Body of neuron f out Biological
More information5.6 Nonparametric Logistic Regression
5.6 onparametric Logistic Regression Dmitri Dranishnikov University of Florida Statistical Learning onparametric Logistic Regression onparametric? Doesnt mean that there are no parameters. Just means that
More informationAdvanced Introduction to Machine Learning CMU-10715
Advanced Introduction to Machine Learning CMU-10715 Risk Minimization Barnabás Póczos What have we seen so far? Several classification & regression algorithms seem to work fine on training datasets: Linear
More informationSupport Vector Machines
Supervised Learning Methods -nearest-neighbors (-) Decision trees eural netors Support vector machines (SVM) Support Vector Machines Chapter 18.9 and the optional paper Support vector machines by M. Hearst,
More information