Department of Electrical and Computer Systems Engineering


Department of Electrical and Computer Systems Engineering
Technical Report MECSE--

A Monomial $\nu$-SV Method for Regression
A. Shilton, D. Lai and M. Palaniswami

A Monomial ν-SV Method For Regression
A. Shilton, D. Lai, M. Palaniswami, Senior Member, IEEE

Abstract: In the present paper we describe a new formulation for Support Vector regression (SVR), namely monomial ν-SVR. Like the standard ν-SVR, the monomial ν-SVR method automatically adjusts the radius of insensitivity (the tube width ǫ) to suit the training data. However, by replacing Vapnik's ǫ-insensitive cost with a more general monomial ǫ-insensitive cost, and likewise replacing the linear tube shrinking term with a monomial tube shrinking term, the performance of the monomial ν-SVR is improved for data corrupted by a wider range of noise distributions. We focus on the quadric form of the monomial ν-SVR and show that its dual form is simpler than that of the standard ν-SVR. We show that, like Suykens' Least-Squares SVR (LS-SVR) method and unlike the standard ν-SVR, the quadric ν-SVR dual has a unique global solution. Comparisons are made between the asymptotic efficiency of our method and that of standard ν-SVR and LS-SVR which demonstrate the superiority of our method for the special case of higher order polynomial noise. These theoretical predictions are validated using experimental comparisons with the alternative approaches of standard ν-SVR, LS-SVR and weighted LS-SVR.

I. INTRODUCTION

Support Vector regressors (SVRs) [1] [2] [3] are a class of non-linear regressors inspired by Vapnik's SV methods for pattern classification [4] [5]. Like Vapnik's method, SVRs first implicitly map all data into a (usually higher dimensional) feature space. In this feature space, the SVR attempts to construct a linear function of position that mimics the relationship between input (position in feature space) and output observed in the training data by minimising a measure of the empirical risk. To prevent overfitting, a regularisation term is included to bias the result toward functions with smaller gradient in feature space.

Two major advantages that SVRs have over competing methods (unregularised least-squares methods, for example) are sparseness and simplicity [3] [6]. SVRs are able to give accurate results based only on a sparse subset of the complete training set, making them ideal for problems with large training sets. Moreover, such results are achievable without excessive algorithmic complexity, and use of the kernel trick makes the dual form of the SVR problem particularly simple.

Roughly speaking, SVR methods may be broken into ǫ-SVR [1] [2] and ν-SVR methods [7] [8], both of which require a-priori selection of certain parameters. Of particular interest is the ǫ (or ν, in ν-SVR methods) parameter, which controls the sensitivity of the SVR to the presence of noise in the training data. In both cases this parameter controls the threshold ǫ (directly for ǫ-SVR, indirectly for ν-SVR) of insensitivity of the cost function to noise through use of Vapnik's ǫ-insensitive loss function. The standard ǫ-SVR approach is associated with a simple dual problem, but unfortunately selection of ǫ requires knowledge of the noise present in the training data (its variance in particular), which may not be available [9]. Conversely, the standard ν-SVR method has a more complex dual form, but has the advantage that selection of ν requires less knowledge of the noise process [9] (only the form of the noise is required, not the variance). Thus both forms have certain difficulties associated with them.

Yet another approach is that of Suykens' LS-SVR [10], which uses the normal least-squares cost function with an added regularisation term inspired by Vapnik's original SV method.
The two main advantages of this approach are the simplicity of the resulting dual cost function, which is even simpler than that of ǫ-SVR, and having one less constant to choose a-priori. The disadvantages include loss of sparsity and robustness in the solution. These problems may be ameliorated somewhat through use of a weighted LS-SVR scheme [11]. However, while this method is noticeably superior when extreme outliers are present in the training data, in our experience the performance of the weighted LS-SVR may not be significantly better than that of the standard LS-SVR if such outliers are not present.

In view of the shortcomings of these approaches, we present a modification of Smola's ν-SVR method (monomial ν-SVR). Our approach retains the feature that ν may be selected without knowledge of the variance of the noise present in the training data. For the special case of quadric ν-SVR (second order monomial ν-SVR) [12], the associated dual optimisation problem is simpler than that of the standard ν-SVR method. Furthermore, we show that the quadric ν-SVR method is able to out-perform both standard ν-SVR and LS-SVR (weighted or otherwise) in several cases, for example in the presence of higher order polynomial noise.

We begin in section II by reviewing the standard ǫ-SVR method and its properties. Next, using the theory of maximum-likelihood estimation as motivation, we present a modification of this method using a new monomial ǫ-insensitive cost function (monomial ǫ-SVR). Concentrating on the quadric form of this new cost function, we form the dual and show that it is no more complex than the standard ǫ-SVR dual. In subsection II-C we consider the asymptotic efficiency of our method in comparison to the standard ǫ-SVR and LS-SVR. We also address the issue of selecting ǫ to maximise this efficiency. Finally, in subsection II-D, we analyze the sparsity of the various methods.

In section III we further explore the properties of the standard ν-SVR method, and in particular its property of insensitivity to the variance of the noise present in the training data. We then apply the obvious extension of this approach to the monomial ǫ-SVR formulation introduced previously to produce a monomial ν-SVR method, and show that the dual form of the quadric ν-SVR is actually less complex than the dual form of the standard ν-SVR. We then consider the problem of optimal ν selection for both our method and standard ν-SVR, and show that the property of noise variance insensitivity when selecting this constant carries over from the standard ν-SVR to the monomial ν-SVR. Finally, in subsection III-C, we consider the issue of sparsity for the standard ν-SVR, monomial ν-SVR and LS-SVR methodologies.

In order to gain a better feel for the problem, in section IV we consider the particular case of training data affected by polynomial noise of degree 1 to 6. In particular, we compare the theoretical efficiencies and the optimal values of ǫ and ν for our method against other approaches. Finally, in section V, we consider a model problem where various orders of polynomial noise are added to the training data, and compare the results achieved with our theoretical predictions. We show that the results fit the predictions within tolerable accuracy. Moreover, we show that the predictions for the parameter ν provide a worthwhile first guess of the actual optimal value.

(A. Shilton and M. Palaniswami are with the Centre of Expertise on Networked Decision and Sensor Systems, Department of Electrical and Electronic Engineering, The University of Melbourne, Victoria, Australia, {ash,swami}@ee.mu.oz.au. D. Lai is currently with the Department of Electrical and Computer Systems Engineering, Monash University, Victoria 3168, Australia, daniel.lai@eng.monash.edu.au.)

II. ǫ-SV REGRESSION

The regression problem may be formulated as follows. Given a training set:

\Theta = \{ (x_1, z_1), (x_2, z_2), \ldots, (x_N, z_N) \}, \qquad x_i \in \mathbb{R}^{d_L}, \; z_i \in \mathbb{R}

where z_i = \hat{g}(x_i) + \mathrm{noise} for some \hat{g}: \mathbb{R}^{d_L} \to \mathbb{R}, and the x_i are drawn in an i.i.d. manner from an unknown distribution, construct an approximation g: \mathbb{R}^{d_L} \to \mathbb{R} of \hat{g}. An approximation g constructed for a given training set Θ is called a trained machine, and the construction process training. We assume that all noise sources (e.g. measurement noise, system noise etc.) are smooth, i.i.d. and zero mean.

In the SV approach [4], it is usual to define (implicitly, as will be seen later) a set of functions \varphi_j: \mathbb{R}^{d_L} \to \mathbb{R}, 1 \le j \le d_H, which collectively form a map from input space to feature space, \varphi: \mathbb{R}^{d_L} \to \mathbb{R}^{d_H}, where \varphi(x) = ( \varphi_1(x), \varphi_2(x), \ldots, \varphi_{d_H}(x) ). Using this map, the trained machine is defined to be:

g(x) = w^T \varphi(x) + b

which is a linear function of position in feature space. In the ǫ-SVR framework, w and b are selected to minimise the regularised risk functional:

R(w, b) = \frac{1}{2} w^T w + \frac{C}{N} \sum_{(x_i, z_i) \in \Theta} | g(x_i) - z_i |_\epsilon

where | \cdot |_\epsilon = \max( | \cdot | - \epsilon, 0 ) is Vapnik's ǫ-insensitive loss function (ǫ ≥ 0 a constant). In this expression, the first term \frac{1}{2} w^T w characterises the complexity of the model (the larger w^T w, the larger the gradient of g(x) in feature space, and hence the more g(x) may vary for a given variation in the input x), while the second term is a measure of the empirical risk associated with the training set when this model is applied. The constant C > 0 controls the trade-off between empirical risk minimisation (and potential overfitting, if C is large) and complexity minimisation (and potential underfitting, if C is small).

An important property of this risk functional is that errors of magnitude less than ǫ do not contribute to the cost R(w, b). Assuming ǫ is well matched to the noise present in the training data (an issue we will return to shortly), this should lend a degree of noise insensitivity to the cost function [3]. For convenience, the risk is usually expressed in terms of non-negative slack variables ξ, ξ*.
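To make the loss concrete, the following minimal Python sketch (ours, not part of the original paper) evaluates Vapnik's ǫ-insensitive loss and the regularised risk functional above; the explicit feature matrix Phi is an illustrative assumption only, since a kernelised solver never forms φ explicitly:

```python
import numpy as np

def eps_insensitive(r, eps):
    """Vapnik's epsilon-insensitive loss |r|_eps = max(|r| - eps, 0)."""
    return np.maximum(np.abs(r) - eps, 0.0)

def regularised_risk(w, b, Phi, z, C, eps):
    """R(w, b) = 0.5 w'w + (C/N) sum_i |g(x_i) - z_i|_eps, where
    g(x_i) = w'phi(x_i) + b and Phi is the N x d_H matrix of feature vectors."""
    N = len(z)
    residuals = Phi @ w + b - z
    return 0.5 * w @ w + (C / N) * eps_insensitive(residuals, eps).sum()
```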
Using this notation, the primal form of the ǫ-SVR training problem is:

\min_{w, b, \xi, \xi^*} R(w, b, \xi, \xi^*) = \frac{1}{2} w^T w + \frac{C}{N} \sum_{i=1}^{N} ( \xi_i + \xi_i^* )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*
\xi, \xi^* \ge \mathbf{0}

For reasons of mathematical tractability, it is usual to deal with the dual of this problem, which may be constructed as follows. With each of the inequality constraints we associate a non-negative Lagrange multiplier (respectively α_i, α_i*, γ_i and γ_i* for each i), noting that this gives a one-to-one correspondence between the training pair (x_i, z_i) and the Lagrange multipliers (α_i, α_i*); furthermore, at the solution at least one of α_i and α_i* will be zero for each i.

Using the usual techniques it is straightforward to show that the dual form of this problem is:

\min_{\alpha, \alpha^*} L(\alpha, \alpha^*) = \frac{1}{2} (\alpha - \alpha^*)^T G (\alpha - \alpha^*) - (\alpha - \alpha^*)^T z + \epsilon (\alpha + \alpha^*)^T \mathbf{1}

such that:

\mathbf{0} \le \alpha \le \tfrac{C}{N} \mathbf{1}, \quad \mathbf{0} \le \alpha^* \le \tfrac{C}{N} \mathbf{1}, \quad \mathbf{1}^T (\alpha - \alpha^*) = 0

where G_{i,j} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j), and K: \mathbb{R}^{d_L} \times \mathbb{R}^{d_L} \to \mathbb{R} is the kernel function associated with the map \varphi: \mathbb{R}^{d_L} \to \mathbb{R}^{d_H}. The trained machine g may be written in terms of the kernel function:

g(y) = \sum_i (\alpha_i - \alpha_i^*) K(x_i, y) + b

The bias b is calculated indirectly, using the fact that g(x_i) = z_i - \epsilon for all i such that 0 < \alpha_i < \frac{C}{N}, and likewise g(x_i) = z_i + \epsilon for all i such that 0 < \alpha_i^* < \frac{C}{N}. At no point during training or use of the trained machine is knowledge of the exact form of the map \varphi: \mathbb{R}^{d_L} \to \mathbb{R}^{d_H} required. Only the kernel function K is required, and any symmetric function K satisfying Mercer's condition [13] can be shown to be sufficient for the task [14].

Noting that each training vector corresponds to one pair (α_i, α_i*), and that at least one of these will be zero for all i, we may divide our training vectors into three distinct classes, namely:

Non-support vectors: \alpha_i = \alpha_i^* = 0, \xi_i = \xi_i^* = 0.
Boundary vectors: \alpha_i (or \alpha_i^*) \in (0, \frac{C}{N}), \xi_i = \xi_i^* = 0.
Error vectors: \alpha_i = \frac{C}{N}, \xi_i > 0 (or \alpha_i^* = \frac{C}{N}, \xi_i^* > 0).

Support vectors are any vectors which contribute to the kernel expansion of g, i.e. both boundary and error vectors. We define S to be the number of support vectors in the training set, E the number of error vectors and B the number of boundary vectors, so S = B + E. Non-support vectors are said to lie inside the ǫ-tube, boundary vectors on the edge of the ǫ-tube and error vectors outside the ǫ-tube. We will require the following theorem later:

Theorem 1 (see [14]): If the distribution from which the measured outputs \{ z_1, z_2, \ldots, z_N \} are drawn is smooth then:

\lim_{N \to \infty} \frac{S}{N} = \lim_{N \to \infty} \frac{E}{N}
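As an illustrative sketch (ours, assuming the Gaussian RBF kernel used later in section V), the Gram matrix and the kernel expansion of the trained machine can be computed as follows; alpha, alpha_star and b are taken as given, e.g. from a dual solver:

```python
import numpy as np

def rbf_kernel(X, Y, sigma_kernel=1.0):
    """Gaussian RBF kernel K(x, y) = exp(-||x - y||^2 / sigma_kernel^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma_kernel ** 2)

def svr_predict(X_train, alpha, alpha_star, b, Y, sigma_kernel=1.0):
    """Trained machine g(y) = sum_i (alpha_i - alpha_i*) K(x_i, y) + b."""
    K = rbf_kernel(X_train, Y, sigma_kernel)   # shape (N_train, N_test)
    return (alpha - alpha_star) @ K + b
```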
A. Monomial ǫ-SV Regression

Consider the case ǫ = 0. If C is sufficiently large, and assuming that the empirical risk is non-zero, the second (empirical risk) term will be much larger than the first (regularisation) term. Hence, in this case:

R(w, b) \approx \frac{C}{N} \sum_{(x_i, z_i) \in \Theta} | g(x_i) - z_i |

But this is just the max-log-likelihood (ML) cost function for data affected by Laplacian noise [9]. It has been shown [9] that, under certain assumptions given in section II-C, the optimal value ǫ_opt for ǫ when the output is affected by Laplacian noise is 0. For other types of noise it is often found that ǫ_opt ≠ 0, as the empirical risk component of the cost function does not correspond to the ML cost in such cases. The presence of ǫ allows us to achieve a degree of noise insensitivity even though the cost function does not correspond to the ML cost function.

The question raised by these observations is whether one may achieve better performance in the SVR for non-Laplacian noise by modifying the primal cost function to match the ML cost function when C is large. However, when doing so we must be mindful of the effect any changes may have on the mathematical tractability of the problem, as mathematical simplicity is one of the major strengths of the SV approach.

One variant of standard SVR which is known to have a particularly simple dual form is Suykens' least squares (LS) SV method [10]. This uses the following modified primal cost function:

R_{LS}(w, b) = \frac{1}{2} w^T w + \frac{C}{N} \sum_{(x_i, z_i) \in \Theta} ( g(x_i) - z_i )^2

If C is sufficiently large the second (empirical risk) term in this expression will be dominant. But this empirical risk term is just the ML cost function for training data affected by Gaussian noise, so if C is sufficiently large then R_{LS}(w, b) will correspond approximately to the ML cost function for training data affected by Gaussian noise.

Hence one would expect the LS-SVR to perform well in the presence of Gaussian noise. However, if the noise is not Gaussian, the lack of ǫ-insensitivity in the primal is likely to make the LS-SVR excessively sensitive to noise. Motivated by this, we propose the following modification of the standard ǫ-SVR primal:

R_q(w, b) = \frac{1}{2} w^T w + \frac{C}{q N} \sum_{(x_i, z_i) \in \Theta} | g(x_i) - z_i |_\epsilon^q    (6)

where q \in \mathbb{Z}^+ is a constant. If C is large and ǫ = 0, the second (empirical risk) term will be dominant and hence R_q(w, b) will correspond approximately to the ML cost function for degree q polynomial noise, where polynomial noise of degree q is characterised by the density function p(\tau) = c e^{-d |\tau|^q} (c, d > 0 constants). If q = 1, (6) reduces to the standard ǫ-SV cost function. Similarly, if q = 2 and ǫ = 0, (6) reduces to the primal form of Suykens' LS-SVR. However, like the standard ǫ-SV cost function and unlike the LS-SVR cost function, (6) incorporates ǫ-insensitivity to achieve noise insensitivity when the empirical risk component of the cost function does not match the noise process affecting the training data. (As an alternative to the approach presented here, this problem may be tackled using a weighted LS-SVR method [11]. This involves a two-step process: first, a standard LS-SVR is constructed; based on this, weights are calculated for each training pair; these weights are subsequently used in a second, weighted LS-SVR, the training of which results in the trained machine.)

In terms of the usual slack variables, (6) may be written:

\min_{w, b, \xi, \xi^*} R_q(w, b, \xi, \xi^*) = \frac{1}{2} w^T w + \frac{C}{q N} \sum_{i=1}^{N} ( \xi_i^q + \xi_i^{*q} )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i    (7)
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*
\xi, \xi^* \ge \mathbf{0}

We shall refer to a regressor of the form (7) as a monomial ǫ-SVR, as the empirical risk term is a monomial function of Vapnik's ǫ-insensitive cost. Unfortunately, for the general case q > 2 the dual form of (7) is rather complicated [15]. For this reason we will restrict ourselves to the special case q = 2 (quadric ǫ-SVR), in which case it turns out that the dual problem is mathematically nice. Before we construct the dual form of (7) we show that if q = 2 the positivity constraints \xi, \xi^* \ge \mathbf{0} in (7) are superfluous, giving us the simplified primal problem:

\min_{w, b, \xi, \xi^*} R_2(w, b, \xi, \xi^*) = \frac{1}{2} w^T w + \frac{C}{2 N} \sum_{i=1}^{N} ( \xi_i^2 + \xi_i^{*2} )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i    (8)
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*

We have the following theorems:

Theorem 2: For every solution (w, b, \xi, \xi^*) of (8), \xi, \xi^* \ge \mathbf{0}.

Proof: Suppose there exists a solution (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*) of (8) such that \bar{\xi}_i < 0 for some i. Then for all other (w, b, \xi, \xi^*) satisfying the constraints in (8), R_2(w, b, \xi, \xi^*) \ge R_2(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*) by definition. Consider (w, b, \xi, \xi^*), where w = \bar{w}, b = \bar{b}, \xi^* = \bar{\xi}^*, \xi_j = \bar{\xi}_j for all j \ne i, and \xi_i = 0. First, note that as \bar{w}^T \varphi(x_i) + \bar{b} - z_i \le \epsilon + \bar{\xi}_i, with \bar{\xi}_i < 0 and \xi_i = 0, it follows that w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i. Hence (w, b, \xi, \xi^*) satisfies the constraints in (8). Second, note that:

R_2(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*) = R_2(w, b, \xi, \xi^*) + \frac{C}{2N} \bar{\xi}_i^2 > R_2(w, b, \xi, \xi^*)

These two observations contradict the original assertion that (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*) with \bar{\xi}_i < 0 for some i was a solution of (8). Hence, for all solutions (w, b, \xi, \xi^*) of (8), \xi \ge \mathbf{0}. The proof of the non-negativity of \xi^* follows from an analogous argument for the elements of that vector.

Theorem 3: Any solution (w, b, \xi, \xi^*) of (8) will also be a solution of (7) when q = 2, and vice-versa.

Proof: This follows trivially from theorem 2.
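For comparison of the three cost functions, here is a small sketch (ours, under the same illustrative explicit-Phi assumption as before) of the monomial primal risk (6); q = 1 recovers the standard ǫ-SVR risk, and q = 2 with eps = 0 recovers the LS-SVR risk:

```python
import numpy as np

def monomial_risk(w, b, Phi, z, C, eps, q=2):
    """Monomial epsilon-SVR risk (6):
    R_q(w, b) = 0.5 w'w + (C/(q N)) sum_i |g(x_i) - z_i|_eps^q."""
    N = len(z)
    r = np.maximum(np.abs(Phi @ w + b - z) - eps, 0.0)
    return 0.5 * w @ w + (C / (q * N)) * (r ** q).sum()
```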

Using (8) it is straightforward to construct the dual form of (7) when q = 2 via the usual method. The dual is:

\min_{\alpha, \alpha^*} L_2(\alpha, \alpha^*) = \frac{1}{2} (\alpha - \alpha^*)^T G (\alpha - \alpha^*) + \frac{N}{2C} \alpha^T \alpha + \frac{N}{2C} \alpha^{*T} \alpha^* - (\alpha - \alpha^*)^T z + \epsilon (\alpha + \alpha^*)^T \mathbf{1}    (9)

such that:

\alpha, \alpha^* \ge \mathbf{0}, \quad \mathbf{1}^T (\alpha - \alpha^*) = 0

where G is as before. The trained machine takes the same form as in the standard case. Note that the only constraints in (9) are positivity constraints on the dual variables α and α*, and a single equality constraint. Furthermore, \xi = \frac{N}{C} \alpha.

B. Training Issues

To clearly see the structure of the monomial ǫ-SVR dual problem (9), it is instructive to re-express it as follows:

\min_{\alpha, \alpha^*} L_2(\alpha, \alpha^*) = \frac{1}{2} \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}^T Q \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix} + \begin{bmatrix} \alpha \\ \alpha^* \end{bmatrix}^T s
\quad \text{such that: } \alpha, \alpha^* \ge \mathbf{0}, \; \mathbf{1}^T (\alpha - \alpha^*) = 0    (10)

where:

Q = \begin{bmatrix} G & -G \\ -G & G \end{bmatrix} + \frac{N}{C} I, \qquad s = \begin{bmatrix} \epsilon \mathbf{1} - z \\ \epsilon \mathbf{1} + z \end{bmatrix}

Once we convert the problem into this form, it is essentially trivial to apply standard SVM training techniques (for example [16], [17]) to the problem. We also have the following useful property, which our formulation shares with Suykens' LS-SVR method:

Theorem 4: The dual problem (10) has a unique global solution.

Proof: It follows from page 79 of [18] that in order to prove that (10) has a unique solution it is sufficient to prove that Q is positive definite. Since we are using a Mercer kernel, G is positive semidefinite, and hence \begin{bmatrix} G & -G \\ -G & G \end{bmatrix} must be positive semidefinite. Given that C > 0, the term \frac{N}{C} I is positive definite. Hence Q must be positive definite, and (10) must have a unique global solution.

As an aside, note that if G is not positive semidefinite to working precision, Q may still be made positive definite in the manner of the Levenberg-Marquardt method [19] [20] by choosing C appropriately.
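The structure of (10) is easy to verify numerically. The sketch below (ours; the N/C diagonal term follows the normalisation conventions adopted above) assembles Q from a Gram matrix G and confirms positive definiteness via a Cholesky factorisation, which succeeds precisely when Q is numerically positive definite:

```python
import numpy as np

def quadric_dual_hessian(G, C):
    """Q = [[G, -G], [-G, G]] + (N/C) I, the Hessian of the quadric dual (10).

    For a Mercer (positive semidefinite) G and C > 0 the diagonal term makes
    Q strictly positive definite, hence the dual has a unique global solution.
    """
    N = G.shape[0]
    return np.block([[G, -G], [-G, G]]) + (N / C) * np.eye(2 * N)

def is_positive_definite(Q):
    """np.linalg.cholesky raises LinAlgError iff Q is not positive definite."""
    try:
        np.linalg.cholesky(Q)
        return True
    except np.linalg.LinAlgError:
        return False
```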

C. Asymptotically Optimal Selection of ǫ

In [9], Smola describes how, based on certain assumptions, the parameter ǫ may be selected in an optimal fashion for the standard ǫ-SVR. While the assumptions made in that paper are not met by the SVR, experimental results suggest that regardless of this the predictions made are reasonably accurate, and certainly provide a useful first guess of the optimal value for ǫ. More importantly, Smola derives a relationship between the variance of the measurement noise and the efficiency of the trained machine if the machine is trained using a given ǫ. Smola's (and our) assumptions are:

- The training set is infinitely large.
- The function g estimated by the SVR converges to the actual relationship ĝ.
- The general SVR model is replaced by an unregularised location parameter estimator.

Mathematically, the third assumption is met in the limit K → 1, C → ∞, and implies that g(x) = b. Throughout the present paper, these will be referred to as the usual assumptions.

Following [9], assume Z = \{ z_1, z_2, \ldots, z_N \} is drawn in an i.i.d. manner from some probability density function p(z - \theta) with mean θ and variance σ². Let \hat{\theta}(Z) be an unbiased estimator for θ. The efficiency e of the estimator is defined to be [22]:

e = \frac{1}{\det( I B )}

where I is the Fisher information matrix and B is the covariance matrix [21]. Loosely speaking, the higher the efficiency, the better the estimator, as there is less variance in the estimate of the mean θ. By optimal selection of ǫ, we mean choosing ǫ to maximise the efficiency e. The value which achieves this maximum will be written ǫ_opt. The Cramer-Rao bound [21] states that e ≤ 1.

For a single parameter estimator of the form:

\hat{\theta}(Z) = \arg \min_{\hat{\theta}} \sum_{z \in Z} d(z, \hat{\theta})

where d(z, \hat{\theta}) is a piecewise twice differentiable function of \hat{\theta}, then, asymptotically (see the lemma in [9]):

e = \frac{Q^2}{I G}

where:

I = \int \left( \frac{\partial}{\partial \theta} \ln p(z - \theta) \right)^2 p(z - \theta) \, dz
G = \int \left( \frac{\partial d(z, \theta)}{\partial \theta} \right)^2 p(z - \theta) \, dz
Q = \int \frac{\partial^2 d(z, \theta)}{\partial \theta^2} p(z - \theta) \, dz

Making the usual assumptions, the monomial ǫ-SVR is an estimator of this form. That is:

d(z, \hat{\theta}) = \frac{1}{q} | z - \hat{\theta} |_\epsilon^q

where \hat{\theta} = b and q \in \mathbb{Z}^+. Define p_{std}(\tau) and q_{std}(\tau) to be, respectively, the normalised (zero mean, unit variance) and symmetrised normalised distribution functions. That is:

p_{std}(\tau) = \sigma p(\sigma \tau), \qquad q_{std}(\tau) = p_{std}(\tau) + p_{std}(-\tau)

Then, denoting \Lambda = \epsilon / \sigma:

I = \frac{1}{\sigma^2} I_{std}, \qquad I_{std} = \int \left( \frac{\partial}{\partial \tau} \ln p_{std}(\tau) \right)^2 p_{std}(\tau) \, d\tau

G = \sigma^{2q-2} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{2q-2} \, q_{std}(\tau) \, d\tau

Q = \begin{cases} \frac{1}{\sigma} q_{std}(\Lambda) & \text{if } q = 1 \\ (q-1) \, \sigma^{q-2} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{q-2} \, q_{std}(\tau) \, d\tau & \text{if } q \ge 2 \end{cases}

Consequently:

e = \begin{cases} \dfrac{ q_{std}(\Lambda)^2 }{ I_{std} \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau } & \text{if } q = 1 \\[2ex] \dfrac{ (q-1)^2 \left( \int_{\Lambda}^{\infty} (\tau - \Lambda)^{q-2} \, q_{std}(\tau) \, d\tau \right)^2 }{ I_{std} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{2q-2} \, q_{std}(\tau) \, d\tau } & \text{if } q \ge 2 \end{cases}

is the efficiency of the monomial ǫ-SVR under the usual assumptions. Note that σ has cancelled, so e depends on ǫ and σ only through Λ = ǫ/σ. The optimal value ǫ_opt of the parameter ǫ is defined to be the value which maximizes this efficiency (thereby minimising the variance in the estimate of b). Defining:

\Lambda_{opt} = \arg \max_{\Lambda} e

it is clear that maximum efficiency will be achieved by setting \epsilon = \epsilon_{opt} = \Lambda_{opt} \sigma. This implies that ǫ_opt is directly proportional to the noise standard deviation σ, where the constant of proportionality is dependent on the type of noise. If the type and amount (variance) of the noise are both known, this provides a reasonable basis for selecting ǫ (or at least a reasonable first guess).

If q = 2, the above can be used to obtain the theoretical efficiency of Suykens' LS-SVR under the usual assumptions by setting ǫ = 0. For later reference, we define e_LS to be this efficiency, i.e.:

e_{LS} = \frac{ \left( \int_{0}^{\infty} q_{std}(\tau) \, d\tau \right)^2 }{ I_{std} \int_{0}^{\infty} \tau^2 \, q_{std}(\tau) \, d\tau }
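The efficiency expressions above are straightforward to evaluate numerically. The following sketch (ours, using Gaussian noise as the worked example, for which I_std = 1) computes e(Λ) by quadrature and locates Λ_opt by grid search:

```python
import numpy as np
from scipy.integrate import quad

def q_std_gauss(t):
    """Symmetrised unit-variance Gaussian: q_std(t) = 2 p_std(t) for t >= 0."""
    return 2.0 * np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

I_STD_GAUSS = 1.0  # Fisher information of the unit-variance Gaussian

def efficiency(lam, q=2, q_std=q_std_gauss, i_std=I_STD_GAUSS):
    """Asymptotic efficiency e(Lambda) of the monomial epsilon-SVR."""
    if q == 1:
        return q_std(lam) ** 2 / (i_std * quad(q_std, lam, np.inf)[0])
    m1 = quad(lambda t: (t - lam) ** (q - 2) * q_std(t), lam, np.inf)[0]
    m2 = quad(lambda t: (t - lam) ** (2 * q - 2) * q_std(t), lam, np.inf)[0]
    return ((q - 1) * m1) ** 2 / (i_std * m2)

lams = np.linspace(0.0, 3.0, 301)
lam_opt = lams[np.argmax([efficiency(l, q=2) for l in lams])]
```

For Gaussian noise and q = 2 this grid search returns Λ_opt = 0, consistent with theorem 9 of section IV.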

D. Sparsity of the monomial ǫ-SVR

As discussed previously, part of our motivation for developing the framework for monomial ǫ-SVR is Suykens' LS-SVR technique. Indeed, if C is sufficiently large and the noise affecting measurements is Gaussian, LS-SVR corresponds approximately to the ML estimator for the parameters w and b. However, a downside of the LS-SVR approach is the lack of sparsity in the solution [11], where by sparsity we are referring here to the number of non-zero elements in the vector α - α* or, equivalently (see the classification of training vectors in section II), the fraction of training vectors that are also support vectors. In the LS-SVR case, all training vectors are support vectors, i.e. S = N.

While the sparsity problems of the LS-SVR may be overcome to a degree using the weighted LS-SVR approach [11], this approach is somewhat heuristic in nature, requiring some external arbiter (either human or machine) to decide when to cease pruning the dataset. However, by maintaining the ǫ-insensitive component of the cost function, the need for such heuristics (at least at this level of abstraction) is removed. (Of course, some heuristic input will still be required for each approach, either for ǫ selection or when deciding how much compromise is acceptable during pruning. The advantage of the former is that there exist alternative criteria which may be used to select ǫ, i.e. optimal performance for a given noise model, with sparsity properties being just a useful side effect of this choice. If the dataset is very large, however, sparsity may be of primary importance, in which case neither approach will have a clear advantage.) So long as ǫ > 0, we would expect that the solution α - α* of the monomial ǫ-SVR dual problem will contain some non-zero fraction of non-support vectors. Under the usual assumptions, we have the following theorem:

Theorem 5: The fraction of support vectors found by the monomial ǫ-SVR under the usual assumptions will, asymptotically, be:

\lim_{N \to \infty} \frac{S}{N} = \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau

where \Lambda = \epsilon / \sigma.

Proof: Given that error vectors are those for which | g(x_i) - z_i | - \epsilon > 0 (i.e. those lying outside the ǫ-tube), using theorem 1 it can be seen that:

\lim_{N \to \infty} \frac{S}{N} = \lim_{N \to \infty} \frac{E}{N} = \Pr( | g(x) - \hat{g}(x) | > \epsilon ) = \int_{z \in \mathbb{R} \setminus [\theta - \epsilon, \theta + \epsilon]} p(z - \theta) \, dz = \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau

where \Lambda = \epsilon / \sigma.

Note that this implies that:

\lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{S}{N} = \int_{0}^{\infty} q_{std}(\tau) \, d\tau = 1

as expected for the LS-SVR. It also implies that any decrease in ǫ is likely to lead to a decrease in the sparsity of the solution. So, in general, if the training set is large then ǫ should be chosen to be as large as possible, while still maintaining acceptable performance, to maximise the sparsity of the solution.
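Theorem 5 likewise lends itself to direct numerical evaluation; a short sketch (ours), again using the Gaussian example:

```python
import numpy as np
from scipy.integrate import quad

def q_std_gauss(t):
    """Symmetrised unit-variance Gaussian density, t >= 0."""
    return 2.0 * np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

def sv_fraction(lam, q_std=q_std_gauss):
    """Asymptotic support-vector fraction: lim S/N = int_Lambda^inf q_std."""
    return quad(q_std, lam, np.inf)[0]

print(sv_fraction(0.0))  # 1.0: the LS-SVR limit, every vector a support vector
print(sv_fraction(1.0))  # ~0.317: mass outside a one-standard-deviation tube
```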

III. ν-SV REGRESSION

The major drawback of ǫ-SVR is apparent in the relation \epsilon_{opt} = \Lambda_{opt} \sigma. Specifically, selection of ǫ requires knowledge of what type of, and how much, noise is present in the training set (or alternatively we must have a specific intended accuracy to use as a basis for selecting ǫ). However, while we may have some idea of the type of noise we can expect to be dealing with for a given problem (and hence be able to calculate \Lambda_{opt}), we are unlikely to know how much noise will be present, leaving σ and therefore ǫ_opt uncertain.

To overcome this problem, Scholkopf et al. [7] introduced the ν-SVR formulation of the problem, which includes an additional term in the primal problem to trade off the tube size (ǫ, no longer a constant) against model complexity and empirical risk. From [7], the primal formulation of the standard ν-SVR training problem is:

\min_{w, b, \xi, \xi^*, \epsilon} R(w, b, \xi, \xi^*, \epsilon) = \frac{1}{2} w^T w + C \nu \epsilon + \frac{C}{N} \sum_{i=1}^{N} ( \xi_i + \xi_i^* )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*    (26)
\xi, \xi^* \ge \mathbf{0}, \quad \epsilon \ge 0

where ν > 0 is a constant. The associated dual is [7]:

\min_{\alpha, \alpha^*} L(\alpha, \alpha^*) = \frac{1}{2} (\alpha - \alpha^*)^T G (\alpha - \alpha^*) - (\alpha - \alpha^*)^T z

such that:

\mathbf{0} \le \alpha \le \tfrac{C}{N} \mathbf{1}, \quad \mathbf{0} \le \alpha^* \le \tfrac{C}{N} \mathbf{1}    (27)
\mathbf{1}^T (\alpha - \alpha^*) = 0, \quad \mathbf{1}^T (\alpha + \alpha^*) \le C \nu

where G is as before. We require the following theorem:

Theorem 6 (see [14]): For a ν-SVR trained with any training set of size N:
- ν is an upper bound on the fraction of error vectors.
- ν is a lower bound on the fraction of support vectors.

From this theorem we can immediately see that:

\nu = \lim_{N \to \infty} \frac{S}{N}

A. Monomial ν-SV Regression

We shall demonstrate shortly that there is a direct functional relationship between ν and the theoretical efficiency e (under the usual assumptions) that is independent of σ. This means that it is possible to find ν_opt to achieve maximum theoretical efficiency without knowing the noise variance σ² (although the type of noise is still required). However, the price we pay for this is increased complexity in the dual formulation (27), specifically the presence of a new constraint, \mathbf{1}^T (\alpha + \alpha^*) \le C \nu. Also, like standard ǫ-SVR, the empirical risk component of the standard ν-SVR primal corresponds to ML only for Laplacian noise. To deal with these issues, we introduce the following formulation:

\min_{w, b, \xi, \xi^*, \epsilon} R_{q,r}(w, b, \xi, \xi^*, \epsilon) = \frac{1}{2} w^T w + \frac{C \nu}{r} \epsilon^r + \frac{C}{q N} \sum_{i=1}^{N} ( \xi_i^q + \xi_i^{*q} )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*    (28)
\xi, \xi^* \ge \mathbf{0}, \quad \epsilon \ge 0

where q, r \in \mathbb{Z}^+ are constants. We shall refer to this as monomial ν-SVR when q = r. This reduces to the standard ν-SVR primal (26) if we choose q = r = 1. We will be concentrating on the case q = r = 2 (quadric ν-SVR). We have already shown in theorem 2 (which holds here also) that the positivity constraints \xi, \xi^* \ge \mathbf{0} are superfluous if q = 2. We now show that the constraint \epsilon \ge 0 is also superfluous if r = 2, giving us the simplified primal problem when q = r = 2:

\min_{w, b, \xi, \xi^*, \epsilon} R_{2,2}(w, b, \xi, \xi^*, \epsilon) = \frac{1}{2} w^T w + \frac{C \nu}{2} \epsilon^2 + \frac{C}{2 N} \sum_{i=1}^{N} ( \xi_i^2 + \xi_i^{*2} )

such that:

w^T \varphi(x_i) + b - z_i \le \epsilon + \xi_i    (29)
z_i - w^T \varphi(x_i) - b \le \epsilon + \xi_i^*

We have the following theorems:

Theorem 7: For every solution (w, b, \xi, \xi^*, \epsilon) of (29), \epsilon \ge 0.

Proof: Suppose there exists a solution (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, \bar{\epsilon}) of (29) such that \bar{\epsilon} < 0. Then:

R_{2,2}(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, \bar{\epsilon}) = R_{2,2}(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, 0) + \frac{C \nu}{2} \bar{\epsilon}^2 > R_{2,2}(\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, 0)

Furthermore, if the constraints in (29) are satisfied for (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, \bar{\epsilon}) they must also be satisfied for (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, 0). Therefore (\bar{w}, \bar{b}, \bar{\xi}, \bar{\xi}^*, \bar{\epsilon}) does not minimise R_{2,2} subject to the constraints, and hence cannot be a solution of (29), which contradicts the original assertion. So we conclude that every solution (w, b, \xi, \xi^*, \epsilon) of (29) must satisfy \epsilon \ge 0.

Theorem 8: Any solution (w, b, \xi, \xi^*, \epsilon) of (29) will also be a solution of (28) when q = r = 2, and vice-versa.

Proof: This follows trivially from theorems 7 and 2.

It is not difficult to show that the dual form of (29) (and hence of (28) with q = r = 2) is:

\min_{\alpha, \alpha^*} L_{2,2}(\alpha, \alpha^*) = \frac{1}{2} (\alpha - \alpha^*)^T G (\alpha - \alpha^*) + \frac{1}{2 C \nu} (\alpha + \alpha^*)^T E (\alpha + \alpha^*) + \frac{N}{2C} \alpha^T \alpha + \frac{N}{2C} \alpha^{*T} \alpha^* - (\alpha - \alpha^*)^T z    (30)

such that:

\alpha, \alpha^* \ge \mathbf{0}, \quad \mathbf{1}^T (\alpha - \alpha^*) = 0

where E is a matrix whose elements are all 1. The trained machine takes the same form as before. Furthermore:

\epsilon = \frac{1}{C \nu} \mathbf{1}^T (\alpha + \alpha^*), \qquad \xi = \frac{N}{C} \alpha

from which we see that:

\epsilon = \frac{1}{\nu N} \sum_{(x_i, z_i) \in \Theta} | g(x_i) - z_i |_\epsilon

Note that the only constraints in (30) are positivity constraints on the dual variables α and α*, and a single equality constraint. By comparison, the standard ν-SVR dual (27), which is otherwise of comparable complexity, has upper bounds on these variables and an upper bound on \mathbf{1}^T (\alpha + \alpha^*). For training purposes, note that (30) may be expressed identically to (10) by setting:

Q = \begin{bmatrix} G & -G \\ -G & G \end{bmatrix} + \frac{1}{C \nu} \begin{bmatrix} E & E \\ E & E \end{bmatrix} + \frac{N}{C} I, \qquad s = \begin{bmatrix} -z \\ z \end{bmatrix}

Like the monomial ǫ-SVR, this form will have a unique global minimum (to see this, note that the additional term involving E is positive semidefinite and therefore will not affect the validity of theorem 4).

Finally, note that in the limit ν → ∞ it is clear from the above that if q = r = 2 then ǫ → 0. In fact, it is clear from the general form of the monomial ν-SVR (28) that as ν → ∞ we must have ǫ → 0 to ensure that the primal cost R_{q,r} remains finite. This implies that in the limit ν → ∞ the form of the quadric ν-SVR approaches the form of the LS-SVR.
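In code, the quadric ν-SVR dual differs from the quadric ǫ-SVR dual only by the tube term, and ǫ is recovered from the dual solution afterwards. A minimal sketch (ours, under the same normalisation conventions as the earlier Hessian sketch):

```python
import numpy as np

def quadric_nu_dual_hessian(G, C, nu):
    """Hessian of the quadric nu-SVR dual (30): the quadric epsilon-SVR
    Hessian plus the tube term (1/(C nu)) [[E, E], [E, E]], E = ones."""
    N = G.shape[0]
    E = np.ones((N, N))
    Q = np.block([[G, -G], [-G, G]]) + (N / C) * np.eye(2 * N)
    return Q + np.block([[E, E], [E, E]]) / (C * nu)

def recover_eps(alpha, alpha_star, C, nu):
    """Tube width implied by a dual solution: eps = 1'(alpha + alpha*)/(C nu)."""
    return (alpha + alpha_star).sum() / (C * nu)
```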

B. Asymptotically Optimal Selection of ν

In section II-C, we showed that for ǫ-SVR the theoretical efficiency e is a function of \Lambda = \epsilon / \sigma. We now show that e may also be expressed (somewhat indirectly) as a function of ν (making the usual assumptions), independent of σ, if q = r. The advantage of this form is that, unlike for ǫ_opt, calculation of ν_opt (the value of ν which results in maximum efficiency) does not require knowledge of σ.

Consider the regularised cost function in its primal form (28), where the positivity constraints on ǫ, ξ and ξ* may or may not be superfluous, and q, r \in \mathbb{Z}^+. First, we note that:

\frac{\partial R_{q,r}}{\partial \epsilon} = C \left( \nu \epsilon^{r-1} + \frac{1}{N} \sum_{i=1}^{N} \left( \xi_i^{q-1} \frac{\partial \xi_i}{\partial \epsilon} + \xi_i^{* \, q-1} \frac{\partial \xi_i^*}{\partial \epsilon} \right) \right)

where:

\frac{\partial \xi_i}{\partial \epsilon} = \begin{cases} -1 & \text{if } g(x_i) > z_i + \epsilon \text{, or } g(x_i) = z_i + \epsilon \text{ and } \delta\epsilon < 0 \\ 0 & \text{otherwise} \end{cases}

\frac{\partial \xi_i^*}{\partial \epsilon} = \begin{cases} -1 & \text{if } g(x_i) < z_i - \epsilon \text{, or } g(x_i) = z_i - \epsilon \text{ and } \delta\epsilon < 0 \\ 0 & \text{otherwise} \end{cases}

(We have taken some liberties here when calculating this derivative, which is not actually well defined at the tube boundary. However, the rate of change as ǫ is either increased or decreased is well defined, which is what we are indicating here with the notation δǫ.)

and hence:

\frac{\partial R_{q,r}}{\partial \epsilon} = \begin{cases} C \left( \nu \epsilon^{r-1} - \frac{S}{N} \right) & \text{if } q = 1, \; \delta\epsilon < 0 \\ C \left( \nu \epsilon^{r-1} - \frac{E}{N} \right) & \text{if } q = 1, \; \delta\epsilon > 0 \\ C \left( \nu \epsilon^{r-1} - \frac{1}{N} \sum_{i=1}^{N} ( \xi_i^{q-1} + \xi_i^{* \, q-1} ) \right) & \text{otherwise} \end{cases}
= \begin{cases} C \left( \nu \epsilon^{r-1} - \frac{S}{N} \right) & \text{if } q = 1, \; \delta\epsilon < 0 \\ C \left( \nu \epsilon^{r-1} - \frac{E}{N} \right) & \text{if } q = 1, \; \delta\epsilon > 0 \\ C \left( \nu \epsilon^{r-1} - \frac{1}{N} \sum_{i=1}^{N} | z_i - g(x_i) |_\epsilon^{q-1} \right) & \text{otherwise} \end{cases}

We aim to select ν to arrive at a specific ǫ. In order to achieve this, the optimality condition \frac{\partial R_{q,r}}{\partial \epsilon} = 0 must be met for this particular value of ǫ. In other words, if q ≥ 2 we require that:

\nu = \epsilon^{1-r} \frac{1}{N} \sum_{i=1}^{N} ( \xi_i^{q-1} + \xi_i^{* \, q-1} ) = \epsilon^{1-r} \frac{1}{N} \sum_{i=1}^{N} | z_i - w^T \varphi(x_i) - b |_\epsilon^{q-1}

Applying the usual assumptions (g(x) = b, N → ∞), we find that in all cases (the case q = 1 is covered in the footnote below):

\nu = \sigma^{q-r} \Lambda^{1-r} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{q-1} \, q_{std}(\tau) \, d\tau

where, as usual, \Lambda = \epsilon / \sigma, and q_{std}(\tau) is as defined in section II-C. If q = r then:

\nu = \Lambda^{1-q} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{q-1} \, q_{std}(\tau) \, d\tau

This implies that if q = r then ν may be selected to achieve a particular efficiency e_select (assuming said efficiency is achievable) by finding the Λ required to achieve this efficiency and then substituting this Λ into the relevant expression for ν. At no point is it necessary to use σ, so only the noise type is required. Note that if we select q = r = 1, we retrieve Smola's original result [9]:

\nu = \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau

from which we see that for a standard ν-SVR, if ν = 1 then, in the limit, \Lambda = \epsilon = 0. Generally, if q = r then in order to achieve maximum efficiency we should choose:

\nu_{opt} = \Lambda_{opt}^{1-q} \int_{\Lambda_{opt}}^{\infty} (\tau - \Lambda_{opt})^{q-1} \, q_{std}(\tau) \, d\tau

where \Lambda_{opt} = \arg \max_{\Lambda} e.

(Footnote: If q = 1 then, in the limit N → ∞, the derivative becomes:

\lim_{N \to \infty} \frac{\partial R_{1,r}}{\partial \epsilon} = \begin{cases} C \left( \nu \epsilon^{r-1} - \lim_{N \to \infty} \frac{S}{N} \right) & \text{if } \delta\epsilon < 0 \\ C \left( \nu \epsilon^{r-1} - \lim_{N \to \infty} \frac{E}{N} \right) & \text{if } \delta\epsilon > 0 \end{cases}

which is well defined, as theorem 1 states that \lim_{N \to \infty} \frac{S}{N} = \lim_{N \to \infty} \frac{E}{N}. For optimality, \frac{\partial R_{1,r}}{\partial \epsilon} = 0, and so using theorem 5 it follows that, if q = 1:

\nu = \sigma^{1-r} \Lambda^{1-r} \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau )
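Putting the last two results together gives a variance-free recipe for choosing ν: find Λ_opt from the efficiency curve, then map it through ν(Λ). A sketch (ours, again with the Gaussian example):

```python
import numpy as np
from scipy.integrate import quad

def q_std_gauss(t):
    """Symmetrised unit-variance Gaussian density, t >= 0."""
    return 2.0 * np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)

def nu_from_lambda(lam, q=2, q_std=q_std_gauss):
    """nu(Lambda) = Lambda^(1-q) int_Lambda^inf (t - Lambda)^(q-1) q_std(t) dt."""
    tail = quad(lambda t: (t - lam) ** (q - 1) * q_std(t), lam, np.inf)[0]
    return lam ** (1.0 - q) * tail

# For Gaussian noise and q = 2, Lambda_opt = 0 and nu(Lambda) diverges as
# Lambda -> 0+, reflecting the LS-SVR limit nu -> infinity noted above:
for lam in (1.0, 0.1, 0.01):
    print(lam, nu_from_lambda(lam))
```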

C. Sparsity of the Monomial ν-SVR

We have shown in theorem 5 that, under the usual assumptions:

\lim_{N \to \infty} \frac{S}{N} = \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau

which is a one-to-one function of Λ if the distribution q_{std}(\tau) is positive for all \tau \ge 0. We have also shown in the previous section that (again using these assumptions) if q = r then:

\nu = \Lambda^{1-q} \int_{\Lambda}^{\infty} (\tau - \Lambda)^{q-1} \, q_{std}(\tau) \, d\tau

which is also a one-to-one function of Λ if q_{std}(\tau) is positive for all \tau \ge 0. These two functions demonstrate that in general there exists a direct relation between the asymptotic value of the fraction of support vectors in the training set (and hence the sparsity of the solution) and the parameter ν. In general, the exact nature of this relation will be dependent on the form of the noise process affecting the training data. In the special case q = r = 1 (i.e. standard ν-SVR), we see that:

\nu = \int_{\Lambda}^{\infty} q_{std}(\tau) \, d\tau

and so:

\nu = \lim_{N \to \infty} \frac{S}{N}

which coincides with the asymptotic form of theorem 6.

IV. PERFORMANCE IN THE PRESENCE OF POLYNOMIAL NOISE

To gain a better understanding of the method, we will consider a particular example where the training data is affected by additive noise that is polynomial in nature (this includes Gaussian and Laplacian noise as special cases, p = 2 and p = 1 respectively). For polynomial noise of degree p:

p_{std}(\tau) = c_1 e^{-c_2 |\tau|^p}, \qquad p \in \mathbb{Z}^+

where:

c_1 = \frac{p \, c_2^{1/p}}{2 \, \Gamma( \frac{1}{p} )}, \qquad c_2 = \left( \frac{\Gamma( \frac{3}{p} )}{\Gamma( \frac{1}{p} )} \right)^{p/2}

We will consider the related questions of optimal parameter selection and sparsity, comparing Smola's standard ν-SVR, our monomial ν-SVR and Suykens' LS-SVR method.

A. Asymptotically Optimal Selection of ǫ and ν

Using the efficiency and ν expressions of sections II-C and III-B, it is not difficult to show that, under the usual assumptions, for a monomial ν-SVR (denoting by e_{p,q} the efficiency of an order q monomial SVR for a training set affected by polynomial noise of order p, and likewise by \nu_{p,q}(\Lambda) the connection between ν and Λ):

e_{p,q} = \begin{cases} \dfrac{ 2 c_1 \, e^{-2 c_2 \Lambda^p} }{ I_{std} \, \daleth_0(p, c_2; \Lambda) } & \text{if } q = 1 \\[2ex] \dfrac{ 2 c_1 (q-1)^2 \, \daleth_{q-2}(p, c_2; \Lambda)^2 }{ I_{std} \, \daleth_{2q-2}(p, c_2; \Lambda) } & \text{if } q \ge 2 \end{cases}

e_{LS} = \frac{ \Gamma( \frac{1}{p} )^2 }{ p^2 \, \Gamma( 2 - \frac{1}{p} ) \, \Gamma( \frac{3}{p} ) }

where I_{std} = \frac{ p^2 c_2^{2/p} \Gamma( 2 - \frac{1}{p} ) }{ \Gamma( \frac{1}{p} ) }, and, for q = r:

\nu_{p,q}(\Lambda) = 2 c_1 \Lambda^{1-q} \, \daleth_{q-1}(p, c_2; \Lambda)

Here:

\daleth_m(p, \beta; \Lambda) = \int_{\Lambda}^{\infty} (\tau - \Lambda)^m e^{-\beta \tau^p} \, d\tau = \sum_{i=0}^{m} \binom{m}{i} (-\Lambda)^{m-i} \frac{1}{p} \beta^{-\frac{i+1}{p}} \Gamma\!\left( \frac{i+1}{p}, \beta \Lambda^p \right)

for m \in \mathbb{Z} \setminus \mathbb{Z}^- and p \in \mathbb{Z}^+; and \Gamma(a, x) is the complementary incomplete gamma function [23]:

\Gamma(a, x) = \int_{x}^{\infty} t^{a-1} e^{-t} \, dt
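The constants c_1, c_2 and samples of polynomial noise are easily generated. The sketch below (ours) uses the standard construction of generalised Gaussian variates from Gamma variates, an assumption of convenience rather than the paper's own sampler:

```python
import numpy as np
from scipy.special import gamma

def poly_noise_constants(p):
    """c1, c2 for the unit-variance density p_std(t) = c1 exp(-c2 |t|^p)."""
    c2 = (gamma(3.0 / p) / gamma(1.0 / p)) ** (p / 2.0)
    c1 = p * c2 ** (1.0 / p) / (2.0 * gamma(1.0 / p))
    return c1, c2

def sample_poly_noise(p, sigma, size, rng=None):
    """Polynomial (generalised Gaussian) noise of degree p, std dev sigma.

    If U ~ Gamma(1/p, 1), then sign * (U/c2)^(1/p) has density c1 exp(-c2|t|^p);
    scaling by sigma gives the requested standard deviation.
    """
    rng = np.random.default_rng() if rng is None else rng
    _, c2 = poly_noise_constants(p)
    u = rng.gamma(1.0 / p, 1.0, size)
    s = rng.choice([-1.0, 1.0], size)
    return sigma * s * (u / c2) ** (1.0 / p)
```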

Table I shows the optimal values for ǫ and ν for polynomial noise of degrees 1 to 6 for both standard and quadric ǫ-SVR and ν-SVR (source: [12]).

TABLE I. Optimal ǫ and ν for standard and quadric (labelled S and Q, respectively) ǫ-SVR and ν-SVR methods with polynomial additive noise of degrees 1 to 6, and asymptotic support vector ratio S/N at optimality. (Rows: ǫ_opt(S)/σ, ν_opt(S), lim S/N for S, ǫ_opt(Q)/σ, ν_opt(Q), lim S/N for Q; columns: degree p = 1 to 6.)

Unsurprisingly, the optimal value of ǫ for the quadric ǫ-SVR in the presence of Gaussian noise (p = 2) is 0, as in this case the quadric ǫ-SVR primal and the maximum-likelihood estimator both take approximately the same form if ǫ = 0 and C is large, just as ǫ_opt = 0 for the standard ǫ-SVR in the presence of Laplacian noise (p = 1). Note also that, for the cases represented in the table, ǫ_opt > 0 for all p > q and ǫ_opt = 0 for all p ≤ q. Generally:

Theorem 9: Under the usual assumptions, given a set of training data affected by polynomial noise of degree p = q with non-zero variance, the optimal value ǫ_opt (which maximizes e_{p,q}) for the parameter ǫ will be zero.

Theorem 10: Under the usual assumptions, given a set of training data affected by polynomial noise of degree p > q with non-zero variance, the optimal value ǫ_opt (which maximizes e_{p,q}) for the parameter ǫ will be positive.

Conjecture 1: Under the usual assumptions, given a set of training data affected by polynomial noise of degree p < q with non-zero variance, the optimal value ǫ_opt (which maximizes e_{p,q}) for the parameter ǫ will be zero.

The proofs for theorems 9 and 10, and a partial proof of conjecture 1 (which we have observed experimentally to be true), may be found in appendix I.

Figure 1 shows the inverse asymptotic efficiency versus Λ = ǫ/σ for both standard and quadric ǫ-SVRs, as well as LS-SVR. Note that, with the exception of Laplacian noise (p = 1), the optimal theoretical efficiency of the quadric ǫ-SVR exceeds the optimal theoretical efficiency of the standard ǫ-SVR. Also note that the efficiency of the monomial ǫ-SVR methods exceeds (or at worst equals) that of the LS-SVR for all p.

Fig. 1. Comparative inverse asymptotic efficiency versus ǫ/σ of standard ǫ-SVR, quadric ǫ-SVR and LS-SVR for polynomial noise of degrees 1 to 6 (one panel per degree). In all cases, the solid line represents the efficiency of the quadric ǫ-SVR, the dotted line the efficiency of the standard ǫ-SVR, and the dashed line the efficiency of the LS-SVR.

Finally, figure 2 shows the inverse asymptotic efficiency versus ν for both standard and quadric ν-SVRs, as well as LS-SVR. The important thing to note from these graphs is that, although the range of ν is much larger for quadric ν-SVR methods than for standard ν-SVR methods, the efficiency itself quickly flattens out. In particular, although the theoretically optimal value of ν for Gaussian noise is unbounded (ν → ∞), it can be seen that even at the largest ν shown the efficiency is very close to its maximum, e = 1. Also note the comparative flatness of the efficiency curves for the quadric ν-SVR.

B. Sparsity Issues

In the case of polynomial noise, theorem 5 implies that for monomial SVRs:

\lim_{N \to \infty} \frac{S}{N} = 2 c_1 \, \daleth_0(p, c_2; \Lambda)

We have already observed that for monomial ν-SVRs:

\nu_{p,q}(\Lambda) = 2 c_1 \Lambda^{1-q} \, \daleth_{q-1}(p, c_2; \Lambda)

Or, in the two cases of particular interest, standard (q = 1) and quadric (q = 2) ν-SVR:

\nu_{p,1}(\Lambda) = 2 c_1 \, \daleth_0(p, c_2; \Lambda)
\nu_{p,2}(\Lambda) = \frac{2 c_1}{\Lambda} \, \daleth_1(p, c_2; \Lambda)

In the case q = 1, as expected, this implies:

\lim_{N \to \infty} \frac{S}{N} = \nu
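The curves of figure 3 (below) can be reproduced by sweeping Λ and evaluating (ν(Λ), lim S/N) parametrically; a sketch (ours, with the ℸ integrals evaluated by quadrature rather than via incomplete gamma functions):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def sv_fraction_vs_nu(p, q, lams):
    """Parametric curve (nu(Lambda), lim S/N) for degree-p polynomial noise.

    lams must be positive; for q >= 2 the nu coordinate diverges at Lambda = 0.
    """
    c2 = (gamma(3.0 / p) / gamma(1.0 / p)) ** (p / 2.0)
    c1 = p * c2 ** (1.0 / p) / (2.0 * gamma(1.0 / p))
    q_std = lambda t: 2.0 * c1 * np.exp(-c2 * t ** p)
    nus, fracs = [], []
    for lam in lams:
        tail = quad(lambda t: (t - lam) ** (q - 1) * q_std(t), lam, np.inf)[0]
        nus.append(lam ** (1.0 - q) * tail)
        fracs.append(quad(q_std, lam, np.inf)[0])
    return np.array(nus), np.array(fracs)

# For q = 1 the two coordinates coincide (nu = lim S/N), as noted above.
```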

Fig. 2. Comparative inverse asymptotic efficiency versus ν of standard ν-SVR, quadric ν-SVR and LS-SVR for polynomial noise of degrees 1 to 6 (one panel per degree). In all cases, the solid line represents the efficiency of the quadric ν-SVR, the dotted line the efficiency of the standard ν-SVR, and the dashed line the efficiency of the LS-SVR. Note that for all graphs the ν-scale differs for standard and quadric ν-SVRs: for standard ν-SVRs, the actual setting of ν is one tenth of that indicated on the x axis; for quadric ν-SVRs, ν is as indicated.

The case q = 2 is slightly more complex, and best illustrated graphically. We do this in figure 3, which shows our predictions for the fraction of support vectors found as a function of ν for both standard and quadric ν-SVR methods for polynomial noise of degrees 1 to 6. Note that the general shape of the curves is essentially identical in all cases. Generally, the fraction of support vectors found by the quadric ν-SVR will increase quickly while ν is small and then level out, approaching 1 as ν → ∞ (as expected, given that the LS case corresponds to ν → ∞ and treats all vectors as support vectors).

Table I gives the expected asymptotic ratio of support vectors to training vectors when ν is optimally selected for the usual degrees of polynomial noise. On average, the results given in the table imply that quadric ν-SVRs may require approximately twice as many support vectors as standard ν-SVRs to achieve optimal accuracy on the same dataset. This may be understood by realising that the act of extracting support vectors is essentially a form of lossy compression. The modified ν-SVR is theoretically able to achieve more accurate results than the standard ν-SVR because it can handle more information by using less compression or, equivalently, finding more support vectors before over-fitting and subsequent degradation in performance begins.

V. EXPERIMENTAL RESULTS

Due to the restrictive assumptions made when deriving the results for theoretical efficiency given in the preceding sections (which will, in the strictest sense, never be satisfied for any SVR), it is important that we seek some form of experimental confirmation of these results. This is our aim in the present section. For ease of comparison, our experimental procedure is modelled on that of [24]. We have numerically computed the risk (in the form of root mean squared error, RMSE) as a function of ν for both standard and quadric ν-SVR methods, allowing clear comparison between the two methods. We also compute the RMSE for Suykens' LS-SVR, which is the limiting case of the quadric ν-SVR as ν → ∞, and for (non-sparse) weighted LS-SVR [11]. Plots of risk versus ν are given for polynomial noise of degrees 1 to 6 to compare the theoretical and experimental results, and some relevant results are given for the effect of other parameters on the ν curves. Finally, we compare the sparsity of standard and quadric ν-SVR for different orders of polynomial noise (1 to 6), and compare these results with the theoretical predictions.

As in [24], the training set consisted of examples (x_i, z_i) where x_i is drawn uniformly from a fixed interval and z_i is given by the noisy sinc function:

z_i = \mathrm{sinc}(x_i) + \zeta_i = \frac{\sin(\pi x_i)}{\pi x_i} + \zeta_i

where \zeta_i represents additive polynomial noise.
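A data-generation sketch in the same spirit (ours; the interval [-3, 3] and the default seed are illustrative assumptions, the paper's exact values having been lost, and sample_poly_noise comes from the section IV sketch):

```python
import numpy as np

def make_sinc_data(n_train, n_test, p, sigma, x_range=(-3.0, 3.0), rng=None):
    """Noisy sinc training set and noiseless, equally spaced test set.

    np.sinc is the normalised sinc, sin(pi x)/(pi x), matching the target.
    Requires sample_poly_noise from the earlier polynomial-noise sketch.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x_train = rng.uniform(x_range[0], x_range[1], n_train)
    z_train = np.sinc(x_train) + sample_poly_noise(p, sigma, n_train, rng)
    x_test = np.linspace(x_range[0], x_range[1], n_test)
    return x_train, z_train, x_test, np.sinc(x_test)
```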

Fig. 3. Comparative asymptotic fraction of support vectors versus ν of standard and quadric ν-SVR for polynomial noise of degrees 1 to 6 (one panel per degree). In all cases, the solid line represents the modified (quadric) ν-SVR and the dotted line the standard ν-SVR. Note that for all graphs the ν-scale differs for standard and quadric ν-SVRs: for standard ν-SVRs, the actual setting of ν is one tenth of that indicated on the x axis; for quadric ν-SVRs, ν is as indicated.

The test set consisted of pairs (x_i, z_i) where the x_i were equally spaced over the same interval and z_i was given by the noiseless sinc function. A number of trials were carried out for each result. Quadric ν-SVR, standard LS-SVR and weighted LS-SVR code for this experiment was written in C++, compiled using DJGPP and run on a Pentium III machine running Windows. Experiments using standard ν-SVR methods were done using LibSVM [16] (code available at http://). By default, we fix the parameter C and the noise variance σ². In all experiments we use the Gaussian RBF kernel function K(x, y) = e^{-\| x - y \|^2 / \sigma_{kernel}^2}, with σ_kernel fixed by default.

A. Additive Polynomial Noise

In our first experiment, we have used training data affected by polynomial noise of degrees 1 to 6. Plots of RMSE for standard ν-SVR, quadric ν-SVR, LS-SVR and weighted LS-SVR are shown in figure 4. With the exception of Laplacian noise (p = 1), these results resemble the theoretical predictions shown in figure 2. Note that the quadric ν-SVR outperforms both standard ν-SVR and (weighted and unweighted) LS-SVR in all cases except Laplacian noise (p = 1).

Whilst the general shape of the RMSE curves closely resembles the shape of the predicted efficiency curves, it is important to note that the sharp optimum predicted by the theory (see figure 2) is not present in the experimental results. Instead, the region of optimality is somewhat to the right of this and also somewhat blunter. From an application point of view, this is actually an advantage, as it means that results are far less sensitive to ν than expected. Indeed, from these results we can empirically say that there is a broad sweet spot for ν for each degree of polynomial noise and, roughly speaking, selecting a large ν will in all cases presented give results superior to the standard ν-SVR method and at least comparable to the LS-SVR methods (better for higher order noise).

In the case of Laplacian noise, the actual performance (in terms of RMSE) of both the quadric ν-SVR and the LS-SVRs is better than that of the standard ν-SVR, whereas theory would suggest that this should not be the case. We are unsure as to why this anomaly occurs.

Figure 5 shows the ratio of support vectors to training vectors for both standard and quadric ν-SVRs as a function of ν. These curves closely match the predictions given in figure 3. It is clear from figures 3 and 5 that, as expected, the number of support vectors found by the quadric ν-SVR is substantially larger than the number found by the standard ν-SVR. However, this is still substantially less than the number of support vectors found by the LS-SVR (which uses all training vectors as support vectors).

B. Parameter Variation with Additive Gaussian Noise

In our second experiment, we consider the performance of the standard and quadric ν-SVR in the presence of additive Gaussian noise as other parameters of the problem (particularly the noise variance σ², C and σ_kernel) are varied. In particular, we wish to see if the RMSE versus ν curve retains the same general form as these parameters are varied.

Fig. 4. Risk (RMSE) versus ν for sinc data with polynomial noise of degree p ∈ {1, 2, 3, 4, 5, 6}, working left to right, top to bottom (c.f. [24]). In all cases, σ, C and σ_kernel take their default values. Legend: quadric ν-SVR, standard ν-SVR, LS-SVR, weighted LS-SVR. Note that for all graphs the ν-scale differs for standard and modified ν-SVRs: for standard ν-SVRs, the actual setting of ν is one tenth of that indicated on the x axis; for modified ν-SVRs, ν is as indicated.

The top row of figure 6 shows the form of the RMSE versus ν curve for a range of noise levels. In all cases the form of the curves remains essentially unchanged. It will be noted, however, that for the lowest noise case the standard ν-SVR is able to out-perform the quadric ν-SVR. We are unsure as to why this occurs. Once again, selecting a large ν gives reasonable performance in all cases (c.f. Smola's result for standard ν-SVR [9], where the optimal area is an interval ending at ν ≈ 0.8).

The middle row of figure 6 shows the same curve with the noise variance σ² fixed, for three different values of C. Once again, the RMSE curve for the quadric ν-SVR is roughly as predicted, and a large ν gives reasonable results. In this case, the smallest value of C provides an anomalous result. Finally, in the bottom row of figure 6, we give the RMSE versus ν curves when the kernel parameter σ_kernel is varied. These results follow the same pattern as for variation of σ² and C.

It is interesting to note here that, while the RMSE versus ν curve obtained using the quadric ν-SVR fits the predicted curve more closely than does the same curve for the standard ν-SVR when parameters are chosen badly, this does not imply that the performance of the quadric ν-SVR will necessarily be better than that of the standard ν-SVR in this case. Indeed, the two cases where other SV parameters (i.e. not ν) have been chosen badly (i.e. the smallest C and the smallest σ_kernel in figure 6) are the two cases where the standard ν-SVR most outperforms the quadric ν-SVR. However, as one should continue to search until appropriate parameters are found, this should not be too much of a problem in practical situations.

VI. CONCLUSION

In this paper we have reviewed the standard SVR techniques of ǫ-SVR and ν-SVR. Motivated by the relative merits and demerits of these approaches, we then introduced a new, more general form of SVR (monomial ν-SVR). We proceeded to investigate a special case of monomial ν-SVR (quadric ν-SVR) in some detail, and showed that the dual form of the quadric ν-SVR problem is significantly simpler than the standard ν-SVR dual. We have compared the theoretical asymptotic efficiencies of our proposed formulation against other SVR methods under certain restrictive assumptions. Our investigation indicates that the quadric ν-SVR method not only shares the standard ν-SVR method's property of theoretical noise variance insensitivity, but is also more efficient in many cases, in particular when the training data is affected by higher order polynomial noise. These predictions have been experimentally tested, and comparisons have been made between the performances of quadric ν-SVR, standard ν-SVR, standard LS-SVR and weighted LS-SVR methods in the presence of higher order polynomial noise. Based on these results, we conclude that the theoretical predictions give a useful insight into the characteristics of the various methods. Both theoretical and experimental results indicate that the performance of the quadric ν-SVR method in many cases exceeds that of both standard ν-SVR and LS-SVR.

Fig. 5. Number of support vectors (out of N training vectors) versus ν for sinc data with polynomial noise of degree p ∈ {1, 2, 3, 4, 5, 6}, working left to right, top to bottom. In all cases, σ, C and σ_kernel take their default values. The dotted line gives results for the standard ν-SVR, while the solid line gives results for the quadric ν-SVR. Note that for all graphs the ν-scale differs for standard and modified ν-SVRs: for standard ν-SVRs, the actual setting of ν is one tenth of that indicated on the x axis; for modified ν-SVRs, ν is as indicated.

Fig. 6. Risk (RMSE) versus ν for sinc data with Gaussian noise (c.f. [24]). The top row shows performance for three different noise variances, from left to right, with C and σ_kernel at their default values. The middle row gives performance for three settings of C, from left to right, with σ and σ_kernel at their default values. Finally, the bottom row shows performance for three settings of σ_kernel, from left to right, with σ and C at their default values. Legend: quadric ν-SVR, standard ν-SVR, LS-SVR, weighted LS-SVR. Note that for all graphs the ν-scale differs for standard and quadric ν-SVRs: for standard ν-SVRs, the actual setting of ν is one tenth of that indicated on the x axis; for quadric ν-SVRs, ν is as indicated.

APPENDIX I
PROOFS OF THEOREMS 9 AND 10

In this appendix we give proofs for theorems 9 and 10, and a partial proof of conjecture 1. Before proceeding, however, some preliminary results are required. In what follows, B(x, y) is the beta function [25]:

B(x, y) = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x + y)}

It is straightforward to show that:

\frac{d}{d\Lambda} \daleth_m(p, \beta; \Lambda) = \begin{cases} -m \, \daleth_{m-1}(p, \beta; \Lambda) & \text{if } m > 0 \\ -e^{-\beta \Lambda^p} & \text{if } m = 0 \end{cases}

\daleth_m(p, \beta; 0) = \frac{1}{p} \beta^{-\frac{m+1}{p}} \Gamma\!\left( \frac{m+1}{p} \right)

Theorem 11: \lim_{x \to 0^+} x \Gamma(a x) = \frac{1}{a} for all a > 0.

Proof: Using the Euler limit form of Γ(x) [26], it can be seen that:

x \Gamma(a x) = \frac{x}{a x} \prod_{n=1}^{\infty} \frac{ \left( 1 + \frac{1}{n} \right)^{a x} }{ 1 + \frac{a x}{n} }

and therefore \lim_{x \to 0^+} x \Gamma(a x) = \frac{1}{a}.

Theorem 12: B(a + c, b) > B(a, b + c) for all a > b > 0, c > 0; and B(a + c, b) < B(a, b + c) for all 0 < a < b, c > 0.

Proof: From the definition of the beta function:

B(a + c, b) = \frac{\Gamma(a + c) \Gamma(b)}{\Gamma(a + b + c)}, \qquad B(a, b + c) = \frac{\Gamma(a) \Gamma(b + c)}{\Gamma(a + b + c)}

so B(a + c, b) > B(a, b + c) if and only if \ln \Gamma(a + c) - \ln \Gamma(a) > \ln \Gamma(b + c) - \ln \Gamma(b). But the gradient ψ(x) of \ln \Gamma(x) is a monotonically increasing function of x > 0, and so for all a > b > 0, c > 0:

\ln \Gamma(a + c) - \ln \Gamma(a) = \int_a^{a+c} \psi(t) \, dt > \int_b^{b+c} \psi(t) \, dt = \ln \Gamma(b + c) - \ln \Gamma(b)

Hence B(a + c, b) > B(a, b + c) for all a > b > 0, c > 0, which proves the first part of the theorem. The proof of the second part is essentially the same, except that, as a < b, the inequality is reversed.

We may now proceed to the proofs of theorems 9 and 10.

Proof of theorem 9: If p = q then, using the expression for e_{p,q} from section IV-A, the value of \daleth_m(p, c_2; 0) above, and the identity x \Gamma(x) = \Gamma(x + 1), direct evaluation at Λ = 0 gives (for p ≥ 2):

e_{p,p}(0) = \frac{ \left[ \frac{p-1}{p} \Gamma\!\left( \frac{p-1}{p} \right) \right]^2 }{ \Gamma\!\left( \frac{2p-1}{p} \right)^2 } = \frac{ \Gamma\!\left( \frac{2p-1}{p} \right)^2 }{ \Gamma\!\left( \frac{2p-1}{p} \right)^2 } = 1

and likewise e_{1,1}(0) = 1 in the Laplacian case. But e_{p,q} \le 1 by the Cramer-Rao bound [21], so optimal efficiency is achieved for \Lambda = \epsilon / \sigma = 0. Hence \epsilon_{opt} = 0.

Proof of theorem 10: Note that e_{p,q} is a continuously differentiable function of Λ and that the range of Λ is [0, ∞). If ǫ_opt = 0 (and hence Λ_opt = 0), then e_{p,q} must have a global maximum at Λ = 0, which implies that the (one-sided) gradient e'_{p,q} of e_{p,q} must be non-positive at Λ = 0. (As Λ = 0 is at the end of the range [0, ∞), it follows that if the gradient is negative at Λ = 0 then there will be a local maximum at that point.) Given this, to prove the theorem it is sufficient to prove that the gradient e'_{p,q}(0) is positive for all p > q: intuitively, if the gradient is positive at zero then increasing Λ must result in an increase in e_{p,q}, and so Λ = 0 cannot be a maximum of e_{p,q}, global or otherwise.

Using the derivative identities above, it is straightforward to show that e'_{p,q}(\Lambda) = d_{p,q}(\Lambda) \, \bar{e}_{p,q}(\Lambda), where:

\bar{e}_{p,q}(\Lambda) = \begin{cases} e^{-c_2 \Lambda^p} - 2 p c_2 \Lambda^{p-1} \, \daleth_0(p, c_2; \Lambda) & \text{if } q = 1 \\ \daleth_0(p, c_2; \Lambda) \, \daleth_1(p, c_2; \Lambda) - e^{-c_2 \Lambda^p} \, \daleth_2(p, c_2; \Lambda) & \text{if } q = 2 \\ (q-1) \, \daleth_{q-2}(p, c_2; \Lambda) \, \daleth_{2q-3}(p, c_2; \Lambda) - (q-2) \, \daleth_{q-3}(p, c_2; \Lambda) \, \daleth_{2q-2}(p, c_2; \Lambda) & \text{if } q \ge 3 \end{cases}

and:

d_{p,q}(\Lambda) = \begin{cases} \dfrac{ 2 c_1 \, e^{-2 c_2 \Lambda^p} }{ I_{std} \, \daleth_0(p, c_2; \Lambda)^2 } & \text{if } q = 1 \\[2ex] \dfrac{ 4 c_1 \, \daleth_0(p, c_2; \Lambda) }{ I_{std} \, \daleth_2(p, c_2; \Lambda)^2 } & \text{if } q = 2 \\[2ex] \dfrac{ 4 c_1 (q-1)^2 \, \daleth_{q-2}(p, c_2; \Lambda) }{ I_{std} \, \daleth_{2q-2}(p, c_2; \Lambda)^2 } & \text{if } q \ge 3 \end{cases}

is a positive smooth function of Λ ≥ 0 for all p, q \in \mathbb{Z}^+. Hence if \bar{e}_{p,q}(0) > 0 for all p > q then e'_{p,q}(0) > 0 for all p > q, which is sufficient to prove the theorem.

For q = 1 < p, the second term of \bar{e}_{p,1}(\Lambda) vanishes at Λ = 0 (as \Lambda^{p-1} \to 0 for p > 1), so \bar{e}_{p,1}(0) = 1 > 0, which proves the theorem in the case q = 1.

Consider the case p > q ≥ 2. Writing q = 1 + m, p = 1 + m + n, where m, n \in \mathbb{Z}^+, and using the results \daleth_m(p, c_2; 0) = \frac{1}{p} c_2^{-\frac{m+1}{p}} \Gamma( \frac{m+1}{p} ) and x \Gamma(x) = \Gamma(x + 1), it can be shown that, up to a positive factor:

\bar{e}_{p,q}(0) \propto \Gamma\!\left( \frac{p+q-1}{p} \right) \Gamma\!\left( \frac{2q-2}{p} \right) - \Gamma\!\left( \frac{p+q-2}{p} \right) \Gamma\!\left( \frac{2q-1}{p} \right) = \Gamma(a + b + c) \left[ B(a + c, b) - B(a, b + c) \right]

where a = \frac{p+q-2}{p} = \frac{2m+n}{m+n+1} > 0, b = \frac{2q-2}{p} = \frac{2m}{m+n+1} > 0 and c = \frac{1}{p} = \frac{1}{m+n+1} > 0. As n ≥ 1, we have a > b, so it follows from theorem 12 that B(a + c, b) > B(a, b + c), and hence \bar{e}_{p,q}(0) > 0, which proves the theorem for the case q ≥ 2.

Suppose now that q (and hence m) is treated as a real number such that m → 0^+ (so q → 1^+). The inequality B(a + c, b) > B(a, b + c) will still hold, as a > b for all m > 0, n ≥ 1. Furthermore, using theorem 11 to control the limiting behaviour of the \Gamma( \frac{2q-2}{p} ) factor as q → 1^+, one finds that \lim_{q \to 1^+} \bar{e}_{p,q}(0) > 0, consistent with the value \bar{e}_{p,1}(0) = 1 obtained above. Hence \bar{e}_{p,q}(0) > 0 for all p > q, and so e'_{p,q}(0) > 0 for all p > q, which proves the theorem.

Using an analogous method, we can also give a partial proof of conjecture 1:

Partial proof of conjecture 1: To prove that ǫ_opt = 0 it is necessary (although not sufficient) to prove that e_{p,q} has a local maximum at Λ = 0. By analogy with the proof of theorem 10, it suffices to show that the gradient e'_{p,q}(0) is non-positive for all p < q. From the proof of theorem 10, e'_{p,q} = d_{p,q} \bar{e}_{p,q} with d_{p,q} > 0 for all p, q \in \mathbb{Z}^+, so if \bar{e}_{p,q}(0) \le 0 then e_{p,q} has a local maximum at Λ = 0.

As 1 ≤ p < q, we have q ≥ 2. If q = 2 then p = 1, and direct evaluation gives \bar{e}_{1,2}(0) < 0, so e_{1,2} has a local maximum at Λ = 0. If q ≥ 3, writing q = 1 + m, p = 1 + m - n, where m \in \mathbb{Z}^+ and n \in \{1, 2, \ldots, m\}, it can be seen as before that, up to a positive factor:

\bar{e}_{p,q}(0) \propto \Gamma(a + b + c) \left[ B(a + c, b) - B(a, b + c) \right]

where a = \frac{2m-n}{m-n+1} > 0, b = \frac{2m}{m-n+1} > 0 and c = \frac{1}{m-n+1} > 0. As a < b, it follows from theorem 12 that B(a + c, b) < B(a, b + c), and hence \bar{e}_{p,q}(0) < 0. Therefore, in general, e_{p,q} has a local maximum at Λ = 0 for all p < q.

Now, if we could prove that this is a unique (and hence global) maximum, it would follow that \Lambda_{opt} = 0 and \epsilon_{opt} = 0, which would prove the conjecture. Alternatively, proving that e'_{p,q}(\Lambda) \le 0 for all Λ ≥ 0 would demonstrate that the maximum at Λ = 0 is global (although not necessarily unique) and thereby prove the conjecture. Unfortunately we have been unable to do either rigorously, and so the proof remains incomplete.

REFERENCES

[1] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola and V. Vapnik, "Support vector regression machines," in Advances in Neural Information Processing Systems, M. C. Mozer, M. I. Jordan and T. Petsche, Eds., vol. 9. The MIT Press, 1997.
[2] V. Vapnik, S. Golowich and A. Smola, Advances in Neural Information Processing Systems. MIT Press, 1997, vol. 9, ch. Support Vector Methods for Function Approximation, Regression Estimation, and Signal Processing.
[3] A. Smola and B. Schölkopf, "A tutorial on support vector regression," NeuroCOLT Technical Report Series, Tech. Rep., Oct 1998.
[4] C. Cortes and V. Vapnik, "Support vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[5] I. Guyon, B. Boser and V. Vapnik, "Automatic capacity tuning of very large VC-dimension classifiers," in Advances in Neural Information Processing Systems, S. Hanson, J. Cowan and C. Giles, Eds., vol. 5. Morgan Kaufmann, 1993.
[6] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Knowledge Discovery and Data Mining, vol. 2, pp. 121-167, 1998.
[7] B. Schölkopf, P. Bartlett, A. Smola and R. Williamson, "Shrinking the tube: A new support vector regression algorithm," Advances in Neural Information Processing Systems, vol. 11, 1999.
[8] B. Schölkopf and A. J. Smola, "New support vector algorithms," NeuroCOLT Technical Report Series, Tech. Rep., Nov 1998.
[9] A. Smola, N. Murata, B. Schölkopf and K.-R. Müller, "Asymptotically optimal choice of ǫ-loss for support vector machines," in Proceedings of the 8th International Conference on Artificial Neural Networks, Perspectives in Neural Computing, L. Niklasson, M. Boden and T. Ziemke, Eds. Springer Verlag, 1998.
[10] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor and J. Vandewalle, Least Squares Support Vector Machines. New Jersey: World Scientific Publishing, 2002.
[11] J. A. K. Suykens et al., "Weighted least squares support vector machines: robustness and sparse approximation," Neurocomputing, Special issue on fundamental and information processing aspects of neurocomputing, vol. 48, no. 1-4, pp. 85-105, Oct 2002.
[12] A. Shilton and M. Palaniswami, "A modified ν-SV method for simplified regression," in Proceedings of the International Conference on Intelligent Sensing and Information Processing, 2004.
[13] J. Mercer, "Functions of positive and negative type, and their connection with the theory of integral equations," Transactions of the Royal Society of London, vol. 209, no. A, 1909.
[14] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. Cambridge, Massachusetts: MIT Press, 2002.
[15] A. Smola, B. Schölkopf and K. Müller, "General cost functions for support vector regression," in Proceedings of the Ninth Australian Conf. on Neural Networks, T. Downs, M. Frean and M. Gallagher, Eds., 1998.
[16] C. Chang and C. Lin, LIBSVM: a library for support vector machines, 2001.
[17] M. Palaniswami and A. Shilton, "Adaptive support vector machines for regression," in Proceedings of the 9th International Conference on Neural Information Processing, 2002.
[18] R. Fletcher, Practical Methods of Optimisation. Chichester: John Wiley and Sons, 1981.
[19] K. Levenberg, "A method for the solution of certain problems in least squares," Quart. Appl. Math., vol. 2, pp. 164-168, 1944.
[20] D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, pp. 431-441, 1963.
[21] C. R. Rao, Linear Statistical Inference and its Applications. New York: Wiley, 1973.
[22] N. Murata, S. Yoshizawa and S.-I. Amari, "Network Information Criterion: determining the number of hidden units for an artificial neural network model," IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 865-872, November 1994. [Online]. Available: citeseer.nj.nec.com/murata9network.html
[23] S. G. Krantz, Handbook of Complex Variables. Birkhäuser, 1999.
[24] A. Chalimourda, B. Schölkopf and A. Smola, "Experimentally optimal ν in support vector regression for different noise models and parameter settings," Neural Networks, vol. 17, no. 1, pp. 127-141, 2004.
[25] A. Erdélyi, W. Magnus, F. Oberhettinger and F. G. Tricomi, Higher Transcendental Functions. New York: Krieger, 1981, vol. 1, ch. The Beta Function.
[26] H. Bateman, Higher Transcendental Functions. McGraw-Hill Book Company Inc., 1953, vol. 1.

Alistair Shilton received his combined B.Sc. / B.Eng. degree from the University of Melbourne, Melbourne, Australia, specialising in physics, applied mathematics and electronic engineering. He is currently pursuing the Ph.D. degree in electronic engineering at the University of Melbourne. His research interests include machine learning, specializing in support vector machines; signal processing; communications; and differential topology.

Daniel Lai received his B.Eng degree in Electrical and Computer Systems from Monash University, Melbourne, Australia. He is currently pursuing his Ph.D at Monash University. His research interests include optimization techniques for machine learning formulations, decomposition techniques, evolutionary optimization and dynamic programming.

M. Palaniswami received his B.E. (Hons) from the University of Madras, M.E. from the Indian Institute of Science, India, M.Eng.Sc. from the University of Melbourne and Ph.D. from the University of Newcastle, Australia, before rejoining the University of Melbourne. He has been serving the University of Melbourne for over 6 years. He has published more than 8 refereed papers, a large proportion of which appeared in prestigious IEEE journals and conferences. He was given a Foreign Specialist Award by the Ministry of Education, Japan, in recognition of his contributions to the field of machine learning. He has served as an associate editor for journals/transactions including IEEE Transactions on Neural Networks and Computational Intelligence for Finance. His research interests include SVMs, Sensors and Sensor Networks, Machine Learning, Neural Networks, Pattern Recognition, Signal Processing and Control. He is the Co-Director of the Centre of Expertise on Networked Decision & Sensor Systems. He is an associate editor for the International Journal of Computational Intelligence and Applications and serves on the editorial board of the ANZ Journal on Intelligent Information Processing Systems. He is also the Subject Editor for the International Journal on Distributed Sensor Networks.
