Duality, Geometry, and Support Vector Regression

Jinbo Bi and Kristin P. Bennett
Department of Mathematical Sciences
Rensselaer Polytechnic Institute, Troy, NY 12180

Abstract

We develop an intuitive geometric framework for support vector regression (SVR). By examining when ε-tubes exist, we show that SVR can be regarded as a classification problem in the dual space. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls, respectively, of the training data with the response variable shifted up and down by ε. A novel SVR model is proposed based on choosing the max-margin plane between the two shifted datasets. Maximizing the margin corresponds to shrinking the effective ε-tube. In the proposed approach the effects of the choices of all parameters become clear geometrically.

1 Introduction

Support Vector Machines (SVMs) [6] are a very robust methodology for inference with minimal parameter choices. Intuitive geometric formulations exist for the classification case addressing both the error metric and capacity control [1, 2]. For linearly separable classification, the primal SVM finds the separating plane with maximum hard margin between two sets. The equivalent dual SVM computes the closest points in the convex hulls of the data from each class. For the inseparable case, the primal SVM optimizes the soft margin of separation between the two classes. The corresponding dual SVM finds the closest points in the reduced convex hulls.

In this paper, we derive analogous arguments for SVM regression (SVR). We provide a geometric explanation for SVR with the ε-insensitive loss function. From the primal perspective, a linear function with no residuals greater than ε corresponds to an ε-tube constructed about the data in the space of the data attributes and the response variable [6] (see e.g. Figure 1(a)). The primary contribution of this work is a novel geometric interpretation of SVR from the dual perspective along with a mathematically rigorous derivation of the geometric concepts. In Section 2, for a fixed ε > 0 we examine the question "When does a perfect or hard ε-tube exist?".

With duality analysis, the existence of a hard ε-tube depends on the separability of two sets. The two sets consist of the training data augmented with the response variable shifted up and down by ε. In the dual space, regression becomes the classification problem of distinguishing between these two sets. The geometric formulations developed for the classification case [1] become applicable to the regression case. We call the resulting formulation convex SVR (C-SVR) since it is based on convex hulls of the augmented training data. Much like in SVM classification, to compute a hard ε-tube, C-SVR computes the nearest points in the convex hulls of the augmented classes. The corresponding maximum-margin (max-margin) planes define the effective ε-tube. The size of the margin determines how much the effective ε-tube shrinks. Similarly, to compute a soft ε-tube, reduced-convex SVR (RC-SVR) finds the closest points in the reduced convex hulls of the two augmented sets.

This paper introduces the geometrically intuitive RC-SVR formulation, which is a variation of the classic ε-SVR [6] and ν-SVR [5] models. If parameters are properly tuned, the methods perform similarly although not necessarily identically. RC-SVR eliminates the pesky parameter C used in ε-SVR and ν-SVR; the geometric role or interpretation of C is not known for these formulations. The geometric roles of the two parameters of RC-SVR, ν and ε, are very clear, facilitating model selection, especially for nonexperts. Like ν-SVR, RC-SVR shrinks the ε-tube and has a parameter ν controlling the robustness of the solution. The parameter ε acts as an upper bound on the size of the allowable ε-insensitive error. In addition, RC-SVR can be solved by fast and scalable nearest-point algorithms such as those used in [3] for SVM classification.

2 When does a hard ε-tube exist?

Figure 1: The (a) primal hard ε₀-tube, and dual cases: (b) dual strictly separable, ε > ε₀; (c) dual separable, ε = ε₀; and (d) dual inseparable, ε < ε₀.

SVR constructs a regression model that minimizes some empirical risk measure regularized to control capacity. Let x be the n predictor variables and y the dependent response variable. In [6], Vapnik proposed the ε-insensitive loss function L_ε(x, y, f) = |y − f(x)|_ε = max(0, |y − f(x)| − ε), in which an example is in error if its residual |y − f(x)| is greater than ε. Plotting the points in (x, y) space as in Figure 1(a), we see that for a perfect regression model the data fall in a hard ε-tube about the regression line. Let (Xᵢ, yᵢ) be an example, where i = 1, ..., m, Xᵢ is the i-th predictor vector, and yᵢ is its response. The training data are then (X, y), where Xᵢ is a row of the matrix X ∈ R^{m×n} and y ∈ R^m is the response. A hard ε-tube for a fixed ε > 0 is defined as a plane y = x'w + b satisfying

    −εe ≤ y − Xw − be ≤ εe,

where e is an m-dimensional vector of ones. When does a hard ε-tube exist? Clearly, for ε large enough such a tube always exists for finite data.
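The objects just defined are easy to set up concretely. The following minimal numpy sketch (an illustration added here, not code from the paper) builds the shifted sets D⁺ and D⁻ and checks whether a given plane y = x'w + b is a hard ε-tube for the training data; the toy data, tolerance, and "expected" outcomes are assumptions of the sketch.

```python
import numpy as np

def shifted_sets(X, y, eps):
    """Return D+ and D-: the data with the response shifted up / down by eps."""
    d_plus = np.column_stack([X, y + eps])    # rows are the points (X_i, y_i + eps)
    d_minus = np.column_stack([X, y - eps])   # rows are the points (X_i, y_i - eps)
    return d_plus, d_minus

def is_hard_tube(X, y, w, b, eps, tol=1e-12):
    """True if every residual of the plane y = x'w + b lies within eps."""
    residuals = y - X @ w - b
    return bool(np.all(np.abs(residuals) <= eps + tol))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 5.0, size=(20, 1))
    y = 0.8 * X[:, 0] + 0.3 + rng.normal(scale=0.05, size=20)   # nearly linear toy data
    print(is_hard_tube(X, y, np.array([0.8]), 0.3, eps=0.25))   # wide tube: expected True
    print(is_hard_tube(X, y, np.array([0.8]), 0.3, eps=0.01))   # narrow tube: expected False
```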

The smallest tube, the ε₀-tube, can be found by solving the linear program

    min_{w,b,ε} ε   s.t.   −εe ≤ y − Xw − be ≤ εe.                                      (1)

Note that the smallest tube is typically not the ε-SVR solution. Let D⁺ and D⁻ be formed by augmenting the data with the response variable respectively increased and decreased by ε, i.e. D⁺ = {(Xᵢ, yᵢ + ε), i = 1, ..., m} and D⁻ = {(Xᵢ, yᵢ − ε), i = 1, ..., m}. Consider the simple problem in Figure 1(a). For any fixed ε > 0, there are three possible cases: ε > ε₀, in which strict hard ε-tubes exist; ε = ε₀, in which only ε₀-tubes exist; and ε < ε₀, in which no hard ε-tubes exist. A strict hard ε-tube, with no points on the edges of the tube, exists only for ε > ε₀. Figure 1(b-d) illustrates what happens in the dual space for each case. The convex hulls of D⁺ and D⁻ are drawn, along with the max-margin plane in (b) and the supporting plane in (c) separating the convex hulls. Clearly, the existence of the tube is directly related to the separability of D⁺ and D⁻. If ε > ε₀ then a strict tube exists and the convex hulls of D⁺ and D⁻ are strictly separable. There are infinitely many possible ε-tubes when ε > ε₀. One can see that the max-margin plane separating D⁺ and D⁻ corresponds to one such tube; in fact this plane forms an ε̂-tube with ε > ε̂ ≥ ε₀. If ε = ε₀, then the convex hulls of D⁺ and D⁻ are separable but not strictly separable, and the plane that separates the two convex hulls forms the ε₀-tube. In the last case, where ε < ε₀, the two sets D⁺ and D⁻ intersect, and no ε-tubes or max-margin planes exist.

It is easy to show by construction that if a hard ε-tube exists for a given ε > 0 then the convex hulls of D⁺ and D⁻ are separable. If a hard ε-tube exists, then there exists (w, b) such that

    (y + εe) − Xw − be ≥ 0,   (y − εe) − Xw − be ≤ 0.                                   (2)

Any point in the convex hull of D⁺ has the form (X'u, (y + εe)'u) with e'u = 1, u ≥ 0, i.e. a convex combination of the points (Xᵢ, yᵢ + ε), i = 1, ..., m, and for it (y + εe)'u − (X'u)'w − b ≥ 0. Similarly, any point (X'v, (y − εe)'v) with e'v = 1, v ≥ 0 in the convex hull of D⁻ satisfies (y − εe)'v − (X'v)'w − b ≤ 0. Thus the plane y = x'w + b of the ε-tube separates the two convex hulls; note that the separating plane and the ε-tube plane are the same. If no separating plane exists, then there is no tube. Gale's theorem of the alternative can be used to precisely characterize the existence of an ε-tube.

Theorem 2.1 (Conditions for existence of a hard ε-tube). A hard ε-tube exists for a given ε > 0 if and only if the following system in (u, v) has no solution:

    X'u = X'v,   e'u = e'v = 1,   (y + εe)'u − (y − εe)'v < 0,   u ≥ 0, v ≥ 0.          (3)

Proof. A hard ε-tube exists if and only if system (2) has a solution. By Gale's theorem of the alternative [4], system (2) has a solution if and only if the following alternative system has no solution: X'u = X'v, e'u = e'v, (y + εe)'u − (y − εe)'v = −1, u ≥ 0, v ≥ 0. Rescaling by σ = e'u = e'v > 0 yields the result.

We use the following definitions of separation of convex sets. Let D⁺ and D⁻ be nonempty convex sets. A plane H = {x : x'w = α} is said to separate D⁺ and D⁻ if x'w ≥ α for each x ∈ D⁺ and x'w ≤ α for each x ∈ D⁻; H is said to strictly separate D⁺ and D⁻ if x'w ≥ α + δ for each x ∈ D⁺ and x'w ≤ α − δ for each x ∈ D⁻, where δ is a positive scalar. Gale's theorem states that the system Az ≤ c has a solution if and only if the alternative system A'p = 0, c'p = −1, p ≥ 0 has no solution, and vice versa.
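As a concrete illustration of the linear program (1), the sketch below computes the smallest tube width ε₀ and the corresponding plane with scipy; the variable stacking and solver choice are assumptions of this sketch, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def smallest_tube(X, y):
    """Solve LP (1): min eps s.t. -eps*e <= y - X w - b*e <= eps*e, over z = (w, b, eps)."""
    m, n = X.shape
    c = np.zeros(n + 2)
    c[-1] = 1.0                                    # objective: minimize eps
    ones = np.ones((m, 1))
    #  y - X w - b e <= eps e   <=>  -X w - b e - eps e <= -y
    upper = np.hstack([-X, -ones, -ones])
    # -(y - X w - b e) <= eps e <=>   X w + b e - eps e <=  y
    lower = np.hstack([X, ones, -ones])
    res = linprog(c,
                  A_ub=np.vstack([upper, lower]),
                  b_ub=np.concatenate([-y, y]),
                  bounds=[(None, None)] * (n + 1) + [(0, None)],  # w, b free; eps >= 0
                  method="highs")
    w, b, eps0 = res.x[:n], res.x[n], res.x[n + 1]
    return w, b, eps0
```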

Note that if ε ≥ ε₀ then (y + εe)'u − (y − εe)'v ≥ 0 for any (u, v) such that X'u = X'v, e'u = e'v = 1, u, v ≥ 0. So, as a consequence of this theorem, if D⁺ and D⁻ are separable, then a hard ε-tube exists.

3 Constructing the ε-tube

For any ε > ε₀, infinitely many ε-tubes exist. Which ε-tube should be used? The linear program (1) can be solved to find the smallest, ε₀-tube, but this corresponds to just doing empirical risk minimization and may result in poor generalization due to overfitting. We know capacity control or structural risk minimization is fundamental to the success of SVM classification and regression. We take our inspiration from SVM classification. In hard-margin SVM classification, the dual SVM formulation constructs the max-margin plane by finding the two nearest points in the convex hulls of the two classes; the max-margin plane is the plane bisecting these two points. We know that the existence of the tube is linked to the separability of the shifted sets D⁺ and D⁻. The key insight is that the regression problem can be regarded as a classification problem between D⁺ and D⁻. The two sets D⁺ and D⁻ defined as in Section 2 contain the same number of data points; the only significant difference occurs along the y dimension, as the response variable is shifted up by ε in D⁺ and down by ε in D⁻. For ε > ε₀, the max-margin separating plane corresponds to a hard ε̂-tube where ε > ε̂ ≥ ε₀. The resulting tube is smaller than ε but not necessarily the smallest tube. Figure 1(b) shows the max-margin plane found for ε > ε₀; Figure 1(a) shows that the corresponding linear regression function for this simple example turns out to be the ε₀-tube. As in classification, we have a hard and a soft ε-tube case. The soft ε-tube, with ε ≤ ε₀, is used to obtain good generalization when there are outliers.

3.1 The hard ε-tube case

We now apply the dual convex hull method to constructing the max-margin plane for our augmented sets D⁺ and D⁻, assuming they are strictly separable, i.e. ε > ε₀. The problem is illustrated in detail in Figure 2. The closest points of D⁺ and D⁻ can be found by solving the following dual C-SVR quadratic program:

    min_{u,v}  ½ ‖ (X'u, (y + εe)'u) − (X'v, (y − εe)'v) ‖²
    s.t.  e'u = 1,  e'v = 1,  u ≥ 0,  v ≥ 0.                                            (4)

Let the closest points in the convex hulls of D⁺ and D⁻ be c = (X'û, (y + εe)'û) and d = (X'v̂, (y − εe)'v̂) respectively. The max-margin separating plane bisects these two points. The normal (ŵ, δ̂) of the plane is the difference between them, i.e. ŵ = X'û − X'v̂ and δ̂ = (y + εe)'û − (y − εe)'v̂. The threshold b̂ is the distance from the origin to the point halfway between the two closest points along the normal:

    b̂ = ((X'û + X'v̂)/2)'ŵ + ((y'û + y'v̂)/2) δ̂.

The separating plane has the equation x'ŵ + y δ̂ − b̂ = 0; rescaling this plane yields the regression function.
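A direct way to experiment with the dual C-SVR problem (4) is to hand it to a generic QP modeler. The sketch below uses cvxpy, which is my choice of tool rather than the paper's (the paper points instead to specialized nearest-point algorithms [3]); it recovers the regression plane by the rescaling made precise in Theorem 3.1 below, and it assumes ε > ε₀ so that δ̂ > 0.

```python
import cvxpy as cp
import numpy as np

def c_svr(X, y, eps):
    """Dual C-SVR (4): closest points of conv(D+) and conv(D-); returns (w, b, u, v)."""
    m = X.shape[0]
    u = cp.Variable(m, nonneg=True)
    v = cp.Variable(m, nonneg=True)
    gap_x = X.T @ (u - v)                          # x-part of the difference of hull points
    gap_y = (y + eps) @ u - (y - eps) @ v          # y-part of the difference
    prob = cp.Problem(cp.Minimize(0.5 * (cp.sum_squares(gap_x) + cp.square(gap_y))),
                      [cp.sum(u) == 1, cp.sum(v) == 1])
    prob.solve()
    u_hat, v_hat = u.value, v.value
    w_hat = X.T @ (u_hat - v_hat)                               # normal, x-part
    delta_hat = (y + eps) @ u_hat - (y - eps) @ v_hat           # normal, y-part (> 0 here)
    b_hat = 0.5 * (X.T @ (u_hat + v_hat)) @ w_hat + 0.5 * (y @ (u_hat + v_hat)) * delta_hat
    # Rescale the plane x'w_hat + y*delta_hat = b_hat into y = x'w + b (Theorem 3.1).
    return -w_hat / delta_hat, b_hat / delta_hat, u_hat, v_hat
```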

Figure 2: The solution ε̂-tube found by C-SVR can have ε̂ < ε. Squares are the original data; dots are in D⁺; triangles are in D⁻; support vectors are circled.

Dual C-SVR (4) is posed in the dual space. The corresponding primal C-SVR is

    min_{w,δ,α,β}  ½(‖w‖² + δ²) − (α − β)
    s.t.  Xw + δ(y + εe) − αe ≥ 0,
          Xw + δ(y − εe) − βe ≤ 0.                                                      (5)

Dual C-SVR (4) can be derived by taking the Wolfe or Lagrangian dual [4] of primal C-SVR (5) and simplifying. We prove that the optimal plane from C-SVR bisects the ε̂-tube. The supporting planes for class D⁺ and class D⁻ determine the lower and upper edges of the ε̂-tube respectively, and the support vectors from D⁺ and D⁻ correspond to the points along the lower and upper edges of the ε̂-tube; see Figure 2.

Theorem 3.1 (C-SVR constructs an ε̂-tube). Let the max-margin plane obtained by C-SVR (4) be x'ŵ + y δ̂ − b̂ = 0, where ŵ = X'û − X'v̂, δ̂ = (y + εe)'û − (y − εe)'v̂, and b̂ = ((X'û + X'v̂)/2)'ŵ + ((y'û + y'v̂)/2) δ̂. If ε > ε₀, then the plane y = x'w + b corresponds to an ε̂-tube of the training data (Xᵢ, yᵢ), i = 1, ..., m, where w = −ŵ/δ̂, b = b̂/δ̂, and ε̂ = ε − (α̂ − β̂)/(2δ̂) < ε, with α̂ = ŵ'X'û + δ̂(y + εe)'û and β̂ = ŵ'X'v̂ + δ̂(y − εe)'v̂.

Proof. First, we show δ̂ > 0. By the Wolfe duality theorem [4], α̂ − β̂ > 0, since the objective value of (5) and the negative of the objective value of (4) are equal at optimality. By complementarity, the closest points lie exactly on the margin planes x'ŵ + y δ̂ − α̂ = 0 and x'ŵ + y δ̂ − β̂ = 0 respectively, so α̂ = ŵ'X'û + δ̂(y + εe)'û and β̂ = ŵ'X'v̂ + δ̂(y − εe)'v̂. Hence b̂ = (α̂ + β̂)/2, and ŵ, δ̂, α̂, β̂ satisfy the constraints of problem (5), i.e. Xŵ + δ̂(y + εe) − α̂e ≥ 0 and Xŵ + δ̂(y − εe) − β̂e ≤ 0. Subtracting the second inequality from the first gives 2δ̂ε − α̂ + β̂ ≥ 0, that is, δ̂ ≥ (α̂ − β̂)/(2ε) > 0 because ε > ε₀ ≥ 0. Rescale the constraints by −1/δ̂ < 0 and reverse the signs. Letting w = −ŵ/δ̂, the inequalities become y − Xw ≥ (α̂/δ̂ − ε)e and y − Xw ≤ (β̂/δ̂ + ε)e. Let b = b̂/δ̂; then α̂/δ̂ = b + (α̂ − β̂)/(2δ̂) and β̂/δ̂ = b − (α̂ − β̂)/(2δ̂). Substituting into the previous inequalities yields y − Xw ≥ be − (ε − (α̂ − β̂)/(2δ̂))e and y − Xw ≤ be + (ε − (α̂ − β̂)/(2δ̂))e. Denote ε̂ = ε − (α̂ − β̂)/(2δ̂) < ε. The inequalities become −ε̂e ≤ y − Xw − be ≤ ε̂e. Hence the plane y = x'w + b lies in the middle of an ε̂-tube with ε̂ < ε.
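To see Theorem 3.1 in action numerically, the helper below takes the optimal multipliers û, v̂ of (4) (for example from the c_svr sketch above), forms ε̂ = ε − (α̂ − β̂)/(2δ̂), and verifies that the rescaled plane keeps every residual inside ε̂. This is an illustrative check with an arbitrary tolerance, not code from the paper.

```python
import numpy as np

def effective_tube(X, y, eps, u_hat, v_hat, tol=1e-6):
    """Check Theorem 3.1: the plane recovered from (4) is a hard eps_hat-tube, eps_hat < eps."""
    w_hat = X.T @ (u_hat - v_hat)
    delta_hat = (y + eps) @ u_hat - (y - eps) @ v_hat
    alpha_hat = w_hat @ (X.T @ u_hat) + delta_hat * ((y + eps) @ u_hat)
    beta_hat = w_hat @ (X.T @ v_hat) + delta_hat * ((y - eps) @ v_hat)
    eps_hat = eps - (alpha_hat - beta_hat) / (2.0 * delta_hat)
    w = -w_hat / delta_hat
    b = (alpha_hat + beta_hat) / (2.0 * delta_hat)   # b = b_hat / delta_hat, since b_hat = (alpha_hat + beta_hat)/2
    assert np.max(np.abs(y - X @ w - b)) <= eps_hat + tol   # all residuals inside the eps_hat-tube
    assert eps_hat < eps
    return w, b, eps_hat
```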

3.2 The soft ε-tube case

For ε < ε₀, a hard ε-tube does not exist, and making ε large enough to fit the outliers may result in poor overall accuracy. In soft-margin classification, outliers are handled in the dual space by using reduced convex hulls. The same strategy works for soft ε-tubes; see Figure 3. Instead of taking the full convex hulls of D⁺ and D⁻, we reduce the convex hulls away from the difficult boundary cases. RC-SVR computes the closest points in the reduced convex hulls:

    min_{u,v}  ½ ‖ (X'u, (y + εe)'u) − (X'v, (y − εe)'v) ‖²
    s.t.  e'u = 1,  e'v = 1,  0 ≤ u ≤ De,  0 ≤ v ≤ De.                                  (6)

Figure 3: The soft ε̂-tube found by RC-SVR; left: dual space, right: primal space.

The parameter D determines the robustness of the solution by reducing the convex hulls: D limits the influence of any single point. As in ν-SVM, we can parameterize D by ν. Let D = 1/(νm), where m is the number of points. Figure 3 illustrates the case for m = 6 points, ν = 2/6, and D = 1/2. In this example, every point in the reduced convex hull must depend on at least two data points, since Σᵢ uᵢ = 1 and 0 ≤ uᵢ ≤ 1/2. In general, every point in the reduced convex hull can be written as a convex combination of at least 1/D = νm points. Since these points are exactly the support vectors and there are two reduced convex hulls, 2νm is a lower bound on the number of support vectors in RC-SVR. By choosing ν sufficiently large, the inseparable case with ε ≤ ε₀ is transformed into a separable case where once again our nearest-points-in-the-convex-hulls problem is well defined.

As in classification, the dual reduced convex hull problem corresponds to computing a soft ε-tube in the primal space. Consider the following soft-tube version of the primal C-SVR, which has RC-SVR (6) as its Wolfe dual (with D = C):

    min_{w,δ,α,β,ξ,η}  ½(‖w‖² + δ²) − (α − β) + C(e'ξ + e'η)
    s.t.  Xw + δ(y + εe) − αe + ξ ≥ 0,  ξ ≥ 0,
          Xw + δ(y − εe) − βe − η ≤ 0,  η ≥ 0.                                          (7)

The results of Theorem 3.1 can be easily extended to soft ε-tubes.

Theorem 3.2 (RC-SVR constructs a soft ε̂-tube). Let the soft max-margin plane obtained by RC-SVR (6) be x'ŵ + y δ̂ − b̂ = 0, where ŵ = X'û − X'v̂, δ̂ = (y + εe)'û − (y − εe)'v̂, and b̂ = ((X'û + X'v̂)/2)'ŵ + ((y'û + y'v̂)/2) δ̂. If 0 < ε ≤ ε₀, then the plane y = x'w + b corresponds to a soft ε̂-tube of the training data (Xᵢ, yᵢ), i = 1, ..., m, with ε̂ = ε − (α − β)/(2δ̂) < ε, i.e. an ε̂-tube of the reduced convex hulls of the training data, where w = −ŵ/δ̂, b = b̂/δ̂, α = ŵ'X'û + δ̂(y + εe)'û, and β = ŵ'X'v̂ + δ̂(y − εe)'v̂.

Notice that α and β determine the planes parallel to the regression plane that pass through the closest points in each reduced convex hull of the shifted data. In the inseparable case, these planes are parallel, but not necessarily identical, to the planes obtained by the primal RC-SVR (7).
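The sketch below states RC-SVR (6) with cvxpy. Relative to the hard-tube problem (4), only the box constraints u ≤ De, v ≤ De change, with D = 1/(νm); the code also counts support vectors for comparison with the 2νm lower bound. This is an illustration under the assumption that ν ∈ (0, 1] is large enough for the reduced hulls to be disjoint (so that δ̂ > 0); it is not the nearest-point solver the paper has in mind.

```python
import cvxpy as cp
import numpy as np

def rc_svr(X, y, eps, nu, sv_tol=1e-8):
    """RC-SVR (6): nearest points of the reduced convex hulls, with D = 1/(nu*m)."""
    m = X.shape[0]
    D = 1.0 / (nu * m)                     # upper bound on each multiplier; larger nu reduces more
    u = cp.Variable(m, nonneg=True)
    v = cp.Variable(m, nonneg=True)
    gap_x = X.T @ (u - v)
    gap_y = (y + eps) @ u - (y - eps) @ v
    prob = cp.Problem(cp.Minimize(0.5 * (cp.sum_squares(gap_x) + cp.square(gap_y))),
                      [cp.sum(u) == 1, cp.sum(v) == 1, u <= D, v <= D])
    prob.solve()
    u_hat, v_hat = u.value, v.value
    w_hat = X.T @ (u_hat - v_hat)
    delta_hat = (y + eps) @ u_hat - (y - eps) @ v_hat
    b_hat = 0.5 * (X.T @ (u_hat + v_hat)) @ w_hat + 0.5 * (y @ (u_hat + v_hat)) * delta_hat
    n_sv = int(np.sum(u_hat > sv_tol) + np.sum(v_hat > sv_tol))   # theory: n_sv >= 2*nu*m
    return -w_hat / delta_hat, b_hat / delta_hat, n_sv
```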

Nonlinear C-SVR and RC-SVR can be achieved by the usual kernel trick. Let Φ be a nonlinear mapping of x such that k(Xᵢ, Xⱼ) = Φ(Xᵢ)'Φ(Xⱼ). Applied to the mapped data, the objective function of C-SVR (4) and RC-SVR (6) becomes (up to an additive constant)

    ½ Σᵢ Σⱼ (uᵢ − vᵢ)(uⱼ − vⱼ)(Φ(Xᵢ)'Φ(Xⱼ) + yᵢyⱼ) + 2ε Σᵢ yᵢ(uᵢ − vᵢ)
      = ½ Σᵢ Σⱼ (uᵢ − vᵢ)(uⱼ − vⱼ)(k(Xᵢ, Xⱼ) + yᵢyⱼ) + 2ε Σᵢ yᵢ(uᵢ − vᵢ),               (8)

where all sums run from 1 to m. The final regression model after optimizing C-SVR or RC-SVR with kernels takes the form f(x) = Σᵢ (v̄ᵢ − ūᵢ) k(Xᵢ, x) + b, where ūᵢ = ûᵢ/δ̂, v̄ᵢ = v̂ᵢ/δ̂, δ̂ = y'(û − v̂) + 2ε, and the intercept is b = ((û + v̂)'K(û − v̂) + ((û + v̂)'y) δ̂) / (2δ̂), with Kᵢⱼ = k(Xᵢ, Xⱼ).
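For the kernel case everything can be phrased through the Gram matrix: the quadratic part of (8) is ½(u − v)'(K + yy')(u − v) and the linear part is 2ε y'(u − v). The sketch below solves the kernelized RC-SVR this way and returns a predictor of the form f(x) = Σᵢ (v̄ᵢ − ūᵢ)k(Xᵢ, x) + b. The Cholesky factorization with a small jitter is a device of this sketch to keep the objective in a solver-friendly form, and the caller is assumed to supply the kernel vector k(Xᵢ, x) at prediction time; none of this is the paper's implementation.

```python
import cvxpy as cp
import numpy as np

def rc_svr_kernel(K, y, eps, nu):
    """Kernel RC-SVR via objective (8); K is the Gram matrix K_ij = k(X_i, X_j)."""
    m = len(y)
    D = 1.0 / (nu * m)
    Q = K + np.outer(y, y)                          # kernel of the augmented (x, y) points
    L = np.linalg.cholesky(Q + 1e-10 * np.eye(m))   # small jitter for numerical safety
    u = cp.Variable(m, nonneg=True)
    v = cp.Variable(m, nonneg=True)
    obj = 0.5 * cp.sum_squares(L.T @ (u - v)) + 2.0 * eps * (y @ (u - v))
    prob = cp.Problem(cp.Minimize(obj),
                      [cp.sum(u) == 1, cp.sum(v) == 1, u <= D, v <= D])
    prob.solve()
    u_hat, v_hat = u.value, v.value
    delta_hat = y @ (u_hat - v_hat) + 2.0 * eps
    coef = (v_hat - u_hat) / delta_hat              # the weights (v_bar_i - u_bar_i)
    b = ((u_hat + v_hat) @ K @ (u_hat - v_hat)
         + (y @ (u_hat + v_hat)) * delta_hat) / (2.0 * delta_hat)

    def predict(k_x):
        """k_x is the vector of kernel values k(X_i, x) for a query point x."""
        return coef @ k_x + b

    return predict
```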

4 Computational Results

We illustrate the difference between RC-SVR and ε-SVR on a toy linear problem.³ Figure 4 depicts the functions constructed by RC-SVR and ε-SVR for different values of ε. For large ε, ε-SVR produces undesirable results, while RC-SVR constructs the same function for all sufficiently large ε. Too small an ε can result in poor generalization.

Figure 4: Regression lines from (a) ε-SVR and (b) RC-SVR for distinct values of ε.

In Table 1, we compare RC-SVR, ε-SVR, and ν-SVR on the Boston Housing problem. Following the experimental design in [5], we used an RBF kernel with σ = 3.9, C = 500m for ε-SVR and ν-SVR, and ε = 3.0 for RC-SVR. RC-SVR, ε-SVR, and ν-SVR are computationally similar for good parameter choices. In ε-SVR, ε is fixed; in RC-SVR, ε is the maximum allowable tube width. Choosing ε is critical for ε-SVR but less so for RC-SVR. Both RC-SVR and ν-SVR can shrink or grow the tube according to the desired robustness, but ν-SVR has no upper bound on ε.

Table 1: Testing results for Boston Housing. MSE = average of the mean squared errors on 25 test points over 100 trials; STD = standard deviation. Columns report (ν, MSE, STD) for RC-SVR, (ε, MSE, STD) for ε-SVR, and (ν, MSE, STD) for ν-SVR.

³ The toy data consist of six (x, y) points with x between 0 and 5; the CPLEX 6.6 optimization package was used.

5 Conclusion and Discussion

By examining when ε-tubes exist, we showed that in the dual space SVR can be regarded as a classification problem. Hard and soft ε-tubes are constructed by separating the convex or reduced convex hulls, respectively, of the training data with the response variable shifted up and down by ε. We proposed RC-SVR, based on choosing the soft max-margin plane between the two shifted datasets. Like ν-SVM, RC-SVR shrinks the ε-tube; the max-margin determines how much the tube can shrink. Domain knowledge can be incorporated into the RC-SVR parameters ε and ν. The parameter C of ν-SVM and ε-SVR has been eliminated. Computationally, no one method is superior for good parameter choices. RC-SVR alone has a geometrically intuitive framework that allows users to easily grasp the model and its parameters. Also, RC-SVR can be solved by fast nearest-point algorithms. Considering regression as a classification problem suggests other interesting SVR formulations. We can show that ε-SVR is equivalent to finding the closest points in a reduced convex hull problem for certain C, but the equivalent problem uses a different metric in the objective function than RC-SVR. Perhaps other variations would yield even better formulations.

Acknowledgments

Thanks to the referees and Bernhard Schölkopf for suggestions to improve this work. This work was supported by NSF IRI and NSF IIS grants.

References

[1] K. Bennett and E. Bredensteiner. Duality and geometry in SVM classifiers. In P. Langley, ed., Proceedings of the Seventeenth International Conference on Machine Learning, pp. 57-64, Morgan Kaufmann, San Francisco, 2000.
[2] D. Crisp and C. Burges. A geometric interpretation of ν-SVM classifiers. In S. Solla, T. Leen, and K. Müller, eds., Advances in Neural Information Processing Systems, Vol. 12, pp. 244-251, MIT Press, Cambridge, MA, 1999.
[3] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative nearest point algorithm for support vector machine classifier design. IEEE Transactions on Neural Networks, Vol. 11, pp. 124-136, 2000.
[4] O. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, 1994.
[5] B. Schölkopf, P. Bartlett, A. Smola, and R. Williamson. Shrinking the tube: a new support vector regression algorithm. In M. Kearns, S. Solla, and D. Cohn, eds., Advances in Neural Information Processing Systems, Vol. 11, MIT Press, Cambridge, MA, 1999.
[6] V. Vapnik. The Nature of Statistical Learning Theory. Wiley, New York, 1995.
