Robustness and Regularization of Support Vector Machines
Huan Xu (ECE, McGill University, Montreal, QC, Canada), Constantine Caramanis (ECE, The University of Texas at Austin, Austin, TX, USA), Shie Mannor (ECE, McGill University, Montreal, QC, Canada)

Abstract

We consider a robust classification problem and show that the standard regularized SVM is a special case of our formulation, providing an explicit link between regularization and robustness. At the same time, the physical connection between noise and robustness suggests the potential for a broad new family of robust classification algorithms. Finally, we show that robustness is a fundamental property of classification algorithms, by re-proving the consistency of support vector machines using only robustness arguments (instead of VC dimension or stability).

1 Introduction

Support Vector Machines (SVMs) [1, 2] find the hyperplane in the feature space that achieves maximum sample margin in the separable case. When the samples are not separable, a penalty term that approximates the total training error is considered [3]. It is well known that minimizing the training error itself can lead to poor classification performance on new unlabeled data because of, essentially, overfitting [4]. One of the most popular methods proposed to combat this problem is minimizing a combination of the training error and a regularization term. The resulting regularized classifier performs better on new data. This phenomenon is often interpreted from a statistical learning theory view: the regularization term restricts the complexity of the classifier, hence the deviation between the testing error and the training error is controlled [5, 6, 7]. We consider a different setup, assuming that some non-i.i.d. (potentially adversarial) disturbance is added to the training samples we observe. We follow a robust optimization approach (e.g., [8, 9]), minimizing the worst possible empirical error under such disturbances. The use of robust optimization in classification is not new (e.g., [10, 11]).
Past robust classification models consider only box-type uncertainty sets, which allow the possibility that the data have all been skewed in some non-neutral manner. We develop a new robust classification framework that treats non-box-type uncertainty sets, mitigates conservatism, and provides an explicit connection to regularization. Our contributions include the following. We show that the standard regularized SVM classifier is a special case of our robust classification, thus explicitly relating robustness and regularization. This provides an alternative explanation of the success of regularization, and also suggests new physically motivated ways to construct regularizers. Our robust SVM formulation permits finer control of the adversarial disturbance, restricting it to satisfy aggregate constraints across data points, therefore controlling the conservatism.
We show that the robustness perspective, stemming from a non-i.i.d. analysis, can be useful in a standard i.i.d. setup, by using it to give a new proof of consistency for standard SVM classification. This result implies that generalization ability is a direct result of robustness to local disturbances, and that we can construct learning algorithms that generalize well by robustifying non-consistent algorithms.

We explain here how the explicit equivalence of robustness and regularization we derive differs from previous work, and why it is interesting. While certain equivalence relationships between robustness and regularization have been established for problems outside the machine learning field ([8, 9]), those results do not directly apply to the classification problem. Research on classifier regularization mainly focuses on bounding the complexity of the function class (e.g., [5, 6, 7]). Meanwhile, research on robust classification has not attempted to relate robustness and regularization (e.g., [10, 11, 12]), in part due to the robustness formulations used there. The connection between robustness and regularization in the SVM context is important for the following reasons. It gives an alternative and potentially powerful explanation of the generalization ability of the regularization term. In the classical machine learning literature, the regularization term bounds the complexity of the class of classifiers. The robust view of regularization regards the testing samples as a perturbed copy of the training samples. We show that when the total perturbation is given or bounded, the regularization term bounds the gap between the classification errors of the SVM on these two sets of samples. In contrast to the standard PAC approach, this bound depends neither on how rich the class of candidate classifiers is, nor on an assumption that all samples are picked in an i.i.d. manner.
In addition, this suggests novel approaches to designing good classification algorithms, in particular, designing the regularization term. In Section 3 we use this new view to provide a novel proof of consistency that does not rely on VC-dimension or stability arguments. In the PAC structural risk minimization approach, regularization is chosen to minimize a bound on the generalization error based on the training error and a complexity term. This complexity term typically leads to overly emphasizing the regularizer, and indeed this approach is known to often be too pessimistic ([13]). The robust approach offers another avenue. Since both noise and robustness are physical processes, a close investigation of the application and noise characteristics at hand can provide insight into how to properly robustify, and therefore regularize, the classifier. For example, it is widely known that normalizing the samples so that the variance among all features is roughly the same often leads to good generalization performance. From the robustness perspective, this simply says that the noise is skewed (ellipsoidal) rather than spherical, and hence an appropriate robustification must be designed to fit the skew of the physical noise process.

Notation: Capital letters and boldface letters are used to denote matrices and column vectors, respectively. For a given norm $\|\cdot\|$, we use $\|\cdot\|^*$ to denote its dual norm. The set of integers from 1 to $n$ is denoted by $[1:n]$.

2 Robust Classification and Regularization

We consider the standard binary classification problem, where we are given a finite number of training samples $\{x_i, y_i\}_{i=1}^m \subseteq \mathbb{R}^n \times \{-1,+1\}$, and must find a linear classifier, specified by the function $h(x) = \mathrm{sgn}(\langle w, x\rangle + b)$. For the standard regularized classifier, the parameters $(w, b)$ are obtained by solving the following convex optimization problem:
$$\min_{w,b}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big]\Big\},$$
where $r(w,b)$ is a regularization term.
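To make the objective concrete, here is a minimal sketch (not the paper's code) that evaluates the regularized objective above for the hypothetical choice $r(w,b) = c\|w\|_2$; the toy dataset and the constant $c$ are illustrative only.

```python
import math

def hinge_objective(w, b, X, y, c):
    """Regularized objective r(w,b) + sum_i max(1 - y_i(<w, x_i> + b), 0),
    with r(w, b) taken to be c * ||w||_2 for concreteness."""
    reg = c * math.sqrt(sum(wj * wj for wj in w))
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += max(1.0 - margin, 0.0)
    return reg + loss

# Toy 2-D data (hypothetical): all four points have margin >= 1 under
# w = (1, 0), b = 0, so only the regularization term remains.
X = [(2.0, 0.0), (1.5, 0.5), (-2.0, 0.0), (-1.0, -0.5)]
y = [+1, +1, -1, -1]
print(hinge_objective((1.0, 0.0), 0.0, X, y, c=0.1))   # prints 0.1
```

With the all-zeros classifier every hinge term equals 1, so the same call returns 4.0: the objective trades margin violations against the norm of $w$.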
Previous robust classification work [10, 14] considers the classification problem where the inputs are subject to (unknown) disturbances $\delta = (\delta_1, \ldots, \delta_m)$ and essentially solves the following min-max problem:
$$\min_{w,b}\ \max_{\delta \in N_{\mathrm{box}}}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}, \qquad (1)$$
for a box-type uncertainty set $N_{\mathrm{box}}$. That is, let $N_i$ denote the projection of $N_{\mathrm{box}}$ onto the $\delta_i$ component; then $N_{\mathrm{box}} = N_1 \times \cdots \times N_m$. Effectively, this allows simultaneous worst-case disturbances across many samples, and leads to overly conservative solutions. The goal of this paper is to obtain a robust formulation where the disturbances $\{\delta_i\}$ may be meaningfully taken to be correlated, i.e., to solve for a non-box-type $N$:
$$\min_{w,b}\ \max_{\delta \in N}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}. \qquad (2)$$
We define explicitly the correlated disturbance (or uncertainty) which we study below.

Definition 1.
1. A set $N_0 \subseteq \mathbb{R}^n$ is called an Atomic Uncertainty Set if (I) $0 \in N_0$; (II) for any $w \in \mathbb{R}^n$: $\sup_{\delta \in N_0}[w^\top \delta] = \sup_{\delta \in N_0}[-w^\top \delta] < +\infty$.
2. Let $N_0$ be an atomic uncertainty set. A set $N \subseteq \mathbb{R}^{n \times m}$ is called a Concave Correlated Uncertainty Set (CCUS) of $N_0$ if $N^- \subseteq N \subseteq N^+$, where
$$N^- \triangleq \bigcup_{t=1}^m N_t; \qquad N_t \triangleq \{(\delta_1, \ldots, \delta_m) \mid \delta_t \in N_0;\ \delta_{i \neq t} = 0\};$$
$$N^+ \triangleq \Big\{(\alpha_1\delta_1, \ldots, \alpha_m\delta_m) \,\Big|\, \sum_{i=1}^m \alpha_i = 1;\ \alpha_i \geq 0,\ \delta_i \in N_0,\ \forall i\Big\}.$$

The concave correlated uncertainty definition models the case where the disturbances on each sample are treated identically, but their aggregate behavior across multiple samples is controlled. In particular, $\{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \leq c\}$ is a CCUS with $N_0 = \{\delta \mid \|\delta\| \leq c\}$.

Theorem 1. Assume $\{x_i, y_i\}_{i=1}^m$ are non-separable, $r(\cdot): \mathbb{R}^{n+1} \to \mathbb{R}$ is an arbitrary function, and $N$ is a CCUS with corresponding atomic uncertainty set $N_0$. Then the following two problems are equivalent:
$$\min_{w,b}\ \sup_{(\delta_1,\ldots,\delta_m) \in N}\Big\{ r(w,b) + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]\Big\}; \qquad (3)$$
$$\begin{aligned}\min_{w,b,\xi}:\ & r(w,b) + \sup_{\delta \in N_0}(w^\top \delta) + \sum_{i=1}^m \xi_i,\\ \text{s.t.}:\ & \xi_i \geq 1 - y_i(\langle w, x_i\rangle + b),\quad i = 1, \ldots, m;\\ & \xi_i \geq 0,\quad i = 1, \ldots, m.\end{aligned} \qquad (4)$$

Proof. We outline the proof. Let $v(w,b) \triangleq \sup_{\delta \in N_0}(w^\top \delta) + \sum_{i=1}^m \max[1 - y_i(\langle w, x_i\rangle + b),\, 0]$. It suffices to show that for any $(\hat w, \hat b) \in \mathbb{R}^{n+1}$,
$$v(\hat w, \hat b) \leq \sup_{(\delta_1,\ldots,\delta_m) \in N^-} \sum_{i=1}^m \max\big[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0\big]; \qquad (5)$$
$$\sup_{(\delta_1,\ldots,\delta_m) \in N^+} \sum_{i=1}^m \max\big[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0\big] \leq v(\hat w, \hat b). \qquad (6)$$
Since the samples $\{x_i, y_i\}_{i=1}^m$ are not separable, there exists $t \in [1:m]$ such that $y_t(\langle \hat w, x_t\rangle + \hat b) < 0$. With some algebra, we have $\sup_{(\delta_1,\ldots,\delta_m) \in N_t} \sum_{i=1}^m \max[1 - y_i(\langle \hat w, x_i - \delta_i\rangle + \hat b),\, 0] = v(\hat w, \hat b)$. Since $N_t \subseteq N$, Inequality (5) follows. Establishing Inequality (6) is standard.

The following corollary thus shows that the regularized SVM is a special case of robust classification.
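The term $\sup_{\delta \in N_0}(w^\top \delta)$ appearing in (4) is the support function of the atomic uncertainty set; for the Euclidean ball $N_0 = \{\delta \mid \|\delta\|_2 \leq c\}$ it evaluates to $c\|w\|_2$ by the dual-norm identity. The sketch below (with an arbitrary $w$ and $c$, chosen only for illustration) checks the closed form against random feasible directions.

```python
import math, random

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def sup_linear_over_ball(w, c):
    """sup over {||delta||_2 <= c} of <w, delta> equals c * ||w||_2
    (the Euclidean norm is self-dual), attained at delta = c*w/||w||_2."""
    return c * norm2(w)

# Empirical check: no random feasible delta beats the closed form.
random.seed(0)
w, c = (3.0, -4.0), 0.5
best = 0.0
for _ in range(10000):
    d = [random.uniform(-1, 1) for _ in w]
    scale = c * random.random() / (norm2(d) or 1.0)   # keeps ||d||_2 <= c
    d = [scale * x for x in d]
    best = max(best, sum(wi * di for wi, di in zip(w, d)))
print(sup_linear_over_ball(w, c), best)  # 2.5, and an empirical max below it
```

For a different norm on $N_0$, the same term becomes $c$ times the corresponding dual norm of $w$, which is exactly how the regularizer in Corollary 1 arises.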
Corollary 1. Let $T \triangleq \{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\| \leq c\}$. If the training samples $\{x_i, y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:
$$\min_{w,b}\ \max_{(\delta_1,\ldots,\delta_m) \in T} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big]; \qquad (7)$$
$$\min_{w,b}:\ c\|w\|^* + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big]. \qquad (8)$$
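Corollary 1 can be sanity-checked numerically at a fixed $(w, b)$ that misclassifies at least one training point: following the proof of Theorem 1, spending the entire budget $c$ on a single misclassified sample already attains the worst case, and the resulting loss matches the regularized objective (8) (using the Euclidean norm, which is self-dual). The data, $(w, b)$, and $c$ below are hypothetical, not from the paper.

```python
import math

def norm2(v):
    return math.sqrt(sum(x * x for x in v))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hinge_sum(w, b, X, y):
    return sum(max(1.0 - yi * (dot(w, xi) + b), 0.0) for xi, yi in zip(X, y))

def worst_case_hinge(w, b, X, y, c):
    """Worst case of (7) at a fixed (w, b), via the attack from the proof of
    Theorem 1: put the whole budget c on one sample t, with
    delta_t = c * y_t * w / ||w||_2, so that term grows by c * ||w||_2."""
    nw = norm2(w)
    best = hinge_sum(w, b, X, y)          # delta = 0 is feasible
    for t, yt in enumerate(y):
        dt = [c * yt * wj / nw for wj in w]
        Xp = [[xj - dj for xj, dj in zip(xi, dt)] if i == t else list(xi)
              for i, xi in enumerate(X)]
        best = max(best, hinge_sum(w, b, Xp, y))
    return best

# Hypothetical non-separable toy data: (0.5, 0) with label -1 is misclassified
# by the fixed classifier w = (1, 0), b = 0.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [+1, -1, -1]
w, b, c = (1.0, 0.0), 0.0, 0.3
lhs = worst_case_hinge(w, b, X, y, c)
rhs = c * norm2(w) + hinge_sum(w, b, X, y)   # objective (8) at the same (w, b)
print(lhs, rhs)   # the two agree at this (w, b)
```

Attacking a correctly classified point with margin above 1 leaves its hinge term at zero, which is why the adversary concentrates the budget on a misclassified sample.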
This corollary explains the widely known fact that the regularized classifier tends to be robust. It also suggests that the appropriate way to regularize should come from a disturbance-robustness perspective, e.g., by examining the variation of the data and solving the corresponding robust classification problem. Corollary 1 can easily be generalized to a kernelized version, i.e., a linear classifier in the feature space $\mathcal{H}$, defined as a Hilbert space containing the range of some feature mapping $\Phi(\cdot)$.

Corollary 2. Let $T_{\mathcal{H}} \triangleq \{(\delta_1, \ldots, \delta_m) \mid \sum_{i=1}^m \|\delta_i\|_{\mathcal{H}} \leq c\}$. If $\{\Phi(x_i), y_i\}_{i=1}^m$ are non-separable, then the following two optimization problems on $(w, b)$ are equivalent:
$$\min_{w,b}\ \max_{(\delta_1,\ldots,\delta_m) \in T_{\mathcal{H}}} \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i) - \delta_i\rangle + b),\, 0\big]; \qquad (9)$$
$$\min_{w,b}:\ c\|w\|_{\mathcal{H}} + \sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big]. \qquad (10)$$
Here, $\|\cdot\|_{\mathcal{H}}$ stands for the RKHS norm, which is self-dual. Corollary 2 essentially says that the standard kernelized SVM is implicitly a robust classifier with disturbance in the feature space. Disturbance in the feature space is less intuitive than disturbance in the sample space. However, the next lemma relates these two setups: under certain conditions, a classifier that achieves robustness in the feature space (the SVM, for example) also achieves robustness in the sample space. The proof is straightforward and omitted.

Lemma 1. Suppose there exist $\mathcal{X} \subseteq \mathbb{R}^n$, $\rho > 0$, and a continuous non-decreasing function $f: \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that
$$k(x,x) + k(x',x') - 2k(x,x') \leq f(\|x - x'\|_2^2), \quad \forall x, x' \in \mathcal{X},\ \|x - x'\|_2 \leq \rho.$$
Then
$$\|\Phi(\hat x + \delta) - \Phi(\hat x)\|_{\mathcal{H}} \leq \sqrt{f(\|\delta\|_2^2)}, \quad \forall \|\delta\|_2 \leq \rho,\ \hat x, \hat x + \delta \in \mathcal{X}.$$

3 Consistency of Regularization

In this section we explore a fundamental connection between learning and robustness, by using robustness properties to re-prove the statistical consistency of the linear classifier, and then of the kernelized SVM. Indeed, our proof mirrors the consistency proof found in [15], with the key difference that we replace the metric entropy, VC-dimension, and stability arguments used there with robustness.
In contrast to these standard techniques, which often work only for a limited range of algorithms, the robustness argument works for a much wider range of algorithms and allows a unified approach to showing consistency. Thus far we have considered the setup where the training samples are corrupted by certain set-inclusive disturbances; we now turn to the standard statistical learning setup. That is, let $\mathcal{X} \subseteq \mathbb{R}^n$ be bounded, and suppose the training samples $(x_i, y_i)$ are generated i.i.d. according to an unknown distribution $P$ supported on $\mathcal{X} \times \{-1,+1\}$. The next theorem shows that our robust classifier setup, and equivalently the regularized SVM, minimizes an asymptotic upper bound on the expected classification error and hinge loss.

Theorem 2. Denote $K \triangleq \max_{x \in \mathcal{X}} \sqrt{k(x,x)}$. Suppose there exist $\rho > 0$ and a continuous non-decreasing function $f: \mathbb{R}^+ \to \mathbb{R}^+$ satisfying $f(0) = 0$, such that
$$k(x,x) + k(x',x') - 2k(x,x') \leq f(\|x - x'\|_2^2), \quad \forall x, x' \in \mathcal{X},\ \|x - x'\|_2 \leq \rho.$$
Then there exists a random sequence $\{\gamma_{m,c}\}$ independent of $P$ such that, $\forall c > 0$, $\lim_{m \to \infty} \gamma_{m,c} = 0$ almost surely, and $\forall (w, b) \in \mathcal{H} \times \mathbb{R}$, the following bounds on the Bayes loss and the hinge loss hold:
$$\mathbb{E}_P\big(\mathbf{1}_{y \neq \mathrm{sgn}(\langle w, \Phi(x)\rangle + b)}\big) \leq \gamma_{m,c} + c\|w\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big];$$
$$\mathbb{E}_{(x,y)\sim P}\big(\max(1 - y(\langle w, \Phi(x)\rangle + b),\, 0)\big) \leq \gamma_{m,c}\big(1 + K\|w\|_{\mathcal{H}} + |b|\big) + c\|w\|_{\mathcal{H}} + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, \Phi(x_i)\rangle + b),\, 0\big].$$
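The smoothness condition shared by Lemma 1 and Theorem 2 can be checked for a concrete kernel via the kernel trick, since $\|\Phi(x) - \Phi(x')\|_{\mathcal{H}}^2 = k(x,x) + k(x',x') - 2k(x,x')$. Below is a sketch for the Gaussian RBF with bandwidth $\sigma = 1$, for which $f(t) = t$ works because $2 - 2e^{-t/2} \leq t$ for all $t \geq 0$; the point and perturbation are illustrative.

```python
import math

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def feature_dist(x, xp, k):
    """||Phi(x) - Phi(x')||_H computed via the kernel trick:
    sqrt(k(x,x) + k(x',x') - 2 k(x,x'))."""
    return math.sqrt(k(x, x) + k(xp, xp) - 2.0 * k(x, xp))

# For the RBF with sigma = 1 the condition holds with f(t) = t, so the
# feature-space displacement is bounded by the sample-space displacement.
x, delta = (0.3, -0.7), (0.05, 0.1)
xp = tuple(a + d for a, d in zip(x, delta))
d2 = sum(d * d for d in delta)
print(feature_dist(x, xp, rbf) <= math.sqrt(d2))  # True
```

The identity kernel discussed later fails exactly this check: its feature-space distance jumps to $\sqrt{2}$ for any nonzero perturbation, however small.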
Proof. Step 1: We first prove the theorem for the non-kernelized case. We fix $c > 0$ and drop the subscript $c$ of $\gamma_{m,c}$. A testing sample $(x', y')$ and a training sample $(x, y)$ are called a sample pair if $y = y'$ and $\|x - x'\|_2 \leq c$. We say that a set of training samples and a set of testing samples form $l$ pairings if there exist $l$ sample pairs with no data reused. Given $m$ training samples and $m$ testing samples, we use $M_m$ to denote the largest number of pairings.

Lemma 2. Given $c > 0$, $M_m/m \to 1$ almost surely as $m \to +\infty$, uniformly w.r.t. $P$.

Proof. We provide a proof sketch. Partition $\mathcal{X}$ into finitely many (say $T$) sets such that if a training sample and a testing sample fall into the same set, they form a pairing. This is doable due to the finite dimensionality of the sample space. Let $N_i^{tr}$ and $N_i^{te}$ be the number of training samples and testing samples falling in the $i$-th set, respectively. Thus, $(N_1^{tr}, \ldots, N_T^{tr})$ and $(N_1^{te}, \ldots, N_T^{te})$ are multinomially distributed random vectors following the same distribution. It is straightforward to show that $\sum_{i=1}^T |N_i^{tr} - N_i^{te}|/m \to 0$ with probability one (e.g., by the Bretagnolle-Huber-Carol inequality), and hence $M_m/m \to 1$ almost surely. Moreover, the convergence rate does not depend on $P$.

Now we proceed to prove the theorem. Let $N_0 = \{\delta \mid \|\delta\|_2 \leq c\}$. Given $m$ training samples and $m$ testing samples with $M_m$ sample pairs, for these paired samples, both the total testing error and the total testing hinge loss are upper bounded by
$$\max_{(\delta_1,\ldots,\delta_m) \in N_0 \times \cdots \times N_0} \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i - \delta_i\rangle + b),\, 0\big] \leq mc\|w\|_2 + \sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
Hence the average testing error (including unpaired samples) is upper bounded by
$$1 - M_m/m + c\|w\|_2 + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
Since $\max_{x \in \mathcal{X}}(1 - y(\langle w, x\rangle + b)) \leq 1 + |b| + K\|w\|_2$, the average hinge loss is upper bounded by
$$(1 - M_m/m)(1 + K\|w\|_2 + |b|) + c\|w\|_2 + \frac{1}{m}\sum_{i=1}^m \max\big[1 - y_i(\langle w, x_i\rangle + b),\, 0\big].$$
The proof follows since $M_m/m \to 1$ almost surely.

Step 2: Now we generalize the result to the kernelized version. Similarly, we lower-bound the number of sample pairs in the feature space.
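The pairing argument of Lemma 2 is easy to simulate: draw the train and test cell counts as two i.i.d. multinomial vectors and pair $\min(N_i^{tr}, N_i^{te})$ samples within each cell. This gives a lower bound on $M_m/m$ that approaches 1 as $m$ grows. The cell probabilities below are arbitrary, chosen only for illustration.

```python
import random

def pairing_fraction(m, probs, rng):
    """Lower bound on M_m / m from the partition argument of Lemma 2:
    draw m train and m test samples over the cells, then pair
    min(#train, #test) samples within each cell."""
    def counts():
        c = [0] * len(probs)
        for _ in range(m):
            r, acc = rng.random(), 0.0
            for i, p in enumerate(probs):
                acc += p
                if r < acc or i == len(probs) - 1:  # last cell absorbs rounding
                    c[i] += 1
                    break
        return c
    tr, te = counts(), counts()
    return sum(min(a, b) for a, b in zip(tr, te)) / m

rng = random.Random(1)
probs = [0.5, 0.3, 0.2]   # arbitrary cell probabilities
small = pairing_fraction(100, probs, rng)
large = pairing_fraction(100000, probs, rng)
print(small, large)   # the paired fraction approaches 1 as m grows
```

The unpaired fraction scales like $O(1/\sqrt{m})$ per cell, which matches the multinomial concentration invoked in the proof.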
The multinomial random variable argument used in the proof of Lemma 2 breaks down, due to the possible infinite dimensionality of the feature space. Nevertheless, we are able to lower bound the number of sample pairs in the feature space by the number of sample pairs in the sample space. Define $f^{-1}(\alpha) \triangleq \max\{\beta \geq 0 \mid f(\beta) \leq \alpha\}$. Since $f(\cdot)$ is continuous, $f^{-1}(\alpha) > 0$ for any $\alpha > 0$. By Lemma 1, if $x$ and $x'$ belong to a hyper-cube of side length $\min(\rho, \sqrt{f^{-1}(c^2)})/\sqrt{n}$ in the sample space, then $\|\Phi(x) - \Phi(x')\|_{\mathcal{H}} \leq c$. Hence the number of sample pairs in the feature space is lower bounded by the number of pairs of samples that fall into the same hyper-cube in the sample space. We can cover $\mathcal{X}$ with finitely many such hyper-cubes since $f^{-1}(c^2) > 0$. The rest of the proof is identical to Step 1.

Notice that the condition in Theorem 2 requires that the feature mapping is smooth and hence preserves locality of the disturbance, i.e., a small disturbance in the sample space guarantees that the corresponding disturbance in the feature space is also small. This condition is satisfied by most widely used kernels, e.g., homogeneous polynomial kernels and the Gaussian RBF. It is easy to construct non-smooth kernel functions which do not generalize well. For example, consider the kernel $k(x, x') = \mathbf{1}_{(x = x')}$, i.e., $k(x, x') = 1$ if $x = x'$, and zero otherwise. A standard RKHS-regularized SVM leads to the decision function $\mathrm{sign}(\sum_{i=1}^m \alpha_i k(x, x_i) + b)$, which equals $\mathrm{sign}(b)$ and provides no meaningful prediction if the testing sample $x$ is not one of the training samples. Hence as $m$ increases, the testing error remains as large as 50% regardless of the tradeoff parameter used in the algorithm, while the training error can be made arbitrarily small by fine-tuning the parameter.
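The failure mode of the identity kernel can be exhibited directly: on any input outside the training set, every kernel evaluation vanishes and the decision function collapses to $\mathrm{sign}(b)$. The training points and coefficients below are arbitrary placeholders, not the output of an actual SVM solver.

```python
def identity_kernel(x, xp):
    """k(x, x') = 1 if x == x', else 0: a valid but non-smooth kernel."""
    return 1.0 if x == xp else 0.0

def decision(x, alphas, X_train, b, k=identity_kernel):
    """Kernel decision function sign(sum_i alpha_i * k(x, x_i) + b)."""
    s = sum(a * k(x, xi) for a, xi in zip(alphas, X_train)) + b
    return 1 if s >= 0 else -1

X_train = [(0.0,), (1.0,), (2.0,)]
alphas = [5.0, -5.0, 5.0]   # arbitrary placeholder coefficients
b = -0.1
# On any unseen point every kernel term vanishes, so the output is sign(b):
print(decision((0.5,), alphas, X_train, b))   # prints -1
```

On a training point the matching kernel term fires and the prediction can be anything, which is how the training error is driven down while the test error stays at chance.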
Theorem 2 can further lead to the standard consistency notion, i.e., convergence to the Bayes risk ([15]). The proof in [15] involves a step showing that the expected hinge loss of the minimizer of the regularized training hinge loss concentrates around the empirical regularized hinge loss, accomplished using concentration inequalities derived from VC-dimension considerations and stability considerations. Instead, we can use our robustness-based results of Theorem 2 to replace these approaches. The detailed proof is omitted due to space limits.

4 Concluding Remarks

This work considers the relationship between robustness and regularized SVM classification. In particular, we prove that the standard norm-regularized SVM classifier is in fact the solution to a robust classification setup, and thus known results about regularized classifiers extend to robust classifiers. To the best of our knowledge, this is the first explicit such link between regularization and robustness in pattern classification. This link suggests that norm-based regularization essentially builds in a robustness to sample noise whose probability level sets are symmetric, and moreover have the structure of the unit ball with respect to the dual of the regularizing norm. It would be interesting to understand the performance gains possible when the noise does not have such characteristics, and the robust setup is used in place of regularization with an appropriately defined uncertainty set. In addition, based on the robustness interpretation of the regularization term, we re-proved the consistency of SVMs without direct appeal to notions of metric entropy, VC-dimension, or stability. Our proof suggests that the ability to handle disturbance is crucial for an algorithm to achieve good generalization ability.

References

[1] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, New York, NY.
[2] V. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24.
[3] K. Bennett and O. Mangasarian. Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1(1):23-34.
[4] V. Vapnik and A. Chervonenkis. The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3).
[5] A. Smola, B. Schölkopf, and K. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11.
[6] P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3.
[7] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1):1-50.
[8] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18.
[9] A. Ben-Tal and A. Nemirovski. Robust solutions of uncertain linear programs. Operations Research Letters, 25(1):1-13.
[10] P. Shivaswamy, C. Bhattacharyya, and A. Smola. Second order cone programming approaches for handling missing and uncertain data. Journal of Machine Learning Research, 7.
[11] G. Lanckriet, L. El Ghaoui, C. Bhattacharyya, and M. Jordan. A robust minimax approach to classification. Journal of Machine Learning Research, 3.
[12] T. Trafalis and R. Gilbert. Robust support vector machines for classification and computational issues. Optimization Methods and Software, 22(1).
[13] M. Kearns, Y. Mansour, A. Ng, and D. Ron. An experimental and theoretical comparison of model selection methods. Machine Learning, 27:7-50.
[14] C. Bhattacharyya, K. Pannagadatta, and A. Smola. A second order cone programming formulation for classifying missing data. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems (NIPS 17), Cambridge, MA. MIT Press.
[15] I. Steinwart. Consistency of support vector machines and other regularized kernel classifiers. IEEE Transactions on Information Theory, 51(1).
Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung
More informationBayes Decision Rule and Naïve Bayes Classifier
Bayes Decision Rule and Naïve Bayes Classifier Le Song Machine Learning I CSE 6740, Fall 2013 Gaussian Mixture odel A density odel p(x) ay be ulti-odal: odel it as a ixture of uni-odal distributions (e.g.
More informationOn Constant Power Water-filling
On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives
More informationRotational Prior Knowledge for SVMs
Rotational Prior Knowledge for SVMs Arkady Epshteyn and Gerald DeJong University of Illinois at Urbana-Chapaign, Urbana, IL 68, USA {aepshtey,dejong}@uiuc.edu Abstract. Incorporation of prior knowledge
More informationStability Bounds for Non-i.i.d. Processes
tability Bounds for Non-i.i.d. Processes Mehryar Mohri Courant Institute of Matheatical ciences and Google Research 25 Mercer treet New York, NY 002 ohri@cis.nyu.edu Afshin Rostaiadeh Departent of Coputer
More informationEstimating Parameters for a Gaussian pdf
Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3
More informationBounds on the Minimax Rate for Estimating a Prior over a VC Class from Independent Learning Tasks
Bounds on the Miniax Rate for Estiating a Prior over a VC Class fro Independent Learning Tasks Liu Yang Steve Hanneke Jaie Carbonell Deceber 01 CMU-ML-1-11 School of Coputer Science Carnegie Mellon University
More informationA general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics, EPFL, Lausanne Phone: Fax:
A general forulation of the cross-nested logit odel Michel Bierlaire, EPFL Conference paper STRC 2001 Session: Choices A general forulation of the cross-nested logit odel Michel Bierlaire, Dpt of Matheatics,
More informationProbability Distributions
Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples
More informationarxiv: v1 [cs.ds] 3 Feb 2014
arxiv:40.043v [cs.ds] 3 Feb 04 A Bound on the Expected Optiality of Rando Feasible Solutions to Cobinatorial Optiization Probles Evan A. Sultani The Johns Hopins University APL evan@sultani.co http://www.sultani.co/
More informationLeast Squares Fitting of Data
Least Squares Fitting of Data David Eberly, Geoetric Tools, Redond WA 98052 https://www.geoetrictools.co/ This work is licensed under the Creative Coons Attribution 4.0 International License. To view a
More informationIntroduction to Discrete Optimization
Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and
More informationPAC-Bayesian Learning of Linear Classifiers
Pascal Gerain Pascal.Gerain.@ulaval.ca Alexandre Lacasse Alexandre.Lacasse@ift.ulaval.ca François Laviolette Francois.Laviolette@ift.ulaval.ca Mario Marchand Mario.Marchand@ift.ulaval.ca Départeent d inforatique
More informationPolygonal Designs: Existence and Construction
Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G
More informationPrediction by random-walk perturbation
Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co
More informationKernel-Based Nonparametric Anomaly Detection
Kernel-Based Nonparaetric Anoaly Detection Shaofeng Zou Dept of EECS Syracuse University Eail: szou@syr.edu Yingbin Liang Dept of EECS Syracuse University Eail: yliang6@syr.edu H. Vincent Poor Dept of
More informationCS Lecture 13. More Maximum Likelihood
CS 6347 Lecture 13 More Maxiu Likelihood Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2 Maxiu Likelihood
More informationNon-Parametric Non-Line-of-Sight Identification 1
Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,
More informationBootstrapping Dependent Data
Bootstrapping Dependent Data One of the key issues confronting bootstrap resapling approxiations is how to deal with dependent data. Consider a sequence fx t g n t= of dependent rando variables. Clearly
More informationIntroduction to Machine Learning. Recitation 11
Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,
More informationLearnability and Stability in the General Learning Setting
Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu
More informationConvex Programming for Scheduling Unrelated Parallel Machines
Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly
More informationA MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION
A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University
More informationA Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay
A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer
More informationChapter 6 1-D Continuous Groups
Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:
More informationLearnability of Gaussians with flexible variances
Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007
More informationA Theoretical Framework for Deep Transfer Learning
A Theoretical Fraewor for Deep Transfer Learning Toer Galanti The School of Coputer Science Tel Aviv University toer22g@gail.co Lior Wolf The School of Coputer Science Tel Aviv University wolf@cs.tau.ac.il
More informationGrafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space
Journal of Machine Learning Research 3 (2003) 1333-1356 Subitted 5/02; Published 3/03 Grafting: Fast, Increental Feature Selection by Gradient Descent in Function Space Sion Perkins Space and Reote Sensing
More informationAccuracy at the Top. Abstract
Accuracy at the Top Stephen Boyd Stanford University Packard 264 Stanford, CA 94305 boyd@stanford.edu Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 ohri@cis.nyu.edu Corinna
More informationRecovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)
Recovering Data fro Underdeterined Quadratic Measureents (CS 229a Project: Final Writeup) Mahdi Soltanolkotabi Deceber 16, 2011 1 Introduction Data that arises fro engineering applications often contains
More informationLecture 9: Multi Kernel SVM
Lecture 9: Multi Kernel SVM Stéphane Canu stephane.canu@litislab.eu Sao Paulo 204 April 6, 204 Roadap Tuning the kernel: MKL The ultiple kernel proble Sparse kernel achines for regression: SVR SipleMKL:
More informationModel Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon
Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential
More informationUnderstanding Machine Learning Solution Manual
Understanding Machine Learning Solution Manual Written by Alon Gonen Edited by Dana Rubinstein Noveber 17, 2014 2 Gentle Start 1. Given S = ((x i, y i )), define the ultivariate polynoial p S (x) = i []:y
More informationSupplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data
Suppleentary to Learning Discriinative Bayesian Networks fro High-diensional Continuous Neuroiaging Data Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen Proposition. Given a sparse
More informationOn the Impact of Kernel Approximation on Learning Accuracy
On the Ipact of Kernel Approxiation on Learning Accuracy Corinna Cortes Mehryar Mohri Aeet Talwalkar Google Research New York, NY corinna@google.co Courant Institute and Google Research New York, NY ohri@cs.nyu.edu
More informationLecture 9 November 23, 2015
CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)
More informationNeural Network Learning as an Inverse Problem
Neural Network Learning as an Inverse Proble VĚRA KU RKOVÁ, Institute of Coputer Science, Acadey of Sciences of the Czech Republic, Pod Vodárenskou věží 2, 182 07 Prague 8, Czech Republic. Eail: vera@cs.cas.cz
More informationDistributed Subgradient Methods for Multi-agent Optimization
1 Distributed Subgradient Methods for Multi-agent Optiization Angelia Nedić and Asuan Ozdaglar October 29, 2007 Abstract We study a distributed coputation odel for optiizing a su of convex objective functions
More informationBest Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence
Best Ar Identification: A Unified Approach to Fixed Budget and Fixed Confidence Victor Gabillon Mohaad Ghavazadeh Alessandro Lazaric INRIA Lille - Nord Europe, Tea SequeL {victor.gabillon,ohaad.ghavazadeh,alessandro.lazaric}@inria.fr
More informationComputable Shell Decomposition Bounds
Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago
More informationError Exponents in Asynchronous Communication
IEEE International Syposiu on Inforation Theory Proceedings Error Exponents in Asynchronous Counication Da Wang EECS Dept., MIT Cabridge, MA, USA Eail: dawang@it.edu Venkat Chandar Lincoln Laboratory,
More informationLecture 21. Interior Point Methods Setup and Algorithm
Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and
More informationRandomized Recovery for Boolean Compressed Sensing
Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch
More informationMachine Learning: Fisher s Linear Discriminant. Lecture 05
Machine Learning: Fisher s Linear Discriinant Lecture 05 Razvan C. Bunescu chool of Electrical Engineering and Coputer cience bunescu@ohio.edu Lecture 05 upervised Learning ask learn an (unkon) function
More informationRandomized Accuracy-Aware Program Transformations For Efficient Approximate Computations
Randoized Accuracy-Aware Progra Transforations For Efficient Approxiate Coputations Zeyuan Allen Zhu Sasa Misailovic Jonathan A. Kelner Martin Rinard MIT CSAIL zeyuan@csail.it.edu isailo@it.edu kelner@it.edu
More informationBoosting with log-loss
Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the
More informationFairness via priority scheduling
Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation
More informationTight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Sample Guarantees
Tight Bounds for the Expected Risk of Linear Classifiers and PAC-Bayes Finite-Saple Guarantees Jean Honorio CSAIL, MIT Cabridge, MA 0239, USA jhonorio@csail.it.edu Toi Jaakkola CSAIL, MIT Cabridge, MA
More information3.3 Variational Characterization of Singular Values
3.3. Variational Characterization of Singular Values 61 3.3 Variational Characterization of Singular Values Since the singular values are square roots of the eigenvalues of the Heritian atrices A A and
More informationarxiv: v1 [cs.lg] 8 Jan 2019
Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v
More informationNyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,
More informationUsing EM To Estimate A Probablity Density With A Mixture Of Gaussians
Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points
More information