
Recent Researches in Computer Science

Support Vector Machine Classification of Uncertain and Imbalanced Data using Robust Optimization

RAGHAV PANT, THEODORE B. TRAFALIS, KASH BARKER
School of Industrial Engineering
University of Oklahoma
202 W. Boyd Street, Room 124, Norman, Oklahoma 73019
UNITED STATES
rpant@ou.edu, ttrafalis@ou.edu, kashbarker@ou.edu

Abstract: - In this paper, we develop a robust Support Vector Machine (SVM) scheme for classifying imbalanced and noisy data using the principles of Robust Optimization. Uncertainty is prevalent in almost all datasets and has not been addressed efficiently by most data mining techniques, as these are based on deterministic mathematical tools. Imbalanced datasets arise in the analysis of rare events, where the elements of the minority class become critical. Our method addresses both of these issues, which traditional SVM classification lacks. At present, we provide solutions for the linear classification of data having bounded uncertainties. The approach can be extended to non-linear classification schemes and to any type of uncertainty that is convex. Our results in predicting the importance of the minority class are better than those of traditional soft-margin SVM classification. Preliminary computational results are presented.

Key-Words: - Support Vector Machines, Robust Classification, Imbalance, Uncertainty, Noise

1 Introduction
Data classification is an important problem in the field of data mining. Classification refers to dividing the data into classes, where each class signifies certain properties that are common to a set of data. The simplest classification is the perfect linear separation of data into two classes. In practical problems, the data is not perfectly separable, which leads to errors in classification. Further complications in data analysis arise from the presence of uncertain and imbalanced datasets. Uncertainty is prevalent in almost all datasets and is not addressed efficiently by most data mining techniques, as these are based on deterministic mathematical tools [8]. Imbalanced datasets arise in the analysis of rare events, for which the elements of the minority class become critical. Most data mining techniques perform poorly in predicting the minority class for imbalanced data [3].

The solution of classification problems using Support Vector Machines (SVMs) [9,10] is widely prevalent in data mining. Soft-margin SVM classification methods can provide efficient solutions for non-separable data, but due to uncertainty the traditional soft-margin classification might not be completely effective in providing the optimal classification. In general, the term "noisy or uncertain data" in classification refers to those examples that do not lie on the intended side of the separation margin. In this paper we extend this term to include the uncertainty that manifests itself in every data point. Hence, our interpretation of noisy data includes a measurement error in each data point that has to be considered during classification, in addition to some data points that will not be classified correctly. As stated earlier, traditional methods adjust for the error relative to the maximum margin of classification, but do not consider individual data uncertainties. Bhattacharyya et al. [2] addressed such issues by developing Second Order Conic Programming formulations for Gaussian uncertainty in data, which bear resemblance to the Total SVM methods of Bi and Zhang [4] that provided formulations for bounded uncertainties.
While these methods were developed separately, they fall under the scheme of Robust SVM approaches detailed in the works of Trafalis et al. [5,6,7,8], which use concepts developed in the Robust Optimization (RO) literature [1]. RO schemes have been applied to SVMs to explore both data uncertainties and classification errors, and the results of the robust SVM are found to be better than those of the traditional SVM methods. The RO techniques for SVMs can be extended to the study of imbalanced datasets.

Examples of such datasets are tornado datasets, where out of a very large database only a few events are catastrophic and hence important for decision-making. RO methods can be used to control the perturbations in the data points, which in turn controls the misclassification errors that affect the ability to successfully predict the minority class. Such methods have been explored in this study. We look at linear separation classification for uncertain and imbalanced datasets. Using RO, we propose an optimization problem that provides a better solution to imbalanced, noisy data classification than the classical soft-margin SVM classification. The main contribution of this work is to formulate an SVM classification problem that solves for imbalanced and noisy data. We present the formulation and results for convex bounded uncertainty datasets.

This paper is organized as follows. Section 2 explains the problem statement for the classification of uncertain data, and we present the soft-margin SVM classification problem statement. Section 3 discusses the development of the robust classification problem for noisy data, wherein the robust counterpart of the classical problem is developed and the final optimization problem for noisy data is presented. In Section 4, we extend the robust formulation to include imbalanced data analysis and present the final optimization problem that handles data with Euclidean norm bounded uncertainty. In Section 5 we perform preliminary numerical analyses on a few toy datasets and compare the performance of our method with the classical SVM-based svmtrain function of MATLAB. Section 6 discusses the conclusions and future development of this work.

2 Problem Formulation
For classification, we are given an input matrix $X \in \mathbb{R}^{m \times n}$ of training samples. Each data point $x_i \in X$ is an $n$-element row vector, and there are $m$ such data points. Further, we are given that each data point belongs to one of two classes, indicated by $y_i \in \{+1, -1\}$. The pair $\{X, y\}$ is referred to as the training dataset. Uncertainty is incorporated into the analysis by assuming that the input matrix is given in terms of a nominal value and a perturbation, that is,

$X = \bar{X} + \Delta, \quad \Delta = [\delta_1, \delta_2, \ldots, \delta_m],$  (1)

where $\bar{X}$ is the nominal value of the data that is free from uncertainties and $\Delta$ is the uncertainty set defining the perturbations in the data, in which $\delta_i$ is the $n$-element row vector of uncertainty associated with each data point $x_i$. Suitable definitions of $\Delta$ have to be provided for obtaining feasible solutions to the classification problems. As a rule, it is assumed that $\Delta$ must belong to a convex set that gives computationally tractable solutions.

The problem we aim to solve is the two-class classification problem in which the amount of data belonging to one class ($y_i = +1$) is very small compared to the other class ($y_i = -1$). Further, the data has uncertainties, due to which there will be errors in classification. In the subsections below we state our classification problem and suggest the formulation to handle uncertainties. We further develop the classification scheme to address both problems of data imbalance and noise.

2.1 Soft-margin Classification for Nonseparable Data
For perfect linear separation of data, the classification rule is defined in terms of a separating hyperplane and is given as

$y_i(\langle w, x_i \rangle + b) \geq 1, \quad i = 1, 2, \ldots, m,$  (2)

where $w \in \mathbb{R}^n$ is a weight vector perpendicular to the hyperplane and $b$ is a scalar determining the offset of the hyperplane from the origin. The term $\langle w, x_i \rangle$ signifies the vector dot product between the elements of $w$ and $x_i$.
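To make the data model in (1) concrete, the following NumPy sketch builds observed points as nominal values plus row perturbations. It is illustrative only; the variable names and the toy labeling rule are ours, not the paper's.

```python
import numpy as np

# Illustrative sketch of the data model in (1): X = Xbar + Delta.
rng = np.random.default_rng(1)
m, n = 6, 2
Xbar = rng.random((m, n))                    # nominal data, free from uncertainty
Delta = 0.05 * rng.standard_normal((m, n))   # perturbation rows delta_1, ..., delta_m
X = Xbar + Delta                             # observed uncertain data
y = np.where(X @ np.ones(n) > 1.0, 1, -1)    # two-class labels y_i in {+1, -1}
```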
The aim of SVM classification is to find the maximum margin of separation between the data points, which is stated through the optimization problem

$\min_{w,b} \; \frac{1}{2}\|w\|_2^2$  (3)
$\text{s.t.} \; y_i(\langle w, x_i \rangle + b) \geq 1, \quad i = 1, 2, \ldots, m.$

The classification of data is generally not exact because in actual situations the data is too complex to be perfectly linearly separable. Hence, a term for the error in classification has to be incorporated into the analysis. Due to errors in classification, the traditional hard-margin classification constraints are modified as

$y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i,$  (4)
$\xi_i \geq 0, \quad i = 1, 2, \ldots, m,$

where $\xi_i$ is the scalar error in the classification of the $i$-th data point $x_i$. The aim of an efficient classifier is to minimize the errors in classification. This is accomplished by minimizing the sum of all the errors in classification, which is referred to as the realized hinge loss function [1]. Hence, the optimization problem to be solved for controlling the errors becomes

$\min \; \sum_{i=1}^{m} \xi_i$  (5)
$\text{s.t.} \; y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, m.$

The classical soft-margin SVM classification problem combines the objectives of the optimization problems given by equations (3) to (5). Hence, the classical problem formulation becomes

$\min \; \lambda \|w\|_2^2 + \sum_{i=1}^{m} \xi_i$  (6)
$\text{s.t.} \; y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, m,$

where $\lambda$ is the regularization parameter.
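As a reference point for the robust variants developed next, here is a minimal sketch of the classical soft-margin problem (6) using the open-source CVXPY library. The paper's own experiments used MATLAB; the function name, default regularization value, and solver choice below are our assumptions, not the authors' code.

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm(X, y, lam=0.1):
    # Classical soft-margin SVM (6):
    #   min  lam * ||w||_2^2 + sum_i xi_i
    #   s.t. y_i(<w, x_i> + b) >= 1 - xi_i,  xi_i >= 0
    m, n = X.shape
    w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(m)
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
    cp.Problem(cp.Minimize(lam * cp.sum_squares(w) + cp.sum(xi)), constraints).solve()
    return w.value, b.value
```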

3 Robust Formulation of the Soft-margin Classification for Noisy Data
From the formulation of (6) it can be seen that the maximal margin of separation is influenced by the realizations of the data points. In particular, there are a few data points, called the support vectors, that determine the separation margin. If the data are uncertain, then the region of influence of the support vectors varies and we obtain multiple solutions for the maximal margin. Hence, it is intuitive to look at the worst-case realizations of the data points, as these give the extreme separation margin. RO methods are therefore useful tools for finding the best separation margin under uncertainty.

Solving the optimization problem of (6) can become computationally intensive, especially if each data point has a unique uncertainty. Moreover, it is not possible to obtain computationally tractable solutions unless certain rules for the uncertainty set are specified. The inequalities of (6) allow us to rewrite the hinge loss function in terms of the sample data, labels, weight vector, and offset as

$\xi_i = [1 - y_i(\langle w, x_i \rangle + b)]_+,$  (7)

where $[1 - y_i(\langle w, x_i \rangle + b)]_+ = \max\{0, 1 - y_i(\langle w, x_i \rangle + b)\}$. This formulation is similar to an indicator function and has a convex upper bound. Hence, we can restate the soft-margin classification problem (6) as the unconstrained optimization problem

$\min \; \lambda \|w\|_2^2 + \sum_{i=1}^{m} [1 - y_i(\langle w, x_i \rangle + b)]_+.$  (8)

This optimization problem motivates the formulations that lead to a robust analysis of data with noise. From (8) we can see that the new soft-margin classification formulation is easier to solve for noisy data, as we get a computationally tractable formulation. For a robust analysis, we minimize the worst-case hinge loss function due to uncertain data. The robust counterpart of (8) becomes

$\min \; \lambda \|w\|_2^2 + \max_{x_i \in X} \sum_{i=1}^{m} [1 - y_i(\langle w, x_i \rangle + b)]_+.$  (9)

The significance of robust optimization principles in solving classification problems lies in the fact that they solve for the extreme case of the data uncertainty. The geometrical representation of data points with spherical uncertainty is shown in Fig. 1, where for each data point the centre of the sphere represents its nominal value and the radius of the sphere represents the uncertainty. In the classical formulation the support vectors correspond to the centres of the data points, which does not provide much scope for change. As seen in Fig. 1, using the robust optimization methods the support vectors would be tangential to the spherical boundary of the perturbed data. Thus the solutions become sensitive to the radius of each sphere, which can also result in more points becoming support vectors. For imbalanced datasets this becomes important, as it allows us to have more points as support vectors for the minority class, which would include some points that would otherwise be treated as outliers.

Figure 1. Comparison of classical SVM with robust SVM
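For reference, the unconstrained objective (8) can be evaluated directly; the sketch below is a plain NumPy restatement of it (the function and variable names are ours).

```python
import numpy as np

def hinge_objective(w, b, X, y, lam=0.1):
    # Objective (8): lam * ||w||_2^2 + sum_i [1 - y_i(<w, x_i> + b)]_+
    margins = y * (X @ w + b)
    return lam * float(w @ w) + np.maximum(0.0, 1.0 - margins).sum()
```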

3.1 Incorporating Uncertainties in the Soft-margin Classification
We revisit the formulation of the realized hinge loss function in (7) to incorporate uncertainties into the formulation. Splitting the $x_i$'s into their nominal values, $\bar{x}_i$, and uncertainties, $\delta_i$, the new formulation for (7) becomes

$\xi_i = [1 - y_i(\langle w, \bar{x}_i \rangle + b) - y_i \langle w, \delta_i \rangle]_+.$  (10)

Hence, in the robust counterpart of the realized hinge loss function we are concerned with finding the worst-case realization of the uncertainty, which does not involve the nominal data. In our robust formulation of (9) the worst-case hinge loss function is therefore dependent upon the worst-case realizations of the data perturbation, which means the second term of (9) can be expressed as

$\max_{\delta_i \in \Delta} \sum_{i=1}^{m} [1 - y_i(\langle w, \bar{x}_i \rangle + b) - y_i \langle w, \delta_i \rangle]_+ = \sum_{i=1}^{m} \max_{\delta_i \in \Delta} [1 - y_i(\langle w, \bar{x}_i \rangle + b) - y_i \langle w, \delta_i \rangle]_+.$  (11)

In (11), we can take the maximum inside the summation because of the convexity of the hinge loss function. One way of specifying the worst-case realization of the uncertainty is through the Cauchy-Schwarz inequality, which provides norm bounds on the data perturbations. For most data, assuming norm upper bounds for the perturbations is a justifiable assumption, and it leads to convex formulations. The Cauchy-Schwarz bounds for $y_i \langle w, \delta_i \rangle$ in (11) are given as

$|y_i \langle w, \delta_i \rangle| \leq \|\delta_i\|_p \|w\|_q, \quad \frac{1}{p} + \frac{1}{q} = 1.$  (12)

Hence, from (12) we obtain the condition

$-\|\delta_i\|_p \|w\|_q \leq y_i \langle w, \delta_i \rangle \leq \|\delta_i\|_p \|w\|_q,$  (13)

which leads to the following worst-case robust formulation for (11):

$\max_{\delta_i \in \Delta} [1 - y_i(\langle w, \bar{x}_i \rangle + b) - y_i \langle w, \delta_i \rangle]_+ = [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+.$  (14)

Combining (9) and (14) gives us the robust SVM for solving the classification problem when we have data uncertainty or noise. We restate our final robust problem, which we will develop further to handle imbalanced data:

$\min \; \lambda \|w\|_2^2 + \sum_{i=1}^{m} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+.$  (15)

4 Handling Imbalanced Data using Robust SVM Classification
For handling data imbalance, the training data sample can be partitioned into examples with positive and negative labels, respectively. We are interested in solving the following robust optimization problem:

$\min \; \lambda \|w\|_2^2 + \sum_{y_i=+1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+ + \sum_{y_i=-1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+.$  (16)

Separating the data into positive and negative samples helps us control the perturbation on the samples, which can be critical for including the important minority class samples in the classification. The unconstrained optimization problem can be converted into a constrained optimization problem by assuming that the hinge loss functions are less than some maximum values. Mathematically this is expressed as

$\min \; \lambda \|w\|_2^2 + \tau_{+1} + \tau_{-1}$  (17)
$\text{s.t.} \; \sum_{y_i=+1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+ \leq \tau_{+1},$
$\quad \sum_{y_i=-1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \|\delta_i\|_p \|w\|_q]_+ \leq \tau_{-1},$

where $\tau_{+1}$ and $\tau_{-1}$ are, respectively, the values that bound the sum of errors in the positive and negative samples. A common form of uncertainty bound used for data perturbations is the 2-norm or Euclidean uncertainty bound. For most kinds of data it is assumed that either the entire uncertainty in the dataset has a Euclidean bound ($\|\Delta\|_2 \leq r$, $r \geq 0$), or each data point is contained in an uncertainty sphere of fixed radius $\rho$ ($\|\delta_i\|_2 \leq \rho$). Depending upon the data imbalance, we can impose different bounds on the positive and negative samples, respectively, in order to improve our classification. The robust formulation, when uncertainty exists in each data point, becomes a conic programming problem, which is stated as

$\min \; \lambda \|w\|_2^2 + \tau_{+1} + \tau_{-1}$  (18)
$\text{s.t.} \; \sum_{y_i=+1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \rho_{+1} \|w\|_2]_+ \leq \tau_{+1},$
$\quad \sum_{y_i=-1} [1 - y_i(\langle w, \bar{x}_i \rangle + b) + \rho_{-1} \|w\|_2]_+ \leq \tau_{-1}.$
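A hedged CVXPY sketch of the conic problem (18) under per-point spherical uncertainty follows; rho_pos and rho_neg play the roles of $\rho_{+1}$ and $\rho_{-1}$, and everything else (names, solver defaults) is our assumption rather than the authors' MATLAB implementation.

```python
import cvxpy as cp
import numpy as np

def robust_imbalanced_svm(Xbar, y, lam=0.1, rho_pos=0.01, rho_neg=0.01):
    # Robust conic problem (18): the worst-case hinge losses of the positive
    # and negative samples are bounded by tau_pos and tau_neg respectively.
    m, n = Xbar.shape
    w, b = cp.Variable(n), cp.Variable()
    tau_pos, tau_neg = cp.Variable(), cp.Variable()
    Xp, Xn = Xbar[y == 1], Xbar[y == -1]   # split nominal data by label
    # Worst-case hinge loss per point: [1 - y_i(<w, xbar_i> + b) + rho*||w||_2]_+
    loss_pos = cp.sum(cp.pos(1 - (Xp @ w + b) + rho_pos * cp.norm(w, 2)))
    loss_neg = cp.sum(cp.pos(1 + (Xn @ w + b) + rho_neg * cp.norm(w, 2)))
    cp.Problem(cp.Minimize(lam * cp.sum_squares(w) + tau_pos + tau_neg),
               [loss_pos <= tau_pos, loss_neg <= tau_neg]).solve()
    return w.value, b.value
```

Choosing rho_pos larger than rho_neg inflates the uncertainty spheres of the minority class, which is one way to realize the class-dependent bounds the formulation allows.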

5 Data Analysis
To check the effectiveness of the robust analysis scheme developed in this study, we generate some imbalanced data and compare the performance of our formulation with the soft-margin SVM classification of MATLAB. We generate a 400x2 matrix of random numbers, where each number $x_{ij}$ belongs to the set $x_{ij} = \{0.5 + \text{rand}(0,1)\}$. Each data point in the analysis is the 2x1 vector $x_i = [x_{i1}; x_{i2}]$. The Euclidean norm of each data point is calculated; if it is greater than one, a label of $y_i = -1$ is assigned to it, otherwise it belongs to the class $y_i = +1$. This creates an imbalanced dataset in which the positive labels belong to the minority class. Next we add an uncertainty $\delta_i$ to each data point such that the Euclidean norm of that uncertainty is less than $\rho = 0.01$. For the resulting data, we perform SVM analysis using the svmtrain function of MATLAB and the code for our robust formulation, also written in MATLAB. By controlling the separation criterion for the norm to be $y_i = +1: \|x_i\|_2 < 1 + \alpha(\text{rand}(0,1))$, we can generate different imbalanced datasets. These are shown in Figs. 2 to 4.

Figure 2. Very high imbalance in data with 10% of data in minority class

Figure 3. High imbalance in data with 25% of data in minority class

It can be observed from the data generated that as we decrease the imbalance, we also have more data points misclassified on either side of the separation margin, which decreases the overall prediction accuracy. Fixing $\lambda = 0.1$, we use 50% of the data for training, 20% for tuning, and the remaining 30% for testing the classification results of the two schemes. In Tables 1, 2, and 3 we show the confusion matrices of the classifications for the different degrees of imbalance. It can be seen that in each case the minority class is predicted with greater accuracy using the robust scheme. Hence, the results show that the methods developed can improve the performance in predicting the minority class.

Table 1. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for very highly imbalanced data

  Method        Predicted Label   True -1   True +1
  MATLAB SVM    -1                100%      18.2%
                +1                0%        81.2%
  Robust SVM    -1                90.4%     9.1%
                +1                9.6%      90.9%

Table 2. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for highly imbalanced data

  Method        Predicted Label   True -1   True +1
  MATLAB SVM    -1                97.6%     25.8%
                +1                3.4%      74.2%
  Robust SVM    -1                79.2%     22.6%
                +1                20.8%     77.4%

Table 3. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for moderately imbalanced data

  Method        Predicted Label   True -1   True +1
  MATLAB SVM    -1                91.6%     35.7%
                +1                8.4%      64.3%
  Robust SVM    -1                75.5%     30.1%
                +1                24.5%     69.6%

Figure 4. Moderate imbalance in data with 40% of data in minority class
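The synthetic data of this section can be reproduced along the following lines. This NumPy sketch reflects our reading of the recipe above (the paper's actual MATLAB script is not shown); alpha = 0 gives the most imbalanced case, and larger alpha moves toward balance while adding overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, rho = 400, 0.0, 0.01

Xbar = 0.5 + rng.random((m, 2))      # 400x2 nominal points, x_ij = 0.5 + rand(0,1)
norms = np.linalg.norm(Xbar, axis=1)
y = np.where(norms < 1 + alpha * rng.random(m), 1, -1)   # +1 is the minority class

delta = rng.standard_normal((m, 2))  # spherical noise with ||delta_i||_2 <= rho
delta *= rho * rng.random((m, 1)) / np.linalg.norm(delta, axis=1, keepdims=True)
X = Xbar + delta                     # observed noisy data fed to both classifiers
```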

6 Conclusion
In this paper we have developed an RO-based scheme for the robust classification of imbalanced and noisy data. The methods have been developed for the linear classification of data into two classes. They work on the assumption that the uncertainties are convex and bounded. This is a reasonable assumption, as it applies to most practical data. Our method is shown to perform better than the classical soft-margin SVM classification in predicting the importance of the minority class. The method has to be developed further to improve its overall accuracy. It can also be extended to include any type of uncertainty measure that is convex. A non-linear robust classification procedure can also be developed using the same principles presented in this work. Further development of these methods would increase their application in the classification and prediction of various types of datasets.

References:
[1] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski, Robust Optimization, Princeton, NJ: Princeton University Press, 2009.
[2] C. Bhattacharyya, K.S. Pannagadatta, and A.J. Smola, A second order cone programming formulation for classifying missing data, in Advances in Neural Information Processing Systems, Vol. 17, 2004.
[3] G.M. Weiss and F. Provost, Learning when training data are costly: the effect of class distribution on tree induction, Journal of Artificial Intelligence Research, Vol. 19, No. 1, 2003, pp. 315-354.
[4] J. Bi and T. Zhang, Support vector classification with input data uncertainty, in Advances in Neural Information Processing Systems, Vol. 17, 2004.
[5] T.B. Trafalis and R.C. Gilbert, Maximum margin classifiers with noisy data: a robust optimization approach, in Proceedings of the International Joint Conference on Neural Networks (IJCNN) 2005, Piscataway, NJ, USA, 2005, pp. 2826-2830.
[6] T.B. Trafalis and R.C. Gilbert, Robust classification and regression using support vector machines, European Journal of Operational Research, Vol. 173, No. 3, 2006, pp. 893-909.
[7] T.B. Trafalis and R.C. Gilbert, Robust support vector machines for classification and computational issues, Optimization Methods and Software, Vol. 22, No. 1, 2007, pp. 187-198.
[8] T.B. Trafalis and S.A. Alwazzi, Support vector regression with noisy data: a second order cone programming approach, International Journal of General Systems, Vol. 36, No. 2, 2007, pp. 237-250.
[9] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[10] V.N. Vapnik, Statistical Learning Theory, Wiley, 1998.