NEW ROBUST UNSUPERVISED SUPPORT VECTOR MACHINES


J Syst Sci Complex (2011) 24: 466-476

Kun ZHAO · Mingyu ZHANG · Naiyang DENG

DOI: .7/s424--82-8
Received: March 8 / Revised: 5 February 9
© The Editorial Office of JSSC & Springer-Verlag Berlin Heidelberg

Abstract  This paper proposes a robust version of an unsupervised classification algorithm, based on a modified robust version of the primal problem of standard SVMs, which is relaxed directly, with the labels as variables, to a semi-definite program. Numerical results confirm the robustness of the proposed method.

Key words  Robust, semi-definite programming, support vector machines, unsupervised learning.

Kun ZHAO, Logistics School, Beijing Wuzi University, Beijing 101149, China.
Mingyu ZHANG, School of Economics and Management, Beijing Jiaotong University, Beijing 100044, China.
Naiyang DENG (Corresponding author), College of Science, China Agricultural University, Beijing 100083, China.
This research is supported by the Key Project of the National Natural Science Foundation of China under Grant No. 10631070. This paper was recommended for publication by Editor Xiaoguang YANG.

1 Introduction

As an important branch of unsupervised learning, clustering analysis aims at partitioning a collection of objects into groups or clusters so that members within each cluster are more closely related to one another than to objects assigned to different clusters [1]. Clustering algorithms provide automated tools that help identify structure in an unlabeled data set, in a variety of areas including bioinformatics, computer vision, information retrieval, and data mining. There is a rich resource of prior work on this subject.

Efficient convex optimization techniques have had a profound impact on the field of machine learning. Most applications have used quadratic programming techniques, as in training support vector machines (SVMs) and kernel machines [2]. Semi-definite programming (SDP) extends the toolbox of optimization methods used in machine learning beyond the current unconstrained, linear, and quadratic programming techniques. One of its main attractions is that it has proven successful in constructing relaxations of NP-hard problems.

Data uncertainty is present in many real-world optimization problems. For example, in supply chain optimization, the actual demand for products, financial returns, actual material requirements, and other resources are not precisely known when critical decisions need to be made. In engineering and science, data are subject to measurement errors, which also constitute a source of data uncertainty in the optimization model [3].

In mathematical optimization models, we commonly assume that the data inputs are precisely known, ignoring the influence of parameter uncertainties on the optimality and feasibility of the models. It is therefore conceivable that, as the data differ from the assumed nominal values, the generated optimal solution may violate critical constraints and perform poorly from an objective function point of view [3]. This observation raises the natural question of designing solution approaches that are immune to data uncertainty; that is, they are robust [4]. Robust optimization addresses the issue of data uncertainty from the perspective of computational tractability. The first step in this direction was taken by Soyster [5]. A significant step forward in developing robust optimization was taken independently by Ben-Tal and Nemirovski [6-8], El-Ghaoui and Lebret [9], and El-Ghaoui, et al. [10]. To overcome the issue of over-conservatism, these papers proposed less conservative models by considering uncertain linear problems with ellipsoidal uncertainties, which solve the robust counterpart of the nominal problem in the form of conic quadratic problems. Sim [3] proposed a new robust counterpart that inherits the characteristics of the nominal problem.

We briefly outline the contents of the paper. Standard SVMs, SDP, and robust linear optimization are reviewed in Section 2. Section 3 formulates the robust unsupervised classification algorithm, which is based on the primal problem of standard SVMs. Numerical results are presented in Section 4, and Section 5 concludes.

A word about our notation. All vectors are column vectors unless transposed to a row vector by a superscript T. The scalar product of two vectors x and y in the n-dimensional real space R^n is denoted by x^T y. For an l × d matrix A, A_i denotes the ith row of A. The identity matrix of arbitrary dimension is denoted by I, and a column vector of ones of arbitrary dimension by e.

2 Preliminaries

2.1 Support Vector Machines

Consider the supervised classification problem with training set

  T = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)},

where x_i ∈ R^n and y_i ∈ {−1, +1} is the output corresponding to the input x_i. The goal of SVMs is to find the linear classifier f(x) = w^T x + b that maximizes the minimum margin, by solving

  min_{w ∈ R^n, b ∈ R, ξ ∈ R^l}  (1/2)‖w‖² + C Σ_{i=1}^l ξ_i          (1)
  s.t.   y_i(w^T x_i + b) ≥ 1 − ξ_i,   i = 1, 2, ..., l,              (2)
         ξ_i ≥ 0,   i = 1, 2, ..., l.                                 (3)

Problem (1)-(3) is the primal problem of standard support vector machines (C-SVMs) [11]. Consistency of support vector machines is well understood [12-13].
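As a concrete illustration, the primal problem (1)-(3) can be handed directly to an off-the-shelf convex solver. The following minimal sketch uses cvxpy on a tiny synthetic data set; the data and the choice C = 1 are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import cvxpy as cp

# Tiny synthetic two-class data set (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y = np.hstack([-np.ones(10), np.ones(10)])
l, n = X.shape
C = 1.0  # trade-off parameter C in (1), an assumed value

w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(l, nonneg=True)  # slack variables, constraint (3)

# Objective (1) with margin constraints (2).
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
```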

2.2 Semi-Definite Programming

A recent development in convex optimization theory is semi-definite programming, a branch of the field aimed at optimizing over the cone of positive semi-definite matrices. In semi-definite programming one minimizes a linear function subject to the constraint that an affine combination of symmetric matrices is positive semi-definite. Such a constraint is nonlinear and nonsmooth, but convex, so semi-definite programs are convex optimization problems. Semi-definite programming can be regarded as an extension of linear programming in which the componentwise inequalities between vectors are replaced by matrix inequalities or, equivalently, the first orthant is replaced by the cone of positive semi-definite matrices. Semi-definite programming unifies several standard problems (e.g., linear and quadratic programming). Although semi-definite programs are much more general than linear programs, they are not much harder to solve: most interior-point methods for linear programming have been generalized to semi-definite programs.

Given C ∈ M^n, A_i ∈ M^n, i = 1, 2, ..., m, and b = (b_1, b_2, ..., b_m)^T ∈ R^m, where M^n is the set of symmetric n × n matrices, the standard SDP problem is to find a matrix X ∈ M^n solving

  (SDP)   min   C • X
          s.t.  A_i • X = b_i,   i = 1, 2, ..., m,
                X ⪰ 0,

where • is the matrix inner product, i.e., A • B = tr(A^T B), and the notation X ⪰ 0 means that X is a positive semi-definite matrix. The dual problem of SDP can be written as

  (SDD)   max   b^T λ
          s.t.  C − Σ_{i=1}^m λ_i A_i ⪰ 0,

where λ ∈ R^m. Interior-point methods are effective for SDP, and several software packages exist, such as SeDuMi [14] and SDPT3 [15].

2.3 Robust Linear Optimization

The general optimization problem under parameter uncertainty is as follows:

  max   f_0(x, D_0)                                  (4)
  s.t.  f_i(x, D_i) ≥ 0,   i ∈ I,                    (5)
        x ∈ X,                                       (6)

where the f_i(x, D_i), i ∈ {0} ∪ I, are given functions, X is a given set, and the D_i, i ∈ {0} ∪ I, are vectors of uncertain coefficients. When the uncertain coefficients D_i take values equal to their expected values D̄_i, we obtain the nominal problem of (4)-(6). To address the parameter uncertainty in problem (4)-(6), Ben-Tal and Nemirovski [6-7] and, independently, El-Ghaoui, et al. [9-10] proposed to solve the following robust optimization problem:

  max   f_0(x, D_0),        ∀ D_0 ∈ U_0,             (7)
  s.t.  f_i(x, D_i) ≥ 0,    ∀ D_i ∈ U_i,  i ∈ I,     (8)
        x ∈ X,                                       (9)

where the U_i, i ∈ {0} ∪ I, are given uncertainty sets. Within the robust optimization framework (7)-(9), Melvyn Sim considered the uncertainty set

  U = { D : ∃ u ∈ R^N, D = D̄ + Σ_{j∈N} ΔD_j u_j, ‖u‖ ≤ Ω },

where D̄ is the nominal value of the data, ΔD_j (j ∈ N) is a direction of data perturbation, and Ω is a parameter controlling the trade-off between robustness and optimality (robustness increases as Ω increases). The vector norm ‖u‖ may be chosen among the usual norms, such as l_1 and l_2. When the norm is chosen as l_1, the corresponding perturbation region is a polyhedron and the robust counterpart is again a linear program (LP); the detailed robust counterpart can be found in [3].
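As a concrete instance of the (SDP)/(SDD) pair in Section 2.2, the following minimal sketch solves a small random standard-form SDP with cvxpy and checks that the primal and dual optimal values coincide; the instance data are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

# Small random standard-form SDP instance (illustrative assumption).
n, m = 4, 3
rng = np.random.default_rng(1)
G = rng.normal(size=(n, n))
C = G @ G.T / n  # a PSD cost matrix keeps the primal bounded below
A, b = [], []
for _ in range(m):
    Q = rng.normal(size=(n, n))
    A.append((Q + Q.T) / 2)      # symmetric constraint matrices
    b.append(np.trace(A[-1]))    # b_i = A_i • I, so X = I is strictly feasible
b = np.array(b)

# Primal (SDP): minimize C • X  subject to  A_i • X = b_i,  X positive semi-definite.
X = cp.Variable((n, n), PSD=True)
primal = cp.Problem(cp.Minimize(cp.trace(C @ X)),
                    [cp.trace(Ai @ X) == bi for Ai, bi in zip(A, b)])
primal.solve()

# Dual (SDD): maximize b^T lam  subject to  C - sum_i lam_i A_i  positive semi-definite.
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(b @ lam),
                  [C - sum(lam[i] * A[i] for i in range(m)) >> 0])
dual.solve()

# Slater's condition holds (X = I is strictly feasible), so the two values agree.
print(primal.value, dual.value)
```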

3 Robust Unsupervised Classification Algorithm

Semi-definite programming has shown its utility in machine learning. Lanckriet, et al. [16] showed how the kernel matrix can be learned from data via SDP techniques. They presented new methods for learning a kernel matrix from a labeled data set and from a transductive data set; both methods relax the problem to an SDP. In the transductive setting, one uses the labeled data to learn a good embedding (kernel matrix), which is then applied to the unlabeled part of the data. De Bie and Cristianini [17] relaxed the two-class transduction problem to an SDP based on transductive SVMs. Building on [16-17], Xu, et al. [18] developed methods for two-class unsupervised and semi-supervised classification problems based on bounded C-SVMs, again by relaxation to SDP. The purpose is to find a labeling that has the maximum margin, not to find a large-margin classifier; that is, among all possible labelings, to find the one for which an SVM trained subsequently would obtain the maximum margin. A class-balance constraint −ε ≤ Σ_{i=1}^l y_i ≤ ε must be added (ε is an integer), since otherwise one could simply assign all the data to the same class and obtain an unbounded margin; this constraint also limits the influence of noisy data to some extent. Zhao, et al. [19-20] presented other versions based on bounded ν-SVMs and on Lagrangian SVMs, respectively. The virtues of the first algorithm are the ease of parameter selection and better classification results, while the superiority of the second is that it consumes far fewer CPU seconds than the other algorithms [18-19] on the same data sets. All of the unsupervised classification algorithms mentioned above involve complicated procedures for relaxing an NP-hard problem to a semi-definite program, because they all need to derive the dual problem twice.

To avoid this complicated procedure, Zhao, et al. [21] directly relaxed a modified version of the primal problem of SVMs, with the labels as variables, to a semi-definite program. In all the methods mentioned above, the training data in the optimization problems are implicitly assumed to be known exactly. In practice, however, the training data carry perturbations, since they are usually corrupted by measurement noise. Zhao, et al. [22] proposed robust unsupervised and semi-supervised classification algorithms based on bounded C-SVMs, which again involve the complicated procedure of deriving the dual problem twice. To avoid this, we propose a robust unsupervised classification algorithm with polyhedral perturbations, based on the primal problem of standard SVMs.

To model the measurement noise, we assume that each training point x_i ∈ R^n, i = 1, 2, ..., l, is perturbed to x̃_i; concretely,

  x̃_ij = x_ij + Δx_ij z_ij,  i = 1, 2, ..., l,  j = 1, 2, ..., n,  ‖z_i‖_p ≤ Ω,

where z_i is a random variable. When its norm is selected as the l_1 norm, the condition ‖z_i‖_1 ≤ Ω is equivalent to Σ_{j=1}^n |z_ij| ≤ Ω, i = 1, 2, ..., l.
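The l_1 ball is what makes the worst case over these perturbations easy to evaluate. The key identity, a one-line derivation consistent with [3] though not spelled out in the paper, is that for any c ∈ R^n,

  min_{‖z‖_1 ≤ Ω} c^T z = −Ω ‖c‖_∞ = −Ω max_{1≤j≤n} |c_j|,

because a linear function attains its minimum over the cross-polytope {z : ‖z‖_1 ≤ Ω} at one of the vertices ±Ω e_j. Applied with c_j = y_i Δx_ij w_j, this identity turns each semi-infinite margin constraint into finitely many linear constraints, which is exactly the form of the robust counterpart (13)-(15) below.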
Since x̃_ij = x_ij + Δx_ij z_ij, the condition Σ_{j=1}^n |z_ij| ≤ Ω is in turn equivalent to

  Σ_{j=1}^n |x̃_ij − x_ij| / Δx_ij ≤ Ω,  i = 1, 2, ..., l,

so the perturbation region of each x_i is a polyhedron.
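A quick numerical sanity check of the identity above on a random instance, using scipy's LP solver (the data here are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, Omega = 5, 1.5
c = rng.normal(size=n)  # plays the role of the vector (y_i * w_j * dx_ij)_j

# Minimize c^T z subject to ||z||_1 <= Omega, via the split z = p - q with p, q >= 0:
# variables [p; q], objective c^T p - c^T q, budget sum(p) + sum(q) <= Omega.
res = linprog(c=np.hstack([c, -c]),
              A_ub=np.ones((1, 2 * n)), b_ub=[Omega],
              bounds=[(0, None)] * (2 * n))
assert res.success

print("LP worst case:               ", res.fun)
print("closed form -Omega*||c||_inf:", -Omega * np.max(np.abs(c)))  # the two agree
```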

Consider the primal problem of standard SVMs when the training data carry the perturbations described above. We obtain the optimization problem

  min_{w,b,ξ}  (1/2) w^T w + C Σ_{i=1}^l ξ_i                        (10)
  s.t.   y_i((w · x̃_i) + b) ≥ 1 − ξ_i,  for all admissible x̃_i,    (11)
         ξ_i ≥ 0,  i = 1, 2, ..., l.                                (12)

Constraint (11) contains infinitely many constraints, so problem (10)-(12) is a semi-infinite optimization problem, and there seems to be no good method for solving it directly. Following robust linear optimization, we instead form its robust counterpart. In Sim's robust framework [3], the constraint y_i((w · x̃_i) + b) ≥ 1 − ξ_i for all admissible x̃_i is equivalent to

  y_i((w · x_i) + b) ≥ 1 − ξ_i + Ω t_i,   t_i ≥ 0,                  (13)
  Δx_ij w_j y_i ≤ t_i,   j = 1, 2, ..., n,                          (14)
  −Δx_ij w_j y_i ≤ t_i,  j = 1, 2, ..., n.                          (15)

We thus obtain the robust primal problem of standard SVMs:

  min_{w,b,ξ,t}  (1/2) w^T w + C Σ_{i=1}^l ξ_i                      (16)
  s.t.   y_i((w · x_i) + b) ≥ 1 − ξ_i + Ω t_i,                      (17)
         t_i ≥ 0,                                                   (18)
         Δx_ij w_j y_i ≤ t_i,   j = 1, 2, ..., n,                   (19)
         −Δx_ij w_j y_i ≤ t_i,  j = 1, 2, ..., n,                   (20)
         ξ_i ≥ 0,  i = 1, 2, ..., l.                                (21)

For known labels this is an ordinary convex quadratic program; a small numerical sketch of it appears after problem (28)-(32) below. When the labels y_i, i = 1, 2, ..., l, are unknown, the unsupervised classification problem becomes the following NP-hard optimization problem:

  min_{y∈{−1,1}^l, w,b,ξ,t}  (1/2) w^T w + C Σ_{i=1}^l ξ_i          (22)
  s.t.   y_i((w · x_i) + b) ≥ 1 − ξ_i + Ω t_i,                      (23)
         t_i ≥ 0,  ξ_i ≥ 0,  i = 1, 2, ..., l,                      (24)
         Δx_ij w_j y_i ≤ t_i,   j = 1, 2, ..., n,                   (25)
         −Δx_ij w_j y_i ≤ t_i,  j = 1, 2, ..., n,                   (26)
         −ε ≤ Σ_{i=1}^l y_i ≤ ε.                                    (27)

We now modify this formulation so that the unsupervised classification problem can be treated appropriately. When the variables ξ_i and t_i are replaced by ξ_i² and t_i², respectively, the constraints (24) can be dropped. We thus obtain the robust modified primal problem of SVMs for the unsupervised classification problem:

  min_{y∈{−1,1}^l, w,b,ξ,t}  (1/2) ‖w‖² + C Σ_{i=1}^l ξ_i²          (28)
  s.t.   y_i((w · x_i) + b) ≥ 1 − ξ_i² + Ω t_i²,                    (29)
         Δx_ij w_j y_i ≥ −t_i²,   j = 1, 2, ..., n,                 (30)
         −Δx_ij w_j y_i ≥ −t_i²,  j = 1, 2, ..., n,                 (31)
         −ε ≤ Σ_{i=1}^l y_i ≤ ε.                                    (32)
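As promised above, here is a minimal cvxpy sketch of the supervised robust counterpart (16)-(21); the synthetic data, the perturbation directions Δx, and the values of C and Ω are illustrative assumptions. Since |y_i| = 1 and Δx_ij > 0, the constraint pairs (19)-(20) are folded into Δx_ij |w_j| ≤ t_i.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y = np.hstack([-np.ones(10), np.ones(10)])
l, n = X.shape
dX = 0.1 * np.abs(X) + 0.05      # perturbation directions dx_ij > 0 (assumed)
C, Omega = 1.0, 1.0              # parameters C and Omega (assumed values)

w, b = cp.Variable(n), cp.Variable()
xi = cp.Variable(l, nonneg=True)  # slacks, constraint (21)
t = cp.Variable(l, nonneg=True)   # protection levels, constraint (18)

# (17): robust margin constraint; (19)-(20): dx_ij * |w_j| <= t_i for all i, j.
cons = [cp.multiply(y, X @ w + b) >= 1 - xi + Omega * t]
cons += [dX[:, j] * cp.abs(w[j]) <= t for j in range(n)]

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), cons)
prob.solve()
print("robust w =", w.value, " b =", b.value)
```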

Setting the variable λ = (y^T, w^T, b, ξ^T, t^T)^T in problem (28)-(32), we obtain

  min_λ   (1/2) λ^T A_0 λ                                           (33)
  s.t.    λ^T A_i λ ≥ 1,    i = 1, 2, ..., l,                       (34)
          λ^T B_ij λ ≥ 0,   i = 1, ..., l,  j = 1, ..., n,          (35)
          λ^T D_ij λ ≥ 0,   i = 1, ..., l,  j = 1, ..., n,          (36)
          −ε ≤ (e_l^T, 0_{n+2l+1}^T) λ ≤ ε,                         (37)

where

  A_0 = Diag(0_{l×l}, I_{n×n}, 0, 2C I_{l×l}, 0_{l×l}),             (38)

  A_i  = [ 0_{l×l}     A_{i1}             0_{l×l}   0_{l×l} ;
           A_{i1}^T    0_{(n+1)×(n+1)}    0         0       ;
           0           0                  A_{i2}    0       ;
           0           0                  0         A_{i3}  ],      (39)

  B_ij = [ 0_{l×l}     B_{ij1}            0_{l×l}   0_{l×l} ;
           B_{ij1}^T   0_{(n+1)×(n+1)}    0         0       ;
           0           0                  0_{l×l}   0       ;
           0           0                  0         A_{i2}  ],      (40)

  D_ij = [ 0_{l×l}    −B_{ij1}            0_{l×l}   0_{l×l} ;
          −B_{ij1}^T   0_{(n+1)×(n+1)}    0         0       ;
           0           0                  0_{l×l}   0       ;
           0           0                  0         A_{i2}  ].      (41)

Here A_{i1} ∈ R^{l×(n+1)} (i = 1, 2, ..., l) is the matrix whose ith row is (1/2)(x_i^T, 1) and whose remaining elements are all zeros; A_{i2} ∈ R^{l×l} is the matrix whose (i, i) element is 1 and whose remaining elements are all zeros; A_{i3} ∈ R^{l×l} is the matrix whose (i, i) element is −Ω and whose remaining elements are all zeros; and B_{ij1} ∈ R^{l×(n+1)} is the matrix whose (i, j) element is (1/2)Δx_ij and whose remaining elements are all zeros. With these definitions, λ^T A_i λ = y_i((w · x_i) + b) + ξ_i² − Ω t_i², λ^T B_ij λ = Δx_ij w_j y_i + t_i², and λ^T D_ij λ = −Δx_ij w_j y_i + t_i², recovering (29)-(31).

Let λλ^T = M, and relax the constraint λλ^T = M to M ⪰ 0 together with diag(M)_l = e_l. We then obtain the semi-definite programming problem

  min_M   (1/2) Tr(M A_0)                                           (42)
  s.t.    Tr(M A_i) ≥ 1,    i = 1, 2, ..., l,                       (43)
          Tr(M B_ij) ≥ 0,   i = 1, ..., l,  j = 1, ..., n,          (44)
          Tr(M D_ij) ≥ 0,   i = 1, ..., l,  j = 1, ..., n,          (45)
          −εe ≤ M_l (e_l^T, 0_{n+2l+1}^T)^T ≤ εe,                   (46)
          M ⪰ 0,  diag(M)_l = e_l,                                  (47)

where diag(M)_l denotes the first l diagonal elements of the matrix M, and M_l denotes the first l rows of M.

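To make the relaxation concrete, the following self-contained sketch assembles A_0, A_i, B_ij, and D_ij as in (38)-(41) for a toy data set and solves (42)-(47) with cvxpy; the data, the parameter values (C, Ω, ε), and the SCS solver choice are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
l, n = 6, 2
X = np.vstack([rng.normal(-2, 0.5, (3, 2)), rng.normal(2, 0.5, (3, 2))])
dX = 0.1 * np.abs(X) + 0.05      # perturbation directions dx_ij (assumed)
C, Omega, eps = 1.0, 0.5, 0      # assumed parameter values

# lambda = (y, w, b, xi, t) has N = 3l + n + 1 entries; block offsets:
N = 3 * l + n + 1
ow, ob, oxi, ot = l, l + n, l + n + 1, 2 * l + n + 1

A0 = np.zeros((N, N))
A0[ow:ow + n, ow:ow + n] = np.eye(n)               # (1/2)||w||^2 term
A0[oxi:oxi + l, oxi:oxi + l] = 2 * C * np.eye(l)   # C * sum(xi_i^2) term

def coupling(i, row):
    """Symmetric matrix whose quadratic form gives y_i * (row . (w, b))."""
    Q = np.zeros((N, N))
    Q[i, ow:ob + 1] = row / 2
    return Q + Q.T

A, B, D = [], [], []
for i in range(l):
    Ai = coupling(i, np.append(X[i], 1.0))   # y_i (w . x_i + b)
    Ai[oxi + i, oxi + i] = 1.0               #  + xi_i^2
    Ai[ot + i, ot + i] = -Omega              #  - Omega * t_i^2
    A.append(Ai)
    for j in range(n):
        row = np.zeros(n + 1); row[j] = dX[i, j]
        Bij = coupling(i, row);  Bij[ot + i, ot + i] = 1.0   #  y_i dx_ij w_j + t_i^2
        Dij = coupling(i, -row); Dij[ot + i, ot + i] = 1.0   # -y_i dx_ij w_j + t_i^2
        B.append(Bij); D.append(Dij)

M = cp.Variable((N, N), PSD=True)
v = np.zeros(N); v[:l] = 1.0                     # picks out the y block
cons = [cp.trace(Ai @ M) >= 1 for Ai in A]       # (43)
cons += [cp.trace(Q @ M) >= 0 for Q in B + D]    # (44)-(45)
cons += [cp.abs((M @ v)[:l]) <= eps,             # (46), class balance
         cp.diag(M)[:l] == 1]                    # (47), diag(M)_l = e_l
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(A0 @ M)), cons)
prob.solve(solver=cp.SCS)

eta = np.linalg.eigh(M.value)[1][:l, -1]         # top eigenvector, first l entries
print("relaxation value:", prob.value, " labels:", np.sign(eta))
```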
After obtaining an optimal solution M* of problem (42)-(47), the labels y^T of the training data are recovered by the following rounding method (a code sketch of this procedure appears after Table 2).

Rounding method

1) Take the first l elements t* = (η_1, η_2, ..., η_l)^T of the eigenvector corresponding to the largest eigenvalue of the matrix M*, and construct the vector y* = (y*_1, y*_2, ..., y*_l)^T = (sgn(η_1), sgn(η_2), ..., sgn(η_l))^T. If y* satisfies the constraint −ε ≤ Σ_{i=1}^l y*_i ≤ ε, set y = y*; this is the final labeling of the data, and the two classes are clustered.

2) If y* does not satisfy the constraint −ε ≤ Σ_{i=1}^l y*_i ≤ ε, let δ = y*^T e − ε. The labels y are then obtained from y* as follows: select the δ entries η_i with the smallest absolute values in the majority class and flip the corresponding labels in y*.

4 Numerical Results

In this section we test our algorithm (RPSDP) against PRC-SDP [22] on several data sets, using the SeDuMi library, for different values of the robustness trade-off parameter Ω. We use the synthetic data sets AI and Two-circles, which have 90 and 100 points in R², respectively. The parameters ε and C are fixed, and the directions of data perturbation are generated randomly. The results are shown in Tables 1 and 2; each entry is the fraction of misclassified points.

Table 1  Results for different parameter Ω on AI with RPSDP and PRC-SDP

  Ω          0.25    0.5                     2
  RPSDP      2/90     /90    5/90    6/90    5/90
  PRC-SDP     /90    2/90    3/90    5/90    3/90

Table 2  Results for different parameter Ω on Two-circles with RPSDP and PRC-SDP

  Ω          0.25    0.5     0.75    1       1.5     2       3       4       5       6
  RPSDP       /100    /100   2/100   2/100   2/100   2/100   2/100   2/100   2/100   2/100
  PRC-SDP    5/100   6/100   6/100   6/100   9/100   8/100   7/100   7/100   7/100   7/100
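A self-contained sketch of the rounding method above. The input matrix here is synthesized from a planted labeling rather than taken from a solver, the tie-break at η_i = 0 is an added assumption, and the flip count uses the fact that each flip changes Σ y_i by 2.

```python
import numpy as np

def round_labels(M, l, eps):
    """Rounding method: signs of the first l entries of the top eigenvector
    of M, then repair of the class-balance constraint |sum(y)| <= eps."""
    eigvals, V = np.linalg.eigh(M)      # eigenvalues in ascending order
    eta = V[:l, -1]                     # first l entries of the top eigenvector
    y = np.sign(eta)
    y[y == 0] = 1.0                     # tie-break (an added assumption)
    excess = int(abs(y.sum())) - eps
    if excess > 0:
        maj = 1.0 if y.sum() > 0 else -1.0
        small_first = [i for i in np.argsort(np.abs(eta)) if y[i] == maj]
        for i in small_first[: (excess + 1) // 2]:  # each flip changes sum(y) by 2
            y[i] = -maj
    return y

# Demo on a matrix built from a planted labeling (illustrative, not solver output).
rng = np.random.default_rng(5)
y_true = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0])
lam = np.hstack([y_true, rng.normal(size=5)])   # fake (w, b, xi, t) tail
M = np.outer(lam, lam)
print(round_labels(M, l=6, eps=0))
```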

Figure 1  Results by RPSDP on AI with different parameter Ω

Figure 2  Results by RPSDP on Two-circles with different parameter Ω

We also run our algorithm on the Digits data sets, which can be obtained from http://www.cs.toronto.edu/roweis/data.html. We select Numbers 3 and 2, Numbers 7 and 1, Numbers 9 and 0, and Numbers 5 and 6 as data sets, respectively. Every number has five samples of 256 dimensions each. RPSDP has (n + 3l + 1)² variables (l and n are the number and the dimension of the training data, respectively), and SeDuMi has difficulty handling so many variables when it is used to solve the SDP; it therefore seems better to reduce the dimension of the training data. Using principal component analysis, the dimension is reduced from 256 to 9. As with the synthetic data sets, in order to evaluate the influence of the robustness trade-off parameter Ω, we set Ω from 0.5 to 5, with the directions of data perturbation generated randomly. To evaluate robust classification performance, we take a labeled data set and remove the labels, run the robust unsupervised classification algorithm, label each resulting class with the majority class according to the original training labels, and measure the number of misclassifications. The results are shown in Table 3; each entry is the misclassification rate.

Table 3  Results by RPSDP on Digits data sets with different parameter Ω

  Ω       Digit23   Digit17   Digit09   Digit56
  0.5     2/        2/        3/        2/
  1       2/        2/        2/        2/
  1.5     2/         /        2/        2/
  2.5     2/         /        2/        2/
  5       4/        2/        3/        2/

5 Conclusions

In this paper we have established a robust unsupervised classification algorithm based on the primal problem of standard SVMs. Principal component analysis is utilized to reduce the dimension of the training data, because the conic convex optimization solver SeDuMi used in our numerical experiments can only handle problems on small data sets. The numerical results of Section 4 show that RPSDP is less affected by the robustness parameter Ω than PRC-SDP. In the future, we will continue to estimate the quality of the SDP relaxation and derive a worst-case approximation ratio.

References

[1] J. A. Hartigan, Clustering Algorithms, John Wiley & Sons, New York, 1975.
[2] B. Schoelkopf and A. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2002.
[3] M. Sim, Robust optimization, PhD thesis, Massachusetts Institute of Technology, 2004.
[4] D. Bertsimas and M. Sim, The price of robustness, Operations Research, 2004, 52(1): 35-53.
[5] A. L. Soyster, Convex programming with set-inclusive constraints and applications to inexact linear programming, Oper. Res., 1973, 21: 1154-1157.
[6] A. Ben-Tal and A. Nemirovski, Robust convex optimization, Math. Oper. Res., 1998, 23: 769-805.
[7] A. Ben-Tal and A. Nemirovski, Robust solutions of uncertain linear programs, Oper. Res. Letters, 1999, 25: 1-13.
[8] A. Ben-Tal and A. Nemirovski, Robust solutions of linear programming problems contaminated with uncertain data, Math. Program., 2000, 88: 411-424.
[9] L. El-Ghaoui and H. Lebret, Robust solutions to least-squares problems with uncertain data matrices, SIAM J. Matrix Anal. Appl., 1997, 18: 1035-1064.

[10] L. El-Ghaoui, F. Oustry, and H. Lebret, Robust solutions to uncertain semidefinite programs, SIAM J. Optim., 1998, 9(1): 33-52.
[11] N. Y. Deng and Y. J. Tian, A New Method of Data Mining: Support Vector Machines, Science Press, Beijing, 2004.
[12] D. R. Chen, et al., Support vector machine soft margin classifiers: Error analysis, Journal of Machine Learning Research, 2004, 5: 1143-1175.
[13] S. Smale and D. X. Zhou, Learning theory estimates via integral operators and their approximations, Constr. Approx., 2007, 26: 153-172.
[14] J. F. Sturm, Using SeDuMi 1.02, a Matlab toolbox for optimization over symmetric cones, Optimization Methods and Software, 1999, 11(1-2): 625-653.
[15] K. C. Toh, M. J. Todd, and R. H. Tütüncü, SDPT3: a Matlab package for semidefinite programming, Technical Report, School of Operations Research and Industrial Engineering, Cornell University, Ithaca, NY, 1996.
[16] G. Lanckriet, et al., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research, 2004, 5: 27-72.
[17] T. De Bie and N. Cristianini, Convex methods for transduction, Advances in Neural Information Processing Systems 16 (NIPS 2003), 2003.
[18] L. Xu, et al., Maximum margin clustering, Advances in Neural Information Processing Systems 17 (NIPS 2004), 2004.
[19] K. Zhao, Y. J. Tian, and N. Y. Deng, Unsupervised and semi-supervised two-class support vector machines, Sixth IEEE International Conference on Data Mining Workshops, 2006.
[20] K. Zhao, Y. J. Tian, and N. Y. Deng, Unsupervised and semi-supervised Lagrangian support vector machines, International Conference on Computational Science Workshops, Lecture Notes in Computer Science, 2007, 4489: 882-889.
[21] K. Zhao, Y. J. Tian, and N. Y. Deng, New unsupervised support vector machines, Proceedings of MCDM, CCIS, 2009, 35: 606-613.
[22] K. Zhao, Y. J. Tian, and N. Y. Deng, Robust unsupervised and semi-supervised bounded C-support vector machines, Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, Omaha, NE, USA, October 2007.