SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

Ding Ma and Michael Saunders
Dept of Management Science and Engineering, Stanford University, Stanford, CA, USA
Working paper, January 2015

1 Introduction

In machine learning, support vector machines (SVMs) are one of the best supervised learning models for data classification and regression analysis. Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises in SVMs. It was invented by John Platt at Microsoft Research in 1998 [3] and is widely used for solving SVM models. PDCO (Primal-Dual interior method for Convex Objectives) is a Matlab primal-dual interior method for solving linearly constrained optimization problems with a convex objective function [4].

This report provides an overview of SVMs and the SMO algorithm, and a reformulation of the SVM QP problem to suit PDCO. The PDCO algorithm is then presented in detail, including the barrier subproblems arising in interior methods, the associated Newton equations, and the sparse-matrix computation used to calculate each search direction. We compare PDCO with SMO on the data sets used in Platt's paper: one real-world data set concerning Income Prediction, and two artificial data sets that simulate two extreme cases (good or poor data separation). The solver computation times are compared, with PDCO showing better scalability. We conclude with a discussion of the results and possible future work.

2 Overview of SVM

The support vector machine was invented by Vladimir Vapnik in 1979 [5]. In the linear classifier for a binary classification case, an SVM constructs a hyperplane that separates a set of points labeled y = -1 from a set of points labeled y = +1, using a vector of features x. The hyperplane is represented by x^T w + b = 0, with vector w and scalar b to be determined. To achieve more confident classification, the classifier should separate the positively and negatively labeled points with as big a gap as possible, the gap being proportional to 1/\|w\|. After normalizing the distances from the nearest points to the hyperplane on both sides, we have the optimization problem

    \min_{w,b} \ \tfrac{1}{2}\|w\|^2
    subject to  y_i(x_i^T w + b) \ge 1,   i = 1, \dots, m,        (1)

where (x_i, y_i) denotes the features and label of the ith training sample. By Lagrange duality, the above optimization problem is converted to the following dual optimization problem:

    \max_{\alpha} \ W(\alpha) = \sum_{i=1}^m \alpha_i - \tfrac{1}{2}\sum_{i=1}^m\sum_{j=1}^m y_i y_j \alpha_i \alpha_j \, x_i^T x_j
    subject to  \alpha_i \ge 0,   i = 1, \dots, m,    \sum_{i=1}^m \alpha_i y_i = 0.

Once the dual problem is solved, the normal vector w and the threshold b can be determined by

    w = \sum_{i=1}^m \alpha_i y_i x_i,    b = y_j - x_j^T w   for some  \alpha_j > 0.

For the non-separable case, \ell_1 regularization is applied so that the method still applies and is less sensitive to outliers. The original optimization problem is reformulated as follows:

    \min_{w,b} \ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i
    subject to  y_i(x_i^T w + b) \ge 1 - \xi_i,   \xi_i \ge 0,   i = 1, \dots, m.

Again by Lagrange duality, the following dual optimization problem is formed:

    \max_{\alpha} \ W(\alpha) = \sum_{i=1}^m \alpha_i - \tfrac{1}{2}\sum_{i=1}^m\sum_{j=1}^m y_i y_j \alpha_i \alpha_j \, x_i^T x_j
    subject to  0 \le \alpha_i \le C,   i = 1, \dots, m,    \sum_{i=1}^m \alpha_i y_i = 0.        (2)

Problem (2) is solved by sequential minimal optimization.
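For concreteness, the following Matlab sketch sets up dual (2) for a small synthetic data set and solves it with quadprog from the Optimization Toolbox, then recovers w and b as described above. This is our own illustration (the data, C, and the 1e-6 threshold are arbitrary choices), not part of SMO or PDCO.

    % Hedged illustration: solve dual (2) directly with quadprog on toy data.
    rng(1);
    m = 40;  n = 2;  C = 1;
    X = [randn(m/2,n)+1; randn(m/2,n)-1];          % two noisy clusters
    y = [ones(m/2,1); -ones(m/2,1)];               % labels +1 / -1
    Q = (y*y') .* (X*X');                          % Q_ij = y_i y_j x_i'x_j
    f = -ones(m,1);                                % minimize 1/2 a'Qa - e'a = -W(a)
    alpha = quadprog(Q, f, [], [], y', 0, zeros(m,1), C*ones(m,1));
    w  = X' * (alpha .* y);                        % w = sum_i alpha_i y_i x_i
    sv = find(alpha > 1e-6 & alpha < C - 1e-6);    % on-margin support vectors
    b  = mean(y(sv) - X(sv,:)*w);                  % b = y_j - x_j'w (assumes sv nonempty)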

2.1 SMO for SVM

SMO is a special application of the coordinate ascent method to the SVM dual optimization problem. The algorithm is as follows:

Repeat until convergence {
  1. Heuristically choose a pair of variables \alpha_i and \alpha_j.
  2. Keeping all other \alpha's fixed, optimize W(\alpha) with respect to \alpha_i and \alpha_j.
}

The convergence of SMO is tested by checking whether the KKT complementarity conditions of (2) are satisfied within some tolerance (around 10^{-3}). The KKT complementarity conditions are:

    \alpha_i = 0       \Rightarrow   y_i(x_i^T w + b) \ge 1,
    \alpha_i = C       \Rightarrow   y_i(x_i^T w + b) \le 1,
    0 < \alpha_i < C   \Rightarrow   y_i(x_i^T w + b) = 1.
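To illustrate step 2, the following Matlab sketch performs one such pairwise update for a chosen pair (i, j) with a linear kernel, in the manner of Platt [3]. It is only a hedged sketch: the index-selection heuristics, the bias update, and the outer loop are omitted, and the variables X, y, alpha, b, C are our own assumptions about the surrounding code.

    % One SMO-style pairwise update (linear kernel), following the scheme in [3].
    % Assumes X (m-by-n), y in {-1,+1}, current alpha, bias b, penalty C,
    % and a chosen index pair (i,j).
    Kii = X(i,:)*X(i,:)';  Kjj = X(j,:)*X(j,:)';  Kij = X(i,:)*X(j,:)';
    Ei  = (alpha.*y)'*(X*X(i,:)') + b - y(i);      % prediction error at sample i
    Ej  = (alpha.*y)'*(X*X(j,:)') + b - y(j);      % prediction error at sample j
    eta = Kii + Kjj - 2*Kij;                       % curvature along the feasible line
    if y(i) ~= y(j)                                % box [L,H] keeping y'*alpha = 0
        L = max(0, alpha(j) - alpha(i));    H = min(C, C + alpha(j) - alpha(i));
    else
        L = max(0, alpha(i) + alpha(j) - C);    H = min(C, alpha(i) + alpha(j));
    end
    if eta > 0 && H > L
        ajOld    = alpha(j);
        alpha(j) = min(max(ajOld + y(j)*(Ei - Ej)/eta, L), H);  % clipped step
        alpha(i) = alpha(i) + y(i)*y(j)*(ajOld - alpha(j));     % restore y'*alpha = 0
    end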

2.2 SVM reformulation

The SVM dual optimization problem can be reformulated in a way that suits a different optimization algorithm, PDCO. We define some matrix and vector quantities as follows:

    X = [x_1^T; x_2^T; \dots; x_m^T],   y = (y_1, \dots, y_m)^T,   \alpha = (\alpha_1, \dots, \alpha_m)^T,
    Y = diag(y_i),   D = diag(y_i \alpha_i),

where X is an m x n matrix with rows representing training samples and columns corresponding to features for each sample. Then \sum_{i=1}^m y_i \alpha_i x_i^T = e^T D X and De = Y\alpha, where e is a vector of 1s. Thus,

    -W(\alpha) = \tfrac{1}{2} e^T D X X^T D e - e^T\alpha = \tfrac{1}{2}\alpha^T Y^T X X^T Y \alpha - e^T\alpha.

If we define Z = X^T Y \in R^{n \times m} and v = Z\alpha, we obtain a new formulation of the SVM problem:

    \min_{\alpha, v} \ W(\alpha, v) = -e^T\alpha + \tfrac{1}{2} v^T v
    subject to  -Z\alpha + v = 0,        (3)
                y^T\alpha = 0,           (4)
                0 \le \alpha \le Ce.     (5)

Note that this formulation of the SVM problem is different from the five other formulations mentioned by Ferris and Munson [1] for their large-scale interior-point QP solver.

3 PDCO: Primal-Dual interior method for Convex Objectives

PDCO is a Matlab solver for optimization problems that are nominally of the form

    CO:   \min_x \ \phi(x)   subject to   Ax = b,   l \le x \le u,

where \phi(x) is a convex function with known gradient g(x) and Hessian H(x), and A \in R^{m \times n}. The format of CO is suitable for any linear constraints. For example, a double-sided constraint \alpha \le a^T x \le \beta (\alpha < \beta) should be entered as a^T x - \xi = 0, \alpha \le \xi \le \beta, where x and \xi are relevant parts of CO's variable x.

To allow for constrained least-squares problems, and to ensure unique primal and dual solutions and improved stability, PDCO really solves the regularized problem

    CO2:  \min_{x, r} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2   subject to   Ax + D_2 r = b,   l \le x \le u,

where D_1, D_2 are specified positive-definite diagonal matrices. The diagonals of D_1 are typically small (10^{-3} or 10^{-4}). Similarly for D_2 if the constraints in CO should be satisfied reasonably accurately. For least-squares applications, some diagonals of D_2 will be 1. Note that some elements of l and u may be -\infty and +\infty respectively, but we expect no large numbers in A, b, D_1, D_2. If D_2 is small, we would expect A to be under-determined (m < n). If D_2 = I, A may have any shape.

3.1 The barrier approach

First we introduce slack variables x_1, x_2 to convert the bounds to non-negativity constraints:

    CO3:  \min_{x, r, x_1, x_2} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2
          subject to  Ax + D_2 r = b,   x - x_1 = l,   x + x_2 = u,   x_1, x_2 \ge 0.

Then we replace the non-negativity constraints by the log barrier function, obtaining a sequence of convex subproblems with decreasing values of \mu (\mu > 0):

    CO\mu:  \min_{x, r, x_1, x_2} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2 - \mu\sum_j \big(\ln[x_1]_j + \ln[x_2]_j\big)
            subject to  Ax + D_2 r = b   : y
                        x - x_1 = l      : z_1
                        x + x_2 = u,     : z_2

where y, z_1, z_2 denote dual variables for the associated constraints. With \mu > 0, most variables are strictly positive: x_1, x_2, z_1, z_2 > 0. Exceptions: if l_j = -\infty or u_j = \infty, the corresponding equation is omitted and the jth element of x_1 or x_2 doesn't exist.

The KKT conditions for the barrier subproblem involve the three primal equations of CO\mu, along with four dual equations stating that the gradient of the subproblem objective should be a linear combination of the gradients of the primal constraints:

    Ax + D_2 r = b
    x - x_1 = l
    x + x_2 = u
    A^T y + z_1 - z_2 = g(x) + D_1^2 x   : x
    D_2 y = r                            : r
    X_1 z_1 = \mu e                      : x_1
    X_2 z_2 = \mu e,                     : x_2

where X_1 = diag(x_1), X_2 = diag(x_2), and similarly for Z_1, Z_2 later. The last two equations are commonly called the perturbed complementarity conditions. Initially they are in a different form. The dual equation for x_1 is really z_1 = \mu \nabla\big(\sum_j \ln[x_1]_j\big) = \mu X_1^{-1} e, where e is a vector of 1s. Thus, x_1 > 0 implies z_1 > 0, and multiplying by X_1 gives the equivalent equation X_1 z_1 = \mu e as stated.

3.2 Newton's method

We now eliminate r = D_2 y and apply Newton's method:

    A(x + \Delta x) + D_2^2(y + \Delta y) = b
    (x + \Delta x) - (x_1 + \Delta x_1) = l
    (x + \Delta x) + (x_2 + \Delta x_2) = u
    A^T(y + \Delta y) + (z_1 + \Delta z_1) - (z_2 + \Delta z_2) = g + H\Delta x + D_1^2(x + \Delta x)
    X_1 z_1 + X_1\Delta z_1 + Z_1\Delta x_1 = \mu e
    X_2 z_2 + X_2\Delta z_2 + Z_2\Delta x_2 = \mu e,

where g and H are the current objective gradient and Hessian. To solve this Newton system, we work with three sets of residuals:

    \Delta x - \Delta x_1 = r_l,   \Delta x + \Delta x_2 = r_u,      where  r_l \equiv l - x + x_1,   r_u \equiv u - x - x_2,        (6)
    X_1\Delta z_1 + Z_1\Delta x_1 = c_l,   X_2\Delta z_2 + Z_2\Delta x_2 = c_u,      where  c_l \equiv \mu e - X_1 z_1,   c_u \equiv \mu e - X_2 z_2,        (7)
    A\Delta x + D_2^2\Delta y = r_1,   -(H + D_1^2)\Delta x + A^T\Delta y + \Delta z_1 - \Delta z_2 = r_2,        (8)
    where  r_1 \equiv b - Ax - D_2^2 y,   r_2 \equiv g + D_1^2 x - A^T y - z_1 + z_2.        (9)

We use (6) and (7) to eliminate \Delta x_1, \Delta x_2, \Delta z_1, \Delta z_2 from (8):

    \Delta x_1 = \Delta x - r_l,   \Delta x_2 = r_u - \Delta x,
    \Delta z_1 = X_1^{-1}(c_l - Z_1\Delta x_1),   \Delta z_2 = X_2^{-1}(c_u - Z_2\Delta x_2).        (10)

With

    \bar{H} \equiv H + D_1^2 + X_1^{-1} Z_1 + X_2^{-1} Z_2,
    w \equiv r_2 - X_1^{-1}(c_l + Z_1 r_l) + X_2^{-1}(c_u - Z_2 r_u),        (11)

we find that

    [ -\bar{H}    A^T   ] [ \Delta x ]   [ w   ]
    [    A       D_2^2  ] [ \Delta y ] = [ r_1 ].        (12)

3.3 Keeping x safe

If the objective is the entropy function \phi(x) = \sum_j x_j \ln x_j, for example, it is essential to keep x > 0. However, problems CO3 and CO\mu treat x as having no bounds except when x - x_1 = l and x + x_2 = u are satisfied in the limit. PDCO therefore safeguards x at every iteration by setting

    x_j = [x_1]_j    if  l_j = 0  and  l_j < u_j,
    x_j = -[x_2]_j   if  u_j = 0  and  l_j < u_j.

3.4 Solving for \Delta x, \Delta y

If \phi(x) is a general convex function with known Hessian H, system (12) may be treated by direct or iterative solvers. Since it is an SQD system, sparse LDL^T factors should be sufficiently stable under the same conditions as for regularized LO (A and \bar{H} of moderate norm, and \gamma\delta \ge 10^{-8}, where \gamma and \delta are the minimum values of the diagonals of D_1 and D_2). If K represents the sparse matrix in (12), PDCO uses the following lines of code to achieve reasonable efficiency:

    s = symamd(K);                           % SQD ordering (first iteration only)
    rhs = [w; r1];
    thresh = eps;                            % eps ~= 2.2e-16 suppresses partial pivoting
    [L,U,p] = lu(K(s,s),thresh,'vector');    % expect p = 1:n+m
    sqdsoln = U\(L\rhs(s));
    sqdsoln(s) = sqdsoln;
    dx = sqdsoln(1:n);
    dy = sqdsoln(n+1:n+m);

The symamd ordering can be reused for all iterations, and with row interchanges effectively suppressed by thresh = eps, we expect the original sparse LU factorization in Matlab to do no further permutations (p = 1:n+m).

Alternatively, a sparse Cholesky factorization \bar{H} = LL^T may be practical, where \bar{H} = H + diagonal terms and L is a nonsingular permuted triangle. This is trivial if \phi(x) is a separable function, since H and \bar{H} in (12) are then diagonal. System (12) may then be solved by eliminating either \Delta y or \Delta x:

    (A^T D_2^{-2} A + \bar{H})\Delta x = A^T D_2^{-2} r_1 - w,      D_2^2\Delta y = r_1 - A\Delta x,        (13)
    (A\bar{H}^{-1} A^T + D_2^2)\Delta y = A\bar{H}^{-1} w + r_1,      \bar{H}\Delta x = A^T\Delta y - w.        (14)

Sparse Cholesky factorization may again be applicable, but if an iterative solver must be used it is preferable to regard these as least-squares problems suitable for LSQR or LSMR:

    \min_{\Delta x} \ \| [D_2^{-1} A; \ L^T]\,\Delta x - [D_2^{-1} r_1; \ -L^{-1} w] \|,      D_2\Delta y = D_2^{-1}(r_1 - A\Delta x),        (15)
    \min_{\Delta y} \ \| [L^{-1} A^T; \ D_2]\,\Delta y - [L^{-1} w; \ D_2^{-1} r_1] \|,      L^T\Delta x = L^{-1}(A^T\Delta y - w).        (16)

The right-most vectors in (15)-(16) are part of the residual vectors for the least-squares problems and may be obtained as by-products from the least-squares solver.

3.5 Success

PDCO has been applied to some large web-traffic network problems with the entropy function \sum_j x_j \ln x_j as objective. Search directions were obtained by applying LSQR to (16) with diagonal H = X^{-1} and L = \bar{H}^{1/2}, with small D_1 and D_2 = 10^{-3} I. A large problem with an explicit sparse A solved in well under an hour on a desktop PC of that era, requiring a modest total number of LSQR iterations. At the time (2003), this was unexpectedly remarkable performance. Both the entropy function and the network matrix A are evidently amenable to interior methods.
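To make Section 3.4 concrete for the separable case where \bar{H} is diagonal (as it is for the SVM reformulation, whose objective is linear), the following Matlab sketch computes one search direction via route (14) using a sparse Cholesky factorization. This is our own hedged illustration, not PDCO's actual code; the variables A, hbar, d2, r1, and w are assumed to hold the current constraint matrix, the diagonals of \bar{H} and D_2, and the residuals.

    % Hedged sketch: one search direction via (14) when Hbar is diagonal.
    % Assumes sparse A (m-by-n), vectors hbar > 0 (diagonal of Hbar) and d2
    % (diagonal of D2), and current residuals r1 (m-by-1) and w (n-by-1).
    [m, n] = size(A);
    AHinv = A * spdiags(1./hbar, 0, n, n);        % A*Hbar^{-1}
    S     = AHinv*A' + spdiags(d2.^2, 0, m, m);   % A Hbar^{-1} A' + D2^2
    p     = symamd(S);                            % fill-reducing ordering
    R     = chol(S(p,p));                         % sparse Cholesky, S(p,p) = R'*R
    rhs   = AHinv*w + r1;
    dy    = zeros(m,1);
    dy(p) = R \ (R' \ rhs(p));                    % solve (14) for dy
    dx    = (A'*dy - w) ./ hbar;                  % then Hbar*dx = A'*dy - w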

4 Benchmarking

PDCO was tested here against SMO on the series of benchmark problems used in John Platt's paper: one real-world data set concerning Income Prediction, and two different artificial data sets that simulate two extreme scenarios (perfectly linearly separable data and pure noise).

1. Income Prediction: decide whether a household income is greater than $50,000. The data used are publicly available at http://ftp.ics.uci.edu/pub/machine-learning-databases/adult/. There were 32561 examples in the training set. Eight features are categorical and six are continuous. The six continuous attributes were discretized into quintiles, yielding a larger set of binary features, only a few of which are true for any given example. The training set size was varied, and the limiting value C was chosen to be 0.05. The results are compared in Table 1 and Figure 1.

2. Separable data: the data were randomly generated 300-dimensional binary vectors, with a 10% fraction of 1s. A 300-dimensional weight vector was generated randomly in [-1, 1] to decide the labels of the examples. A positive label is assigned to an input point whose dot product with the weight vector is greater than 1; if the dot product is less than -1, the point is labeled -1; otherwise, the point is discarded. C = 100. The results are shown in Table 2 and Figure 2.

3. Noise data: randomly generated 300-dimensional binary input points (10% 1s) and random output labels. C = 0.1. In this case, the SVM is fitting pure noise. The results are shown in Table 3 and Figure 3.

Specifically, for formulation (3)-(5), the quantities in PDCO's problem CO2 are x = \alpha, r = [v; \delta], \phi(x) = -e^T\alpha, A = [-Z; y^T] = [-X^T Y; y^T], H = 0, D_1 = 10^{-3} I, and D_2 = I except that the last diagonal of D_2 is a small value \rho, which represents the equality constraint as y^T\alpha + \rho\delta = 0, where \rho\delta will be very small. The feasibility and optimality tolerances were set to 10^{-6}, giving an accurate solution to well-scaled data. PDCO's new CO2 problem is as follows:

    \min_{\alpha, v, \delta} \ W(\alpha, v, \delta) = -e^T\alpha + \tfrac{1}{2}[v; \delta]^T[v; \delta]
    subject to  -X^T Y\alpha + v = 0,   y^T\alpha + \rho\delta = 0,   0 \le \alpha \le Ce.

The main work per iteration for PDCO lies in forming the matrix A\bar{H}^{-1}A^T + D_2^2 in (14) and its sparse Cholesky factorization. The computation times for SMO and PDCO on the three sets of data with varying numbers of samples are shown in Tables 1-3. The SMO times were obtained with C++ on a 266 MHz Pentium II processor, which is many times slower than Matlab on the 2.93 GHz iMac that we used for PDCO. To make the CPU times roughly comparable, we divided the SMO times by a corresponding factor in Figures 1-3.

Table 1: SMO vs PDCO on Income Prediction data (CPU time in seconds for increasing training set sizes).

Figure 1: SMO vs PDCO on Income Prediction (CPU time in seconds versus training set size).
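For concreteness, the PDCO data just described might be assembled from the training set as in the following hedged Matlab sketch. The variable names, the sparse storage choices, and the value of rho are our own assumptions; the call to the pdco solver itself [4] and its option settings are omitted.

    % Hedged sketch: assemble the CO2 data for the SVM reformulation (3)-(5).
    % Assumes sparse X (m-by-n) and labels y in {-1,+1} (m-by-1) already exist.
    [m, n] = size(X);
    C    = 0.05;                      % box bound on alpha (value used for the Income runs)
    rho  = 1e-6;                      % hypothetical small weight in y'*alpha + rho*delta = 0
    Z    = X' * spdiags(y, 0, m, m);  % Z = X'Y, size n-by-m
    A    = [-Z; y'];                  % constraint matrix of size (n+1)-by-m
    b    = zeros(n+1, 1);
    bl   = zeros(m, 1);               % bounds 0 <= alpha <= C e
    bu   = C * ones(m, 1);
    d1   = 1e-3;                      % D1 = 1e-3 I
    d2   = [ones(n, 1); rho];         % D2 = I except last diagonal rho
    % phi(alpha) = -e'*alpha has gradient -ones(m,1) and zero Hessian,
    % completing the data (phi, A, b, bl, bu, D1, D2) for problem CO2.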
5 Discussion

The tables show that the computation times for PDCO scale approximately linearly on all three data sets. Plotting these against the times for SMO revealed different slopes for each method. The figures show that PDCO scales better with the data size m than SMO on these examples. One reason is that the number of PDCO iterations increases slowly with m. Each iteration requires a sparse Cholesky factorization of an n x n matrix of the form X^T D X for some positive diagonal matrix D, with the number of features n being only a few hundred at most in these examples. The storage for the Cholesky factor depends on the sparsity of X and X^T X. This storage is minimal for modest n, but may remain tolerable for much bigger n if X and X^T X remain extremely sparse. For problems exceeding the capacity of RAM, there are distributed implementations of sparse Cholesky factorization, such as MUMPS [2].

Ferris and Munson [1] have already shown that very large SVM problems can be processed by interior methods using distributed memory. The same would be true with a distributed implementation of PDCO. This project suggests that sparse interior-point methods may prove to be competitive with coordinate-descent methods like SMO. We hope that future implementations will resolve this question.

Table 2: SMO vs PDCO on Separable data (CPU time in seconds for increasing training set sizes).

Table 3: SMO vs PDCO on Noise data (CPU time in seconds for increasing training set sizes).

Figure 2: SMO vs PDCO on Separable data (CPU time in seconds versus training set size).

Figure 3: SMO vs PDCO on Noise data (CPU time in seconds versus training set size).

References

[1] M. C. Ferris and T. S. Munson. Interior-point methods for massive support vector machines. SIAM J. Optim., 13(3):783-804, 2003.
[2] MUMPS: a multifrontal massively parallel sparse direct solver. http://mumps.enseeiht.fr/.
[3] J. C. Platt. Sequential Minimal Optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[4] M. A. Saunders. PDCO: convex optimization software. http://stanford.edu/group/sol/software/pdco/.
[5] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.