SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

Ding Ma and Michael Saunders
Dept of Management Science and Engineering, Stanford University, Stanford, CA, USA
Working paper, January 2015

1 Introduction

In machine learning, support vector machines (SVMs) are one of the best supervised learning models for data classification and regression analysis. Sequential minimal optimization (SMO) is an algorithm for solving the quadratic programming (QP) problem that arises in SVMs. It was invented by John Platt at Microsoft Research in 1998 [3] and is widely used for solving SVM models. PDCO (Primal-Dual interior method for Convex Objectives) is a Matlab primal-dual interior method for solving linearly constrained optimization problems with a convex objective function [4].

This report provides an overview of SVMs and the SMO algorithm, and a reformulation of the SVM QP problem to suit PDCO. The PDCO algorithm is then presented in detail, including the barrier subproblems arising in interior methods, the associated Newton equations, and the sparse-matrix computation used to calculate each search direction. We compare PDCO with SMO on the data sets used in Platt's paper: one real-world data set concerning Income Prediction, and two artificial data sets that simulate two extreme cases (good or poor data separation). The solver computation times are compared, with PDCO showing better scalability. We conclude with a discussion of the results and possible future work.

2 Overview of SVM

The support vector machine was invented by Vladimir Vapnik in 1979 [5]. In the linear classifier for a binary classification case, an SVM constructs a hyperplane that separates a set of points labeled y = -1 from a set of points labeled y = +1, using a vector of features x. The hyperplane is represented by x^T w + b = 0, with vector w and scalar b to be determined. To achieve more confident classification, the classifier should separate the positively and negatively labeled points with as big a gap as possible, the gap being proportional to 1/\|w\|. After normalizing the distances from the nearest points to the hyperplane on both sides, we have the optimization problem

    \min_{w,b} \ \tfrac{1}{2}\|w\|^2
    subject to  y_i(x_i^T w + b) \ge 1,   i = 1, \dots, m,        (1)

where (x_i, y_i) denotes the features and label of the ith training sample. By Lagrange duality, the above optimization problem is converted to the following dual optimization problem:

    \max_{\alpha} \ W(\alpha) = \sum_{i=1}^m \alpha_i - \tfrac{1}{2}\sum_{i=1}^m\sum_{j=1}^m y_i y_j \alpha_i \alpha_j \, x_i^T x_j
    subject to  \alpha_i \ge 0,   i = 1, \dots, m,    \sum_{i=1}^m \alpha_i y_i = 0.

Once the dual problem is solved, the normal vector w and the threshold b can be determined by

    w = \sum_{i=1}^m \alpha_i y_i x_i,    b = y_j - x_j^T w   for some  \alpha_j > 0.

For the non-separable case, \ell_1 regularization is applied so that the method still applies and is less sensitive to outliers. The original optimization problem is reformulated as follows:

    \min_{w,b} \ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^m \xi_i
    subject to  y_i(x_i^T w + b) \ge 1 - \xi_i,   \xi_i \ge 0,   i = 1, \dots, m.

Again by Lagrange duality, the following dual optimization problem is formed:

    \max_{\alpha} \ W(\alpha) = \sum_{i=1}^m \alpha_i - \tfrac{1}{2}\sum_{i=1}^m\sum_{j=1}^m y_i y_j \alpha_i \alpha_j \, x_i^T x_j
    subject to  0 \le \alpha_i \le C,   i = 1, \dots, m,    \sum_{i=1}^m \alpha_i y_i = 0.        (2)

Problem (2) is solved by sequential minimal optimization.
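For concreteness, the following Matlab sketch sets up dual (2) for a small synthetic data set and solves it with quadprog from the Optimization Toolbox, then recovers w and b as described above. This is our own illustration (the data, C, and the 1e-6 threshold are arbitrary choices), not part of SMO or PDCO.

    % Hedged illustration: solve dual (2) directly with quadprog on toy data.
    rng(1);
    m = 40;  n = 2;  C = 1;
    X = [randn(m/2,n)+1; randn(m/2,n)-1];          % two noisy clusters
    y = [ones(m/2,1); -ones(m/2,1)];               % labels +1 / -1
    Q = (y*y') .* (X*X');                          % Q_ij = y_i y_j x_i'x_j
    f = -ones(m,1);                                % minimize 1/2 a'Qa - e'a = -W(a)
    alpha = quadprog(Q, f, [], [], y', 0, zeros(m,1), C*ones(m,1));
    w  = X' * (alpha .* y);                        % w = sum_i alpha_i y_i x_i
    sv = find(alpha > 1e-6 & alpha < C - 1e-6);    % on-margin support vectors
    b  = mean(y(sv) - X(sv,:)*w);                  % b = y_j - x_j'w (assumes sv nonempty)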

2.1 SMO for SVM

SMO is a special application of the coordinate ascent method to the SVM dual optimization problem. The algorithm is as follows:

Repeat until convergence {
  1. Heuristically choose a pair of variables \alpha_i and \alpha_j.
  2. Keeping all other \alpha's fixed, optimize W(\alpha) with respect to \alpha_i and \alpha_j.
}

The convergence of SMO is tested by checking whether the KKT complementarity conditions of (2) are satisfied within some tolerance (around 10^{-3}). The KKT complementarity conditions are:

    \alpha_i = 0       \Rightarrow   y_i(x_i^T w + b) \ge 1,
    \alpha_i = C       \Rightarrow   y_i(x_i^T w + b) \le 1,
    0 < \alpha_i < C   \Rightarrow   y_i(x_i^T w + b) = 1.
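To illustrate step 2, the following Matlab sketch performs one such pairwise update for a chosen pair (i, j) with a linear kernel, in the manner of Platt [3]. It is only a hedged sketch: the index-selection heuristics, the bias update, and the outer loop are omitted, and the variables X, y, alpha, b, C are our own assumptions about the surrounding code.

    % One SMO-style pairwise update (linear kernel), following the scheme in [3].
    % Assumes X (m-by-n), y in {-1,+1}, current alpha, bias b, penalty C,
    % and a chosen index pair (i,j).
    Kii = X(i,:)*X(i,:)';  Kjj = X(j,:)*X(j,:)';  Kij = X(i,:)*X(j,:)';
    Ei  = (alpha.*y)'*(X*X(i,:)') + b - y(i);      % prediction error at sample i
    Ej  = (alpha.*y)'*(X*X(j,:)') + b - y(j);      % prediction error at sample j
    eta = Kii + Kjj - 2*Kij;                       % curvature along the feasible line
    if y(i) ~= y(j)                                % box [L,H] keeping y'*alpha = 0
        L = max(0, alpha(j) - alpha(i));    H = min(C, C + alpha(j) - alpha(i));
    else
        L = max(0, alpha(i) + alpha(j) - C);    H = min(C, alpha(i) + alpha(j));
    end
    if eta > 0 && H > L
        ajOld    = alpha(j);
        alpha(j) = min(max(ajOld + y(j)*(Ei - Ej)/eta, L), H);  % clipped step
        alpha(i) = alpha(i) + y(i)*y(j)*(ajOld - alpha(j));     % restore y'*alpha = 0
    end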

2.2 SVM reformulation

The SVM dual optimization problem can be reformulated in a way that suits a different optimization algorithm, PDCO. We define some matrix and vector quantities as follows:

    X = [x_1^T; x_2^T; \dots; x_m^T],   y = (y_1, \dots, y_m)^T,   \alpha = (\alpha_1, \dots, \alpha_m)^T,
    Y = diag(y_i),   D = diag(y_i \alpha_i),

where X is an m x n matrix with rows representing training samples and columns corresponding to features for each sample. Then \sum_{i=1}^m y_i \alpha_i x_i^T = e^T D X and De = Y\alpha, where e is a vector of 1s. Thus,

    -W(\alpha) = \tfrac{1}{2} e^T D X X^T D e - e^T\alpha = \tfrac{1}{2}\alpha^T Y^T X X^T Y \alpha - e^T\alpha.

If we define Z = X^T Y \in R^{n \times m} and v = Z\alpha, we obtain a new formulation of the SVM problem:

    \min_{\alpha, v} \ W(\alpha, v) = -e^T\alpha + \tfrac{1}{2} v^T v
    subject to  -Z\alpha + v = 0,        (3)
                y^T\alpha = 0,           (4)
                0 \le \alpha \le Ce.     (5)

Note that this formulation of the SVM problem is different from the five other formulations mentioned by Ferris and Munson [1] for their large-scale interior-point QP solver.

3 PDCO: Primal-Dual interior method for Convex Objectives

PDCO is a Matlab solver for optimization problems that are nominally of the form

    CO:   \min_x \ \phi(x)   subject to   Ax = b,   l \le x \le u,

where \phi(x) is a convex function with known gradient g(x) and Hessian H(x), and A \in R^{m \times n}. The format of CO is suitable for any linear constraints. For example, a double-sided constraint \alpha \le a^T x \le \beta (\alpha < \beta) should be entered as a^T x - \xi = 0, \alpha \le \xi \le \beta, where x and \xi are relevant parts of CO's variable x.

To allow for constrained least-squares problems, and to ensure unique primal and dual solutions and improved stability, PDCO really solves the regularized problem

    CO2:  \min_{x, r} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2   subject to   Ax + D_2 r = b,   l \le x \le u,

where D_1, D_2 are specified positive-definite diagonal matrices. The diagonals of D_1 are typically small (10^{-3} or 10^{-4}). Similarly for D_2 if the constraints in CO should be satisfied reasonably accurately. For least-squares applications, some diagonals of D_2 will be 1. Note that some elements of l and u may be -\infty and +\infty respectively, but we expect no large numbers in A, b, D_1, D_2. If D_2 is small, we would expect A to be under-determined (m < n). If D_2 = I, A may have any shape.

3.1 The barrier approach

First we introduce slack variables x_1, x_2 to convert the bounds to non-negativity constraints:

    CO3:  \min_{x, r, x_1, x_2} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2
          subject to  Ax + D_2 r = b,   x - x_1 = l,   x + x_2 = u,   x_1, x_2 \ge 0.

Then we replace the non-negativity constraints by the log barrier function, obtaining a sequence of convex subproblems with decreasing values of \mu (\mu > 0):

    CO\mu:  \min_{x, r, x_1, x_2} \ \phi(x) + \tfrac{1}{2}\|D_1 x\|^2 + \tfrac{1}{2}\|r\|^2 - \mu\sum_j \big(\ln[x_1]_j + \ln[x_2]_j\big)
            subject to  Ax + D_2 r = b   : y
                        x - x_1 = l      : z_1
                        x + x_2 = u,     : z_2

where y, z_1, z_2 denote dual variables for the associated constraints. With \mu > 0, most variables are strictly positive: x_1, x_2, z_1, z_2 > 0. Exceptions: if l_j = -\infty or u_j = \infty, the corresponding equation is omitted and the jth element of x_1 or x_2 doesn't exist.

The KKT conditions for the barrier subproblem involve the three primal equations of CO\mu, along with four dual equations stating that the gradient of the subproblem objective should be a linear combination of the gradients of the primal constraints:

    Ax + D_2 r = b
    x - x_1 = l
    x + x_2 = u
    A^T y + z_1 - z_2 = g(x) + D_1^2 x   : x
    D_2 y = r                            : r
    X_1 z_1 = \mu e                      : x_1
    X_2 z_2 = \mu e,                     : x_2

where X_1 = diag(x_1), X_2 = diag(x_2), and similarly for Z_1, Z_2 later. The last two equations are commonly called the perturbed complementarity conditions. Initially they are in a different form. The dual equation for x_1 is really z_1 = \mu \nabla\big(\sum_j \ln[x_1]_j\big) = \mu X_1^{-1} e, where e is a vector of 1s. Thus, x_1 > 0 implies z_1 > 0, and multiplying by X_1 gives the equivalent equation X_1 z_1 = \mu e as stated.

3.2 Newton's method

We now eliminate r = D_2 y and apply Newton's method:

    A(x + \Delta x) + D_2^2(y + \Delta y) = b
    (x + \Delta x) - (x_1 + \Delta x_1) = l
    (x + \Delta x) + (x_2 + \Delta x_2) = u
    A^T(y + \Delta y) + (z_1 + \Delta z_1) - (z_2 + \Delta z_2) = g + H\Delta x + D_1^2(x + \Delta x)
    X_1 z_1 + X_1\Delta z_1 + Z_1\Delta x_1 = \mu e
    X_2 z_2 + X_2\Delta z_2 + Z_2\Delta x_2 = \mu e,

where g and H are the current objective gradient and Hessian. To solve this Newton system, we work with three sets of residuals:

    \Delta x - \Delta x_1 = r_l,   \Delta x + \Delta x_2 = r_u,      where  r_l \equiv l - x + x_1,   r_u \equiv u - x - x_2,        (6)
    X_1\Delta z_1 + Z_1\Delta x_1 = c_l,   X_2\Delta z_2 + Z_2\Delta x_2 = c_u,      where  c_l \equiv \mu e - X_1 z_1,   c_u \equiv \mu e - X_2 z_2,        (7)
    A\Delta x + D_2^2\Delta y = r_1,   -(H + D_1^2)\Delta x + A^T\Delta y + \Delta z_1 - \Delta z_2 = r_2,        (8)
    where  r_1 \equiv b - Ax - D_2^2 y,   r_2 \equiv g + D_1^2 x - A^T y - z_1 + z_2.        (9)

We use (6) and (7) to eliminate \Delta x_1, \Delta x_2, \Delta z_1, \Delta z_2 from (8):

    \Delta x_1 = \Delta x - r_l,   \Delta x_2 = r_u - \Delta x,
    \Delta z_1 = X_1^{-1}(c_l - Z_1\Delta x_1),   \Delta z_2 = X_2^{-1}(c_u - Z_2\Delta x_2).        (10)

With

    \bar{H} \equiv H + D_1^2 + X_1^{-1} Z_1 + X_2^{-1} Z_2,
    w \equiv r_2 - X_1^{-1}(c_l + Z_1 r_l) + X_2^{-1}(c_u - Z_2 r_u),        (11)

we find that

    [ -\bar{H}    A^T   ] [ \Delta x ]   [ w   ]
    [    A       D_2^2  ] [ \Delta y ] = [ r_1 ].        (12)

3.3 Keeping x safe

If the objective is the entropy function \phi(x) = \sum_j x_j \ln x_j, for example, it is essential to keep x > 0. However, problems CO3 and CO\mu treat x as having no bounds except when x - x_1 = l and x + x_2 = u are satisfied in the limit. PDCO therefore safeguards x at every iteration by setting

    x_j = [x_1]_j    if  l_j = 0  and  l_j < u_j,
    x_j = -[x_2]_j   if  u_j = 0  and  l_j < u_j.

3.4 Solving for \Delta x, \Delta y

If \phi(x) is a general convex function with known Hessian H, system (12) may be treated by direct or iterative solvers. Since it is an SQD system, sparse LDL^T factors should be sufficiently stable under the same conditions as for regularized LO (A and \bar{H} of moderate norm, and \gamma\delta \ge 10^{-8}, where \gamma and \delta are the minimum values of the diagonals of D_1 and D_2). If K represents the sparse matrix in (12), PDCO uses the following lines of code to achieve reasonable efficiency:

    s = symamd(K);                           % SQD ordering (first iteration only)
    rhs = [w; r1];
    thresh = eps;                            % eps ~= 2.2e-16 suppresses partial pivoting
    [L,U,p] = lu(K(s,s),thresh,'vector');    % expect p = 1:n+m
    sqdsoln = U\(L\rhs(s));
    sqdsoln(s) = sqdsoln;
    dx = sqdsoln(1:n);
    dy = sqdsoln(n+1:n+m);

The symamd ordering can be reused for all iterations, and with row interchanges effectively suppressed by thresh = eps, we expect the original sparse LU factorization in Matlab to do no further permutations (p = 1:n+m).

Alternatively, a sparse Cholesky factorization \bar{H} = LL^T may be practical, where \bar{H} = H + diagonal terms and L is a nonsingular permuted triangle. This is trivial if \phi(x) is a separable function, since H and \bar{H} in (12) are then diagonal. System (12) may then be solved by eliminating either \Delta y or \Delta x:

    (A^T D_2^{-2} A + \bar{H})\Delta x = A^T D_2^{-2} r_1 - w,      D_2^2\Delta y = r_1 - A\Delta x,        (13)
    (A\bar{H}^{-1} A^T + D_2^2)\Delta y = A\bar{H}^{-1} w + r_1,      \bar{H}\Delta x = A^T\Delta y - w.        (14)

Sparse Cholesky factorization may again be applicable, but if an iterative solver must be used it is preferable to regard these as least-squares problems suitable for LSQR or LSMR:

    \min_{\Delta x} \ \| [D_2^{-1} A; \ L^T]\,\Delta x - [D_2^{-1} r_1; \ -L^{-1} w] \|,      D_2\Delta y = D_2^{-1}(r_1 - A\Delta x),        (15)
    \min_{\Delta y} \ \| [L^{-1} A^T; \ D_2]\,\Delta y - [L^{-1} w; \ D_2^{-1} r_1] \|,      L^T\Delta x = L^{-1}(A^T\Delta y - w).        (16)

The right-most vectors in (15)-(16) are part of the residual vectors for the least-squares problems and may be obtained as by-products from the least-squares solver.

3.5 Success

PDCO has been applied to some large web-traffic network problems with the entropy function \sum_j x_j \ln x_j as objective. Search directions were obtained by applying LSQR to (16) with diagonal H = X^{-1} and L = \bar{H}^{1/2}, with small D_1 and D_2 = 10^{-3} I. A large problem with an explicit sparse A solved in well under an hour on a desktop PC of that era, requiring a modest total number of LSQR iterations. At the time (2003), this was unexpectedly remarkable performance. Both the entropy function and the network matrix A are evidently amenable to interior methods.
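To make Section 3.4 concrete for the separable case where \bar{H} is diagonal (as it is for the SVM reformulation, whose objective is linear), the following Matlab sketch computes one search direction via route (14) using a sparse Cholesky factorization. This is our own hedged illustration, not PDCO's actual code; the variables A, hbar, d2, r1, and w are assumed to hold the current constraint matrix, the diagonals of \bar{H} and D_2, and the residuals.

    % Hedged sketch: one search direction via (14) when Hbar is diagonal.
    % Assumes sparse A (m-by-n), vectors hbar > 0 (diagonal of Hbar) and d2
    % (diagonal of D2), and current residuals r1 (m-by-1) and w (n-by-1).
    [m, n] = size(A);
    AHinv = A * spdiags(1./hbar, 0, n, n);        % A*Hbar^{-1}
    S     = AHinv*A' + spdiags(d2.^2, 0, m, m);   % A Hbar^{-1} A' + D2^2
    p     = symamd(S);                            % fill-reducing ordering
    R     = chol(S(p,p));                         % sparse Cholesky, S(p,p) = R'*R
    rhs   = AHinv*w + r1;
    dy    = zeros(m,1);
    dy(p) = R \ (R' \ rhs(p));                    % solve (14) for dy
    dx    = (A'*dy - w) ./ hbar;                  % then Hbar*dx = A'*dy - w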

4 Benchmarking

PDCO was tested here against SMO on the series of benchmark problems used in John Platt's paper: one real-world data set concerning Income Prediction, and two different artificial data sets that simulate two extreme scenarios (perfectly linearly separable data and pure noise).

1. Income Prediction: decide whether a household income is greater than $50,000. The data used are publicly available at http://ftp.ics.uci.edu/pub/machine-learning-databases/adult/. There were 32561 examples in the training set. Eight features are categorical and six are continuous. The six continuous attributes were discretized into quintiles, yielding a larger set of binary features, only a few of which are true for any given example. The training set size was varied, and the limiting value C was chosen to be 0.05. The results are compared in Table 1 and Figure 1.

2. Separable data: the data were randomly generated 300-dimensional binary vectors, with a 10% fraction of 1s. A 300-dimensional weight vector was generated randomly in [-1, 1] to decide the labels of the examples. A positive label is assigned to an input point whose dot product with the weight vector is greater than 1; if the dot product is less than -1, the point is labeled -1; otherwise, the point is discarded. C = 100. The results are shown in Table 2 and Figure 2.

3. Noise data: randomly generated 300-dimensional binary input points (10% 1s) and random output labels. C = 0.1. In this case, the SVM is fitting pure noise. The results are shown in Table 3 and Figure 3.

Specifically, for formulation (3)-(5), the quantities in PDCO's problem CO2 are x = \alpha, r = [v; \delta], \phi(x) = -e^T\alpha, A = [-Z; y^T] = [-X^T Y; y^T], H = 0, D_1 = 10^{-3} I, and D_2 = I except that the last diagonal of D_2 is a small value \rho, which represents the equality constraint as y^T\alpha + \rho\delta = 0, where \rho\delta will be very small. The feasibility and optimality tolerances were set to 10^{-6}, giving an accurate solution to well-scaled data. PDCO's new CO2 problem is as follows:

    \min_{\alpha, v, \delta} \ W(\alpha, v, \delta) = -e^T\alpha + \tfrac{1}{2}[v; \delta]^T[v; \delta]
    subject to  -X^T Y\alpha + v = 0,   y^T\alpha + \rho\delta = 0,   0 \le \alpha \le Ce.

The main work per iteration for PDCO lies in forming the matrix A\bar{H}^{-1}A^T + D_2^2 in (14) and its sparse Cholesky factorization. The computation times for SMO and PDCO on the three sets of data with varying numbers of samples are shown in Tables 1-3. The SMO times were obtained with C++ on a 266 MHz Pentium II processor, which is many times slower than Matlab on the 2.93 GHz iMac that we used for PDCO. To make the CPU times roughly comparable, we divided the SMO times by a corresponding factor in Figures 1-3.

Table 1: SMO vs PDCO on Income Prediction data (CPU time in seconds for increasing training set sizes).

Figure 1: SMO vs PDCO on Income Prediction (CPU time in seconds versus training set size).
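For concreteness, the PDCO data just described might be assembled from the training set as in the following hedged Matlab sketch. The variable names, the sparse storage choices, and the value of rho are our own assumptions; the call to the pdco solver itself [4] and its option settings are omitted.

    % Hedged sketch: assemble the CO2 data for the SVM reformulation (3)-(5).
    % Assumes sparse X (m-by-n) and labels y in {-1,+1} (m-by-1) already exist.
    [m, n] = size(X);
    C    = 0.05;                      % box bound on alpha (value used for the Income runs)
    rho  = 1e-6;                      % hypothetical small weight in y'*alpha + rho*delta = 0
    Z    = X' * spdiags(y, 0, m, m);  % Z = X'Y, size n-by-m
    A    = [-Z; y'];                  % constraint matrix of size (n+1)-by-m
    b    = zeros(n+1, 1);
    bl   = zeros(m, 1);               % bounds 0 <= alpha <= C e
    bu   = C * ones(m, 1);
    d1   = 1e-3;                      % D1 = 1e-3 I
    d2   = [ones(n, 1); rho];         % D2 = I except last diagonal rho
    % phi(alpha) = -e'*alpha has gradient -ones(m,1) and zero Hessian,
    % completing the data (phi, A, b, bl, bu, D1, D2) for problem CO2.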
5 Discussion

The tables show that the computation times for PDCO scale approximately linearly on all three data sets. Plotting these against the times for SMO revealed different slopes for each method. The figures show that PDCO scales better with the data size m than SMO on these examples. One reason is that the number of PDCO iterations increases slowly with m. Each iteration requires a sparse Cholesky factorization of an n x n matrix of the form X^T D X for some positive diagonal matrix D, with the number of features n being only a few hundred at most in these examples. The storage for the Cholesky factor depends on the sparsity of X and X^T X. This storage is minimal for modest n, but may remain tolerable for much bigger n if X and X^T X remain extremely sparse. For problems exceeding the capacity of RAM, there are distributed implementations of sparse Cholesky factorization, such as MUMPS [2].

Ferris and Munson [1] have already shown that very large SVM problems can be processed by interior methods using distributed memory. The same would be true with a distributed implementation of PDCO. This project suggests that sparse interior-point methods may prove to be competitive with coordinate-descent methods like SMO. We hope that future implementations will resolve this question.

Table 2: SMO vs PDCO on Separable data (CPU time in seconds for increasing training set sizes).

Table 3: SMO vs PDCO on Noise data (CPU time in seconds for increasing training set sizes).

Figure 2: SMO vs PDCO on Separable data (CPU time in seconds versus training set size).

Figure 3: SMO vs PDCO on Noise data (CPU time in seconds versus training set size).

References

[1] M. C. Ferris and T. S. Munson. Interior-point methods for massive support vector machines. SIAM J. Optim., 13(3):783-804, 2003.
[2] MUMPS: a multifrontal massively parallel sparse direct solver. http://mumps.enseeiht.fr/.
[3] J. C. Platt. Sequential Minimal Optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.
[4] M. A. Saunders. PDCO: convex optimization software. http://stanford.edu/group/sol/software/pdco/.
[5] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.