SMO Algorithms for Support Vector Machines without Bias Term


Institute of Automatic Control
Laboratory for Control Systems and Process Automation
Prof. Dr.-Ing. Dr. h.c. Rolf Isermann

Michael Vogt, 18-Jul-2002

1 SMO Classification without Bias Term

Platt's SMO algorithm [1] can be modified if the bias term b of the support vector machine does not need to be computed. This is the case for kernel functions that provide an implicit bias, such as inhomogeneous polynomials or Gaussian kernels. The modification leads to a simpler and faster algorithm.

1.1 The Minimization Problem

The main advantage of an implicit bias is that the summation constraint of the dual optimization problem does not exist: it originates from setting the derivative of the primal Lagrangian with respect to b to zero, and without an explicit b there is no such condition. The dual problem is then given by

    L = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k_{ij} - \sum_{i=1}^{N} \alpha_i \;\to\; \min_{\alpha}
    \quad \text{s.t.} \quad 0 \le \alpha_i \le C \quad \text{for } i = 1, \dots, N,

where the α_i are the dual variables, i.e., the Lagrange multipliers of the primal problem, y_i ∈ {±1} are the desired output values, and k_ij = k(x_i, x_j) are the kernel function values. The slack variables ξ_i ≥ 0 do not appear in the dual problem.

In the original algorithm, two parameters have to be optimized per step to keep the solution feasible with respect to the summation constraint. The algorithm adapted to the problem defined above sequentially optimizes only one parameter α_i per step, which makes it significantly simpler and faster: on the one hand, the analytical solution of a one-dimensional problem is easier to compute; on the other hand, the (occasionally time-consuming) selection of a second parameter is not needed. The parameter to optimize is chosen by Platt's first-choice heuristic alone. The absence of the second-choice heuristic raises the question whether an error cache (or function cache) is still needed. Simulations have shown that a cache does speed up the algorithm, since the cached values can also be used for checking the KKT conditions.
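
To make the dual problem concrete, the following sketch evaluates the objective and the box constraints for a given parameter vector. It assumes a precomputed kernel matrix K and labels y ∈ {±1} stored as NumPy arrays; the function names are illustrative and not part of the original report.

```python
import numpy as np

def dual_objective(alpha, y, K):
    """L(alpha) = 1/2 * sum_ij alpha_i alpha_j y_i y_j k_ij - sum_i alpha_i.

    alpha : (N,) dual variables
    y     : (N,) labels in {-1, +1}
    K     : (N, N) kernel matrix with K[i, j] = k(x_i, x_j)
    """
    v = alpha * y                      # elementwise alpha_i * y_i
    return 0.5 * v @ K @ v - alpha.sum()

def is_feasible(alpha, C):
    """Only the box constraints 0 <= alpha_i <= C remain once the
    summation constraint has disappeared."""
    return bool(np.all(alpha >= 0.0) and np.all(alpha <= C))
```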

1.2 The Analytical Solution

In each step of the SMO algorithm, L is minimized analytically with respect to the chosen parameter. Without loss of generality, we assume that α_1 is optimized while α_2, ..., α_N are kept fixed. The derivation uses y_i² = 1 (since y_i ∈ {±1}) and the symmetry k_ij = k_ji. Separating all terms of L that contain α_1 yields

    L = \frac{1}{2} \alpha_1^2 + \Bigl( y_1 \sum_{j=2}^{N} \alpha_j y_j k_{1j} - 1 \Bigr) \alpha_1 + \mathrm{const},

where the constant collects all terms that do not depend on α_1. With α_i = α_i^old for i = 2, ..., N and the cached SVM output values

    f_i = \sum_{j=1}^{N} \alpha_j^{\mathrm{old}} y_j k_{ij}

it follows that

    \sum_{j=2}^{N} \alpha_j y_j k_{1j} = \sum_{j=2}^{N} \alpha_j^{\mathrm{old}} y_j k_{1j} = f_1 - \alpha_1^{\mathrm{old}} y_1

and therefore

    L = \frac{1}{2} \alpha_1^2 + \bigl( y_1 f_1 - \alpha_1^{\mathrm{old}} - 1 \bigr) \alpha_1 + \mathrm{const}.

L is minimal with respect to α_1 if ∂L/∂α_1 = 0:

    \frac{\partial L}{\partial \alpha_1} = \alpha_1 + y_1 f_1 - \alpha_1^{\mathrm{old}} - 1 \overset{!}{=} 0
    \quad \Longrightarrow \quad
    \alpha_1 - \alpha_1^{\mathrm{old}} = 1 - y_1 f_1
    \quad \Longrightarrow \quad
    \alpha_1 = \alpha_1^{\mathrm{old}} + 1 - y_1 f_1.

Since we want to use an error cache containing the error values E_i = f_i - y_i instead of the function values f_i, we can exploit the identity

    1 - y_i f_i = y_i^2 - y_i f_i = -y_i (f_i - y_i) = -y_i E_i

(which relies on the fact that y_i ∈ {±1} and therefore y_i² = 1) to express the solution of the unconstrained minimization problem as

    \alpha_1 = \alpha_1^{\mathrm{old}} - y_1 E_1.

The solution of the constrained problem is found by clipping α_1 to the interval [0, C]:

    \alpha_1^{\mathrm{new}} = \min \{ \max \{ \alpha_1, 0 \}, C \}.
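
As a minimal illustration of this update, the sketch below applies the clipped one-parameter step to a NumPy array of dual variables. The names (take_step, E for the error cache) are illustrative; updating the cache itself is the subject of Section 1.4 and is left to the caller.

```python
import numpy as np

def take_step(i, alpha, y, E, C):
    """One-parameter update of the bias-free SMO classifier.

    Unconstrained minimizer: alpha_i_old - y_i * E_i, with E_i = f_i - y_i,
    clipped afterwards to the box [0, C]. Returns True if alpha[i] changed.
    """
    a_new = float(np.clip(alpha[i] - y[i] * E[i], 0.0, C))
    if a_new == alpha[i]:
        return False
    alpha[i] = a_new                  # error cache must be updated afterwards
    return True
```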

1.3 Checking the KKT Conditions

The Karush-Kuhn-Tucker conditions can be exploited to check the optimality of the current solution. Before each optimization step, the KKT conditions for α_i are evaluated to decide whether an update of α_i is needed. This check takes very little computation time if an error cache or function cache is used. The conditions for α_i are

    (y_i f_i + \xi_i - 1)\, \alpha_i = 0, \qquad (C - \alpha_i)\, \xi_i = 0,

where ξ_i are the slack variables of the primal problem. At the optimal solution, the KKT conditions can be fulfilled in three different ways:

Case 1: α_i = 0. Then C - α_i = C ≠ 0 forces ξ_i = 0, and the primal constraint y_i f_i ≥ 1 - ξ_i gives y_i f_i - 1 ≥ 0.

Case 2: 0 < α_i < C. Then C - α_i > 0 forces ξ_i = 0, and α_i ≠ 0 forces y_i f_i + ξ_i - 1 = 0, hence y_i f_i - 1 = 0.

Case 3: α_i = C. Then C - α_i = 0, so ξ_i may be positive; α_i = C > 0 forces y_i f_i + ξ_i - 1 = 0, hence y_i f_i - 1 = -ξ_i ≤ 0, since ξ_i ≥ 0 is a constraint of the primal problem.

Since we prefer an error cache (instead of a function cache), and y_i f_i - 1 = y_i f_i - y_i² = y_i (f_i - y_i) = y_i E_i, an optimal α_i has to obey one of the following three conditions:

    \alpha_i = 0 \;\Rightarrow\; y_i E_i \ge 0, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i E_i = 0, \qquad
    \alpha_i = C \;\Rightarrow\; y_i E_i \le 0.

In other words, α_i needs to be updated if

    (\alpha_i < C \;\wedge\; y_i E_i < 0) \quad \text{or} \quad (\alpha_i > 0 \;\wedge\; y_i E_i > 0).

SMO treats α_i as optimal if the KKT conditions are fulfilled up to some precision τ; a typical value for classification tasks is τ = 10^{-3}. Therefore SMO performs an update only if

    (\alpha_i < C \;\wedge\; y_i E_i < -\tau) \quad \text{or} \quad (\alpha_i > 0 \;\wedge\; y_i E_i > \tau).
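
In code, this check reduces to two comparisons per sample against the error cache. The helper below is an illustrative sketch using the same array conventions as before.

```python
def violates_kkt(i, alpha, y, E, C, tau=1e-3):
    """True if alpha_i violates the KKT conditions by more than tau,
    i.e. if an update of alpha_i is worthwhile (classification case)."""
    r = y[i] * E[i]
    return (alpha[i] < C and r < -tau) or (alpha[i] > 0.0 and r > tau)
```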

1.4 Updating the Cache

As described above, a cache should be used to speed up the checking of the KKT conditions. If the parameter α_1 has been optimized, a function cache can be updated in the following way:

    f_i = \sum_{j=1}^{N} \alpha_j y_j k_{ij}
        = \alpha_1 y_1 k_{i1} + \sum_{j=2}^{N} \alpha_j^{\mathrm{old}} y_j k_{ij}
        = \alpha_1 y_1 k_{i1} + \bigl( f_i^{\mathrm{old}} - \alpha_1^{\mathrm{old}} y_1 k_{i1} \bigr)
        = f_i^{\mathrm{old}} + \bigl( \alpha_1 - \alpha_1^{\mathrm{old}} \bigr) y_1 k_{i1}.

With the abbreviation t_1 = (α_1 - α_1^old) y_1, which is independent of i, the update rule for f_i can be written as

    f_i = f_i^{\mathrm{old}} + t_1 k_{i1} \quad \text{for } i = 1, \dots, N.

Since y_i is constant, the update of an error cache works in the same way:

    E_i = f_i - y_i = f_i^{\mathrm{old}} + t_1 k_{i1} - y_i = E_i^{\mathrm{old}} + t_1 k_{i1}.

Platt's original SMO algorithm updates only those E_i (or f_i, respectively) with 0 < α_i < C, because these are the only values involved in the second-choice heuristic [1]. When SMO enters the takestep procedure, only these values are retrieved from the cache for the KKT check; all other values are re-computed. This trick is not as crucial for the simplified algorithm, since there is no second-choice heuristic; it provides only a small acceleration.
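
Putting Sections 1.1 to 1.4 together, the following self-contained sketch trains the bias-free classifier: it sweeps over the samples with Platt's first-choice heuristic (full passes alternating with passes over the non-bound samples), applies the clipped one-parameter update, and keeps the complete error cache up to date after every change. It is an illustration under the assumptions stated in the text, not a tuned implementation, and all names are illustrative.

```python
import numpy as np

def smo_classification_no_bias(K, y, C, tau=1e-3):
    """Simplified SMO for SVM classification without bias term.

    K : (N, N) kernel matrix, y : (N,) labels in {-1, +1}, C : box bound.
    Returns the dual variables alpha.
    """
    N = len(y)
    alpha = np.zeros(N)
    E = -y.astype(float)              # f_i = 0 initially, hence E_i = -y_i
    examine_all = True
    while True:
        # First-choice heuristic: all samples, or only the non-bound ones.
        idx = np.arange(N) if examine_all else np.flatnonzero((alpha > 0) & (alpha < C))
        num_changed = 0
        for i in idx:
            r = y[i] * E[i]
            if (alpha[i] < C and r < -tau) or (alpha[i] > 0 and r > tau):
                a_new = float(np.clip(alpha[i] - y[i] * E[i], 0.0, C))
                if a_new != alpha[i]:
                    t = (a_new - alpha[i]) * y[i]     # t_1 of Section 1.4
                    alpha[i] = a_new
                    E += t * K[:, i]                  # E_j <- E_j + t_1 * k_j1
                    num_changed += 1
        if examine_all:
            if num_changed == 0:
                break                 # all KKT conditions fulfilled within tau
            examine_all = False
        elif num_changed == 0:
            examine_all = True
    return alpha
```

After training, the decision function is f(x) = Σ_i α_i y_i k(x_i, x) without an explicit bias term, since the kernel provides the bias implicitly.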

2 SMO Regression without Bias Term

SMO regression works similarly to SMO classification. The main difference is that each data point requires two slack variables ξ_i and ξ_i*, measuring the distance above and below the ε-tube [2]. Only one of the two can be different from zero. Consequently, the solution has 2N parameters, but at least half of them are zero. As in the classification case, the standard algorithm can be simplified if kernel functions with an implicit bias make an explicit bias term dispensable.

2.1 The Minimization Problem

Also for regression, an implicit bias eliminates the summation constraint of the dual problem, which is now given by

    L = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k_{ij}
        + \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*)
        - \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*) \;\to\; \min_{\alpha, \alpha^*}
    \quad \text{s.t.} \quad 0 \le \alpha_i, \alpha_i^* \le C \quad \text{for } i = 1, \dots, N,

with the dual variables α_i and α_i*. The dual variables additionally satisfy α_i α_i* = 0. y_i are the desired output values, and k_ij = k(x_i, x_j) are the kernel function values. The primal slack variables ξ_i and ξ_i* appear only in the KKT conditions, not in the dual problem.

The simplified algorithm chooses a sample i by Platt's first-choice heuristic and then performs a joint optimization of α_i and α_i*. An algorithm based on the original SMO method would have to optimize four parameters in one step to obey the summation constraint, as shown by Schölkopf and Smola [2].
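
For reference, the regression dual can be evaluated in the same way as the classification dual; the sketch below assumes NumPy arrays and illustrative names.

```python
import numpy as np

def regression_dual_objective(alpha, alpha_star, y, K, eps):
    """L(alpha, alpha*) = 1/2 * sum_ij (a_i - a*_i)(a_j - a*_j) k_ij
                          + eps * sum_i (a_i + a*_i) - sum_i y_i (a_i - a*_i)."""
    d = alpha - alpha_star
    return 0.5 * d @ K @ d + eps * (alpha + alpha_star).sum() - y @ d
```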

2.2 The Analytical Solution

In contrast to classification (where α_i* does not exist), L is minimized with respect to both α_i and α_i* in each step. Again, without loss of generality, we assume that α_1 and α_1* are the variables to be optimized and separate them in L. Collecting all terms that contain α_1 or α_1* (and writing const for the remaining terms) gives

    L = \frac{1}{2} (\alpha_1 - \alpha_1^*)^2
        + \Bigl( \sum_{j=2}^{N} (\alpha_j - \alpha_j^*) k_{1j} - y_1 \Bigr) (\alpha_1 - \alpha_1^*)
        + \varepsilon (\alpha_1 + \alpha_1^*) + \mathrm{const}.

With α_i = α_i^old and α_i* = α_i*^old for i = 2, ..., N and the cached SVM output values

    f_i = \sum_{j=1}^{N} (\alpha_j^{\mathrm{old}} - \alpha_j^{*\,\mathrm{old}}) k_{ij}

it follows that

    \sum_{j=2}^{N} (\alpha_j - \alpha_j^*) k_{1j} = \sum_{j=2}^{N} (\alpha_j^{\mathrm{old}} - \alpha_j^{*\,\mathrm{old}}) k_{1j} = f_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}})

and therefore

    L = \frac{1}{2} \alpha_1^2 + \bigl( f_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - y_1 + \varepsilon \bigr) \alpha_1
      + \frac{1}{2} \alpha_1^{*2} - \bigl( f_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - y_1 - \varepsilon \bigr) \alpha_1^* + \mathrm{const}.

This result is obtained because only one of the variables α_1 and α_1* can be different from zero, which implies α_1 α_1* = 0, so the cross term of (α_1 - α_1*)² vanishes. Since we usually want to employ an error cache instead of a function cache, L reads as

    L = \frac{1}{2} \alpha_1^2 + \bigl( E_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) + \varepsilon \bigr) \alpha_1
      + \frac{1}{2} \alpha_1^{*2} - \bigl( E_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - \varepsilon \bigr) \alpha_1^* + \mathrm{const}.

The solution of the unconstrained problem is found by evaluating the conditions ∂L/∂α_1 = 0 and ∂L/∂α_1* = 0, respectively. Since there are no cross terms, α_1 and α_1* can be determined independently:

    \frac{\partial L}{\partial \alpha_1} = \alpha_1 + E_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) + \varepsilon \overset{!}{=} 0
    \quad \Longrightarrow \quad
    \alpha_1 = (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - (E_1 + \varepsilon).

α_1* is computed in the same way:

    \frac{\partial L}{\partial \alpha_1^*} = \alpha_1^* - \bigl( E_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - \varepsilon \bigr) \overset{!}{=} 0
    \quad \Longrightarrow \quad
    \alpha_1^* = E_1 - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) - \varepsilon = -\alpha_1 - 2\varepsilon.

As mentioned above, L does not contain any cross terms, which implies that the minimum of the constrained optimization problem is again found by clipping α_1 and α_1* to the interval [0, C]:

    \alpha_1^{\mathrm{new}} = \min \{ \max \{ \alpha_1, 0 \}, C \}, \qquad
    \alpha_1^{*\,\mathrm{new}} = \min \{ \max \{ \alpha_1^*, 0 \}, C \}.
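
The joint update can be implemented directly from these formulas. As before, the sketch assumes an error cache E with E_i = f_i - y_i and illustrative names, and leaves the cache update of Section 2.4 to the caller.

```python
import numpy as np

def take_step_regression(i, alpha, alpha_star, E, C, eps):
    """Joint update of alpha_i and alpha*_i for bias-free SMO regression.

    With d_old = alpha_i - alpha*_i, the unconstrained minimizers are
        a      = d_old - (E_i + eps)
        a_star = -a - 2 * eps
    Both are clipped independently to [0, C]. Returns True if anything changed.
    """
    d_old = alpha[i] - alpha_star[i]
    a = d_old - (E[i] + eps)
    a_new = float(np.clip(a, 0.0, C))
    a_star_new = float(np.clip(-a - 2.0 * eps, 0.0, C))
    changed = (a_new != alpha[i]) or (a_star_new != alpha_star[i])
    alpha[i], alpha_star[i] = a_new, a_star_new
    return changed
```

Note that for ε > 0 at most one of the two clipped values can be strictly positive, which is consistent with the condition α_i α_i* = 0.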

2.3 Checking the KKT Conditions

As in the classification case, the Karush-Kuhn-Tucker conditions are exploited for an optimality check. Here we have to consider the conditions for α_i and for α_i*:

    (\varepsilon + \xi_i + f_i - y_i)\, \alpha_i = 0, \qquad (C - \alpha_i)\, \xi_i = 0,
    \qquad
    (\varepsilon + \xi_i^* - f_i + y_i)\, \alpha_i^* = 0, \qquad (C - \alpha_i^*)\, \xi_i^* = 0.

Starting with α_i, we again have to consider three different ways to fulfill the KKT conditions at the optimum:

Case 1: α_i = 0. Then C - α_i = C ≠ 0 forces ξ_i = 0, and the primal constraint y_i - f_i ≤ ε + ξ_i gives ε + f_i - y_i ≥ 0.

Case 2: 0 < α_i < C. Then C - α_i > 0 forces ξ_i = 0, and α_i ≠ 0 forces ε + ξ_i + f_i - y_i = 0, hence ε + f_i - y_i = 0.

Case 3: α_i = C. Then C - α_i = 0, so ξ_i may be positive; α_i = C > 0 forces ε + ξ_i + f_i - y_i = 0, hence ε + f_i - y_i = -ξ_i ≤ 0, since ξ_i ≥ 0 is a constraint of the primal problem.

Using (as usual) the error values E_i = f_i - y_i, the conditions for an optimal α_i can be summarized as

    \alpha_i = 0 \;\Rightarrow\; \varepsilon + E_i \ge 0, \qquad
    0 < \alpha_i < C \;\Rightarrow\; \varepsilon + E_i = 0, \qquad
    \alpha_i = C \;\Rightarrow\; \varepsilon + E_i \le 0.

In the same way we find the conditions for α_i*:

    \alpha_i^* = 0 \;\Rightarrow\; \varepsilon - E_i \ge 0, \qquad
    0 < \alpha_i^* < C \;\Rightarrow\; \varepsilon - E_i = 0, \qquad
    \alpha_i^* = C \;\Rightarrow\; \varepsilon - E_i \le 0.

As α_i and α_i* are jointly optimized, the solution is sub-optimal with respect to the pair (α_i, α_i*) if

    (\alpha_i < C \;\wedge\; \varepsilon + E_i < 0) \quad \text{or} \quad (\alpha_i > 0 \;\wedge\; \varepsilon + E_i > 0) \quad \text{or}
    \quad (\alpha_i^* < C \;\wedge\; \varepsilon - E_i < 0) \quad \text{or} \quad (\alpha_i^* > 0 \;\wedge\; \varepsilon - E_i > 0).

SMO treats α_i and α_i* as optimal if the KKT conditions are fulfilled up to some precision τ. In contrast to classification tasks (where the output values are ±1), a fixed τ is not suitable for regression problems. It should instead depend on the magnitude of the output values, e.g., τ = 10^{-3} · max{|y_1|, ..., |y_N|}. SMO performs a joint update of α_i and α_i* if

    (\alpha_i < C \;\wedge\; \varepsilon + E_i < -\tau) \quad \text{or} \quad (\alpha_i > 0 \;\wedge\; \varepsilon + E_i > \tau) \quad \text{or}
    \quad (\alpha_i^* < C \;\wedge\; \varepsilon - E_i < -\tau) \quad \text{or} \quad (\alpha_i^* > 0 \;\wedge\; \varepsilon - E_i > \tau).
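
The corresponding check against the error cache could look as follows; the names are illustrative and follow the conventions of the earlier sketches.

```python
def violates_kkt_regression(i, alpha, alpha_star, E, C, eps, tau):
    """True if (alpha_i, alpha*_i) violate the KKT conditions by more than tau,
    i.e. if a joint update of both variables is required.

    As suggested in the text, tau is chosen relative to the output magnitude,
    e.g. tau = 1e-3 * max(|y_1|, ..., |y_N|).
    """
    return ((alpha[i] < C and eps + E[i] < -tau) or
            (alpha[i] > 0 and eps + E[i] > tau) or
            (alpha_star[i] < C and eps - E[i] < -tau) or
            (alpha_star[i] > 0 and eps - E[i] > tau))
```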

2.4 Updating the Cache

Also for regression problems, a cache should be used for quickly checking the KKT conditions. The update of the function values works in the following way:

    f_i = \sum_{j=1}^{N} (\alpha_j - \alpha_j^*) k_{ij}
        = (\alpha_1 - \alpha_1^*) k_{i1} + \sum_{j=2}^{N} (\alpha_j^{\mathrm{old}} - \alpha_j^{*\,\mathrm{old}}) k_{ij}
        = (\alpha_1 - \alpha_1^*) k_{i1} + \bigl( f_i^{\mathrm{old}} - (\alpha_1^{\mathrm{old}} - \alpha_1^{*\,\mathrm{old}}) k_{i1} \bigr)
        = f_i^{\mathrm{old}} + \bigl( \alpha_1 - \alpha_1^* - \alpha_1^{\mathrm{old}} + \alpha_1^{*\,\mathrm{old}} \bigr) k_{i1}.

With the auxiliary variable t_1 = α_1 - α_1* - α_1^old + α_1*^old, which is independent of i, the update rule can be written as

    f_i = f_i^{\mathrm{old}} + t_1 k_{i1} \quad \text{for } i = 1, \dots, N,

or, if an error cache is employed, as

    E_i = E_i^{\mathrm{old}} + t_1 k_{i1} \quad \text{for } i = 1, \dots, N.

References

[1] John C. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: B. Schölkopf, C. Burges, and A. Smola (eds.): Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.

[2] Bernhard Schölkopf and Alexander Smola: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.