Institute of Automatic Control
Laboratory for Control Systems and Process Automation
Prof. Dr.-Ing. Dr. h. c. Rolf Isermann

SMO Algorithms for Support Vector Machines without Bias Term

Michael Vogt, 18-Jul-2002

1 SMO Classification without Bias Term

Platt's SMO algorithm [1] can be modified if the bias term b of the support vector machine does not need to be computed. This is possible for kernel functions that provide an implicit bias, such as inhomogeneous polynomials or Gaussian kernels. In this case the modification leads to a simpler and faster algorithm.

1.1 The Minimization Problem

The main advantage of an implicit bias is that the summation constraint of the dual optimization problem does not exist: this constraint stems from setting the derivative of the primal Lagrangian with respect to b to zero, and without a bias term there is no such derivative. The dual problem is then given by

    L = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k_{ij} - \sum_{i=1}^{N} \alpha_i \to \min_{\alpha}

    s.t.  0 \le \alpha_i \le C  for i = 1, ..., N,

where α_i are the dual variables, i.e., the Lagrange multipliers of the primal problem, y_i ∈ {±1} are the desired output values, and k_ij = k(x_i, x_j) are the kernel function values. The slack variables ξ_i ≥ 0 do not appear in the dual problem.

In the original algorithm, two parameters have to be optimized per step to ensure that the solution obeys the summation constraint. The algorithm adapted to the problem defined above sequentially optimizes only one parameter α_i per step, which makes it significantly simpler and faster: on the one hand, the analytical solution is easier to compute for a one-dimensional problem; on the other hand, no (occasionally time-consuming) selection of a second parameter is needed. The parameter to optimize is simply chosen by Platt's first choice heuristic. The absence of the second choice heuristic raises the question of whether an error cache (or function cache) is needed at all. However, simulations have shown that a cache still speeds up the algorithm, since the cached values can also be used for checking the KKT conditions.
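For concreteness, the bias-free dual objective can be evaluated directly, since only box constraints remain. A minimal numeric sketch (toy data and kernel width chosen purely for illustration):

```python
import numpy as np

# Illustrative toy problem: 3 samples in 1D, Gaussian kernel (k_ii = 1).
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, -1.0, 1.0])
K = np.exp(-0.5 * (X - X.T) ** 2)      # k_ij = exp(-||x_i - x_j||^2 / 2)

def dual_objective(alpha):
    """L = 1/2 * sum_ij alpha_i alpha_j y_i y_j k_ij - sum_i alpha_i."""
    Q = np.outer(y, y) * K             # Q_ij = y_i y_j k_ij
    return 0.5 * alpha @ Q @ alpha - alpha.sum()
```

Because the summation constraint is absent, the feasible set is simply the box [0, C]^N, so each α_i can be optimized on its own.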
1.2 The Analytical Solution

In each step of the SMO algorithm, L is minimized with respect to the chosen parameter, which can be done analytically. Without loss of generality, we assume that α_1 is optimized while α_2, ..., α_N are kept fixed. Additionally, y_i^2 = 1 (since y_i ∈ {±1}) and k_ij = k_ji are used. Separating all terms of L that contain α_1 yields

    L = \frac{1}{2} \alpha_1^2 y_1^2 k_{11} + \alpha_1 y_1 \sum_{j=2}^{N} \alpha_j y_j k_{1j} - \alpha_1 + \underbrace{\frac{1}{2} \sum_{i=2}^{N} \sum_{j=2}^{N} \alpha_i \alpha_j y_i y_j k_{ij} - \sum_{i=2}^{N} \alpha_i}_{A}

      = \frac{1}{2} k_{11} \alpha_1^2 + \Big( y_1 \sum_{j=2}^{N} \alpha_j y_j k_{1j} - 1 \Big) \alpha_1 + A,

where A collects all terms that do not depend on α_1, and the two symmetric cross sums have been combined using k_ij = k_ji. With α_i = α_i^old for i = 2, ..., N and the cached SVM output values

    f_i = \sum_{j=1}^{N} \alpha_j^{old} y_j k_{ij}

follows

    \sum_{j=2}^{N} \alpha_j y_j k_{1j} = f_1 - \alpha_1^{old} y_1 k_{11}

and therefore

    L = \frac{1}{2} k_{11} \alpha_1^2 + \big( y_1 f_1 - k_{11} \alpha_1^{old} - 1 \big) \alpha_1 + \mathrm{const}.

L is minimal with respect to α_1 if ∂L/∂α_1 = 0:

    \frac{\partial L}{\partial \alpha_1} = k_{11} \alpha_1 + y_1 f_1 - k_{11} \alpha_1^{old} - 1 \overset{!}{=} 0

    k_{11} (\alpha_1 - \alpha_1^{old}) = 1 - y_1 f_1
    \alpha_1 = \alpha_1^{old} + \frac{1 - y_1 f_1}{k_{11}}

Since we want to use an error cache containing the error values E_i = f_i - y_i instead of the function values f_i, we can exploit the equality

    1 - y_i f_i = y_i^2 - y_i f_i = -y_i (f_i - y_i) = -y_i E_i
(which relies on the fact that y_i ∈ {±1} and therefore y_i^2 = 1) to express the solution of the unconstrained minimization problem as

    \alpha_1 = \alpha_1^{old} - \frac{y_1 E_1}{k_{11}}.

The solution of the constrained problem is found by clipping α_1 to the interval [0, C]:

    \alpha_1^{new} = \min \{ \max \{ \alpha_1, 0 \}, C \}

1.3 Checking the KKT Conditions

The Karush-Kuhn-Tucker conditions can be exploited to check the optimality of the current solution. Before each optimization step, the KKT conditions for α_i are evaluated to decide whether an update of α_i is needed. This check takes only very little computation time if an error cache or function cache is used. The conditions for α_i are

    (y_i f_i + \xi_i - 1) \alpha_i = 0
    (C - \alpha_i) \xi_i = 0,

where ξ_i are the slack variables of the primal problem. At the optimal solution, the KKT conditions can be fulfilled in three different ways:

Case 1: α_i = 0:
    C - \alpha_i = C > 0 \implies \xi_i = 0 \implies y_i f_i - 1 \ge 0
The sign in the last step follows from the constraints of the primal problem.

Case 2: 0 < α_i < C:
    C - \alpha_i > 0 \implies \xi_i = 0; \quad \alpha_i > 0 \implies y_i f_i - 1 = 0

Case 3: α_i = C:
    C - \alpha_i = 0 \implies \xi_i \ge 0; \quad \alpha_i = C > 0 \implies y_i f_i - 1 = -\xi_i \le 0
The sign in the first step follows from the constraints of the primal problem.

Since we prefer an error cache (instead of a function cache), and y_i f_i - 1 = y_i E_i, an optimal α_i has to obey one of the following three conditions:

    \alpha_i = 0 \implies y_i E_i \ge 0   or
    0 < \alpha_i < C \implies y_i E_i = 0   or
    \alpha_i = C \implies y_i E_i \le 0.
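The clipped single-parameter step can be written very compactly. A minimal sketch (the function name is illustrative; the division by k_11 = k(x_1, x_1) is made explicit here, and equals 1 for Gaussian kernels):

```python
def update_alpha(alpha_old, y1, E1, k11, C):
    """One bias-free SMO step: alpha_1 = alpha_1_old - y1*E1/k11, clipped to [0, C]."""
    alpha_uncon = alpha_old - y1 * E1 / k11   # unconstrained minimizer
    return min(max(alpha_uncon, 0.0), C)      # clip to the box [0, C]
```

For example, update_alpha(0.5, 1.0, 0.2, 1.0, 1.0) gives 0.3, while an unconstrained result outside [0, C] is clipped to the nearest bound.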
In other words, α_i needs to be updated if

    \alpha_i < C \wedge y_i E_i < 0   or   \alpha_i > 0 \wedge y_i E_i > 0

SMO assumes α_i to be optimal if the KKT conditions are fulfilled with some precision τ. A typical value for classification tasks is τ = 10^{-3}. Therefore, SMO performs an update only if

    \alpha_i < C \wedge y_i E_i < -\tau   or   \alpha_i > 0 \wedge y_i E_i > \tau

1.4 Updating the Cache

As described above, a cache should be used to speed up the checking of the KKT conditions. After the parameter α_1 has been optimized, a function cache can be updated in the following way:

    f_i = \sum_{j=1}^{N} \alpha_j y_j k_{ij} = \alpha_1 y_1 k_{i1} + \sum_{j=2}^{N} \alpha_j^{old} y_j k_{ij}
        = \alpha_1 y_1 k_{i1} + \big( f_i^{old} - \alpha_1^{old} y_1 k_{i1} \big)
        = f_i^{old} + (\alpha_1 - \alpha_1^{old}) y_1 k_{i1}

With the abbreviation t_1 = (α_1 - α_1^old) y_1 (which is independent of i), the update rule for f_i can be written as

    f_i = f_i^{old} + t_1 k_{i1}   for i = 1, ..., N.

Since y_i is constant, the update of an error cache works in the same way:

    E_i = f_i - y_i = f_i^{old} + t_1 k_{i1} - y_i = E_i^{old} + t_1 k_{i1}

Platt's original SMO algorithm updates only those E_i (or f_i, respectively) with 0 < α_i < C, because these are the only parameters involved in the second choice heuristic [1]. When SMO enters the takeStep procedure, only these parameters are retrieved from the cache for the KKT check; all other values are re-computed. This trick is not as crucial for the simplified algorithm, since there is no second choice heuristic; it provides only a small acceleration.

2 SMO Regression without Bias Term

SMO regression works similarly to SMO classification. As the main difference, each data point requires two slack variables ξ_i and ξ_i^* measuring the distance above and below the ε-tube [2]. At most one of the two is different from zero. Consequently, the solution has 2N parameters, but at least half of them are zero. As in the classification case, the standard algorithm can be simplified if the use of implicit-bias kernel functions makes an explicit bias dispensable.
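Before turning to the details of regression, the classification pieces above (analytical update, KKT check with tolerance τ, cache update) can be combined into one compact sketch. All identifiers are illustrative, and refinements of a production implementation such as shrinking are omitted:

```python
import numpy as np

def smo_classify_no_bias(K, y, C, tau=1e-3, max_passes=100):
    """Simplified SMO without bias term, using an error cache E_i = f_i - y_i."""
    N = len(y)
    alpha = np.zeros(N)
    E = -y.astype(float)              # f_i = 0 initially, hence E_i = -y_i
    for _ in range(max_passes):
        changed = False
        for i in range(N):            # first choice heuristic: sweep over all samples
            yE = y[i] * E[i]
            if (alpha[i] < C and yE < -tau) or (alpha[i] > 0 and yE > tau):
                a_new = min(max(alpha[i] - y[i] * E[i] / K[i, i], 0.0), C)
                t = (a_new - alpha[i]) * y[i]   # t_1 = (alpha_new - alpha_old) * y_1
                alpha[i] = a_new
                E += t * K[:, i]                # cache update E_i <- E_i + t_1 k_i1
                changed = True
        if not changed:               # all KKT conditions fulfilled within tau
            break
    return alpha
```

On a separable two-point toy problem the resulting multipliers satisfy y_i f_i ≈ 1 for all unbounded support vectors, as required by Case 2 of the KKT analysis.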
2.1 The Minimization Problem

Also for regression, an implicit bias eliminates the summation constraint of the dual problem, which is now given by

    L = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k_{ij} + \varepsilon \sum_{i=1}^{N} (\alpha_i + \alpha_i^*) - \sum_{i=1}^{N} y_i (\alpha_i - \alpha_i^*) \to \min_{\alpha, \alpha^*}

    s.t.  0 \le \alpha_i, \alpha_i^* \le C  for i = 1, ..., N

with the dual variables α_i and α_i^*, for which α_i α_i^* = 0 holds as well. y_i are the desired output values, and k_ij = k(x_i, x_j) are the kernel function values. The primal slack variables ξ_i and ξ_i^* appear only in the KKT conditions, not in the dual problem.

The simplified algorithm chooses a sample i by Platt's first choice heuristic and then performs a joint optimization of α_i and α_i^*. An algorithm based on the original SMO method would have to optimize four parameters in one step to obey the summation constraint, as shown by Schölkopf and Smola [2].

2.2 The Analytical Solution

In contrast to classification (where α_i^* does not exist), L is minimized with respect to both α_i and α_i^* in each step. Again, without loss of generality, we assume that α_1 and α_1^* are the variables to be optimized, and separate them in L:

    L = \frac{1}{2} k_{11} (\alpha_1 - \alpha_1^*)^2 + (\alpha_1 - \alpha_1^*) \sum_{j=2}^{N} (\alpha_j - \alpha_j^*) k_{1j} + \varepsilon (\alpha_1 + \alpha_1^*) - y_1 (\alpha_1 - \alpha_1^*) + A,

where

    A = \frac{1}{2} \sum_{i=2}^{N} \sum_{j=2}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) k_{ij} + \varepsilon \sum_{i=2}^{N} (\alpha_i + \alpha_i^*) - \sum_{i=2}^{N} y_i (\alpha_i - \alpha_i^*)

collects all terms that do not depend on α_1 or α_1^*, and the two symmetric cross sums have again been combined using k_ij = k_ji. With α_i = α_i^old and α_i^* = α_i^{*,old} for i = 2, ..., N
and the cached SVM output values

    f_i = \sum_{j=1}^{N} (\alpha_j^{old} - \alpha_j^{*,old}) k_{ij}

follows

    \sum_{j=2}^{N} (\alpha_j - \alpha_j^*) k_{1j} = f_1 - (\alpha_1^{old} - \alpha_1^{*,old}) k_{11}

and therefore

    L = \frac{1}{2} k_{11} (\alpha_1 - \alpha_1^*)^2 + \big( f_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) - y_1 \big) (\alpha_1 - \alpha_1^*) + \varepsilon (\alpha_1 + \alpha_1^*) + \mathrm{const}

      = \frac{1}{2} k_{11} \alpha_1^2 + \big( f_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) - y_1 + \varepsilon \big) \alpha_1
        + \frac{1}{2} k_{11} \alpha_1^{*2} - \big( f_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) - y_1 - \varepsilon \big) \alpha_1^* + \mathrm{const}.

This result is achieved because only one of the variables α_1 and α_1^* can be different from zero, which implies α_1 α_1^* = 0 and hence (α_1 - α_1^*)^2 = α_1^2 + α_1^{*2}. Again, as we usually want to employ an error cache instead of a function cache, with E_1 = f_1 - y_1, L reads

    L = \frac{1}{2} k_{11} \alpha_1^2 + \big( E_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) + \varepsilon \big) \alpha_1 + \frac{1}{2} k_{11} \alpha_1^{*2} - \big( E_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) - \varepsilon \big) \alpha_1^* + \mathrm{const}.

The solution of the unconstrained problem is found by evaluating the conditions ∂L/∂α_1 = 0 and ∂L/∂α_1^* = 0, respectively. Since there are no cross terms, α_1 and α_1^* can be determined independently:

    \frac{\partial L}{\partial \alpha_1} = k_{11} \alpha_1 + E_1 - k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) + \varepsilon \overset{!}{=} 0
    \implies \alpha_1 = (\alpha_1^{old} - \alpha_1^{*,old}) - \frac{E_1 + \varepsilon}{k_{11}}

α_1^* is computed in the same way:

    \frac{\partial L}{\partial \alpha_1^*} = k_{11} \alpha_1^* - E_1 + k_{11} (\alpha_1^{old} - \alpha_1^{*,old}) + \varepsilon \overset{!}{=} 0
    \implies \alpha_1^* = -(\alpha_1^{old} - \alpha_1^{*,old}) + \frac{E_1 - \varepsilon}{k_{11}} = -\alpha_1 - \frac{2\varepsilon}{k_{11}}

As mentioned above, L does not contain any cross terms, which implies that the minimum of the constrained optimization problem can again be found by clipping α_1 and α_1^* to the interval [0, C]:

    \alpha_1^{new} = \min \{ \max \{ \alpha_1, 0 \}, C \}
    \alpha_1^{*,new} = \min \{ \max \{ \alpha_1^*, 0 \}, C \}
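A minimal sketch of this joint update (the function name is illustrative; as before, the division by k_11 is made explicit and equals 1 for Gaussian kernels):

```python
def update_pair(a_old, a_star_old, E1, k11, eps, C):
    """Jointly update (alpha_1, alpha_1*) as derived above; both are clipped to [0, C]."""
    d_old = a_old - a_star_old              # old difference alpha_1 - alpha_1*
    a = d_old - (E1 + eps) / k11            # unconstrained minimizer for alpha_1
    a_star = -a - 2.0 * eps / k11           # unconstrained minimizer for alpha_1*
    a = min(max(a, 0.0), C)
    a_star = min(max(a_star, 0.0), C)
    return a, a_star
```

Since the unconstrained minimizers satisfy α_1^* = -α_1 - 2ε/k_11, at most one of them is positive, so the clipping automatically preserves α_1 α_1^* = 0.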
2.3 Checking the KKT Conditions

As in the classification case, the Karush-Kuhn-Tucker conditions are exploited for an optimality check. Here we have to consider the conditions for α_i and for α_i^*:

    (\varepsilon + \xi_i + f_i - y_i) \alpha_i = 0        (C - \alpha_i) \xi_i = 0
    (\varepsilon + \xi_i^* - f_i + y_i) \alpha_i^* = 0    (C - \alpha_i^*) \xi_i^* = 0

Starting with α_i, we again have to consider three different ways to fulfill the KKT conditions in the optimum:

Case 1: α_i = 0:
    C - \alpha_i = C > 0 \implies \xi_i = 0 \implies \varepsilon + f_i - y_i \ge 0
The sign in the last step follows from the constraints of the primal problem.

Case 2: 0 < α_i < C:
    C - \alpha_i > 0 \implies \xi_i = 0; \quad \alpha_i > 0 \implies \varepsilon + f_i - y_i = 0

Case 3: α_i = C:
    C - \alpha_i = 0 \implies \xi_i \ge 0; \quad \alpha_i = C > 0 \implies \varepsilon + f_i - y_i = -\xi_i \le 0
The sign in the first step follows from the constraints of the primal problem.

Using (as usual) the error values E_i = f_i - y_i, we can summarize the conditions for an optimal α_i by

    \alpha_i = 0 \implies \varepsilon + E_i \ge 0   or
    0 < \alpha_i < C \implies \varepsilon + E_i = 0   or
    \alpha_i = C \implies \varepsilon + E_i \le 0.

In the same way we find the conditions for α_i^* as

    \alpha_i^* = 0 \implies \varepsilon - E_i \ge 0   or
    0 < \alpha_i^* < C \implies \varepsilon - E_i = 0   or
    \alpha_i^* = C \implies \varepsilon - E_i \le 0.
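These paired conditions translate into a direct violation test on the error cache. A sketch with a small numerical tolerance tau added for the floating-point comparison (the name and default value are assumptions here):

```python
def violates_kkt(a, a_star, E, eps, C, tau=1e-3):
    """True if the pair (alpha_i, alpha_i*) is sub-optimal by more than tau."""
    return ((a < C and eps + E < -tau) or (a > 0 and eps + E > tau)
            or (a_star < C and eps - E < -tau) or (a_star > 0 and eps - E > tau))
```

A point strictly inside the ε-tube (|E_i| < ε) with α_i = α_i^* = 0 violates nothing, whereas an error outside the tube triggers an update of the corresponding variable.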
As α_i and α_i^* are jointly optimized, the solution is sub-optimal with respect to the pair if

    \alpha_i < C \wedge \varepsilon + E_i < 0   or   \alpha_i > 0 \wedge \varepsilon + E_i > 0   or
    \alpha_i^* < C \wedge \varepsilon - E_i < 0   or   \alpha_i^* > 0 \wedge \varepsilon - E_i > 0

SMO assumes α_i and α_i^* to be optimal if the KKT conditions are fulfilled with some precision τ. In contrast to classification tasks (where the output values are ±1), a fixed τ is not suitable for regression problems. It should rather depend on the magnitude of the output values, e.g.,

    \tau = \max \{ |y_1|, ..., |y_N| \} \cdot 10^{-3}.

SMO performs a joint update of α_i and α_i^* if

    \alpha_i < C \wedge \varepsilon + E_i < -\tau   or   \alpha_i > 0 \wedge \varepsilon + E_i > \tau   or
    \alpha_i^* < C \wedge \varepsilon - E_i < -\tau   or   \alpha_i^* > 0 \wedge \varepsilon - E_i > \tau

2.4 Updating the Cache

Also for regression problems, a cache should be used for quickly checking the KKT conditions. The update of the function values works in the following way:

    f_i = \sum_{j=1}^{N} (\alpha_j - \alpha_j^*) k_{ij} = (\alpha_1 - \alpha_1^*) k_{i1} + \sum_{j=2}^{N} (\alpha_j^{old} - \alpha_j^{*,old}) k_{ij}
        = (\alpha_1 - \alpha_1^*) k_{i1} + f_i^{old} - (\alpha_1^{old} - \alpha_1^{*,old}) k_{i1}
        = f_i^{old} + \big( (\alpha_1 - \alpha_1^*) - (\alpha_1^{old} - \alpha_1^{*,old}) \big) k_{i1}

With the auxiliary variable t_1 = (α_1 - α_1^*) - (α_1^old - α_1^{*,old}), which is independent of i, the update equation can be written as

    f_i = f_i^{old} + t_1 k_{i1}   for i = 1, ..., N,

or, if an error cache is employed, as

    E_i = E_i^{old} + t_1 k_{i1}   for i = 1, ..., N.

References

[1] John C. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In: Advances in Kernel Methods: Support Vector Learning. B. Schölkopf, C. Burges, and A. Smola, eds., MIT Press, 1999.

[2] Bernhard Schölkopf and Alexander Smola: Learning with Kernels: Support Vector Machines, Optimization, and Beyond. MIT Press, 2002.