Sequential Minimal Optimization (SMO)


Anissa Thomas

1 Data Science and Machine Intelligence Lab, National Chiao Tung University, May 2017

2 The SMO algorithm was proposed by John C. Platt in 1998 and became one of the fastest quadratic programming optimization algorithms for SVM training, especially for linear SVMs and sparse data. One of the best references on SMO is "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines" by John C. Platt.

3 Sequential: not parallel; the Lagrange multipliers are optimized in sets of 2.
Minimal: the smallest possible sub-problem is optimized at each step.
Optimization: the constraints are satisfied for the chosen pair of Lagrange multipliers.

4 The algorithm is derived by taking the idea of the decomposition method to its extreme and optimizing a minimal subset of just two points at each iteration. The power of this technique lies in the fact that the optimization problem for two data points admits an analytical solution, eliminating the need for an iterative quadratic programming optimizer as part of the algorithm. The requirement that the condition $\sum_{i=1}^{l} y_i \alpha_i = 0$ be enforced throughout the iterations implies that the smallest number of multipliers that can be optimized at each step is 2: whenever one multiplier is updated, at least one other multiplier must be adjusted to keep the condition true.

5 At each step SMO chooses two elements $\alpha_i$ and $\alpha_j$ to jointly optimize, finds the optimal values for those two parameters given that all the others are fixed, and updates the $\alpha$ vector accordingly. The choice of the two points is determined by a heuristic, while the optimization of the two multipliers is performed analytically. Although SMO needs more iterations to converge, each iteration uses so few operations that the algorithm exhibits an overall speed-up of some orders of magnitude.

6 Besides convergence time, other important features of the algorithm are that it does not need to store the kernel matrix in memory, since no matrix operations are involved, that it does not rely on other software packages, and that it is fairly easy to implement. Since standard SMO does not cache the kernel matrix, adding a kernel cache could yield a further speed-up, at the expense of increased space complexity.

7 1-norm Soft Margin: Primal Form
$$\min_{w,b,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \tag{1}$$
subject to
$$y_i (w \cdot x_i + b) - 1 + \xi_i \ge 0 \tag{2}$$
$$\xi_i \ge 0, \quad i = 1, 2, \dots, l \tag{3}$$

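8 1-norm Soft Margin: Dual Form
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{4}$$
subject to
$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, 2, \dots, l$$
where $K(x_i, x_j) \equiv x_i \cdot x_j$.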
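The dual form can be evaluated directly for a candidate $\alpha$. The following is a minimal sketch, assuming NumPy and a linear kernel; the function names `linear_kernel_matrix`, `dual_objective`, and `is_feasible` are illustrative and do not come from the slides.

```python
import numpy as np

def linear_kernel_matrix(X):
    """Gram matrix K[i, j] = x_i . x_j for the rows of X."""
    return X @ X.T

def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C (up to tol)."""
    equality_ok = abs(alpha @ y) <= tol
    box_ok = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return bool(equality_ok and box_ok)

# Two orthogonal points with opposite labels (toy data, chosen for illustration).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])

K = linear_kernel_matrix(X)          # identity matrix for this data
w_val = dual_objective(alpha, y, K)  # 1.0 - 0.5 * 0.5 = 0.75
feasible = is_feasible(alpha, y, C=1.0)
```

Here $K$ is the identity, so $W(\alpha) = 1 - \frac{1}{2}(0.25 + 0.25) = 0.75$, and the point is feasible because $0.5 \cdot 1 + 0.5 \cdot (-1) = 0$.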

9 Without loss of generality, assume that the two elements chosen for updating to improve the objective value are $\alpha_1$ and $\alpha_2$. To compute their new values, observe that, in order not to violate the linear constraint $\sum_{i=1}^{l} \alpha_i y_i = 0$, the new values of the multipliers must lie on the line
$$y_1 \alpha_1^{(old)} + y_2 \alpha_2^{(old)} = \text{constant} = y_1 \alpha_1 + y_2 \alpha_2 \tag{5}$$
in $(\alpha_1, \alpha_2)$ space, and inside the box defined by $0 \le \alpha_1, \alpha_2 \le C$, as shown in the following figure:

10 Without loss of generality, the algorithm first computes $\alpha_2^{(new)}$ and then uses it to obtain $\alpha_1^{(new)}$. The box constraint $0 \le \alpha_1, \alpha_2 \le C$, together with the linear equality constraint, provides a more restrictive constraint on the feasible values for $\alpha_2^{(new)}$:
$$U \le \alpha_2^{(new)} \le V, \tag{6}$$
where $U$ and $V$ are defined as follows:

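11 The bounds for $\alpha_2^{(new)}$
If $y_1 \ne y_2$:
$$U = \max\{0, \alpha_2^{(old)} - \alpha_1^{(old)}\}, \qquad V = \min\{C, C - \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{7}$$
If $y_1 = y_2$:
$$U = \max\{0, \alpha_1^{(old)} + \alpha_2^{(old)} - C\}, \qquad V = \min\{C, \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{8}$$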
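The two cases for the bounds translate into a few lines of code. This is a sketch of equations (7) and (8) only; the function name `clip_bounds` and the numeric values are made up for illustration.

```python
def clip_bounds(alpha1_old, alpha2_old, y1, y2, C):
    """Return (U, V), the feasible interval for the new alpha_2."""
    if y1 != y2:
        # Equation (7): the constraint line is alpha_1 - alpha_2 = constant.
        U = max(0.0, alpha2_old - alpha1_old)
        V = min(C, C - alpha1_old + alpha2_old)
    else:
        # Equation (8): the constraint line is alpha_1 + alpha_2 = constant.
        U = max(0.0, alpha1_old + alpha2_old - C)
        V = min(C, alpha1_old + alpha2_old)
    return U, V

# Illustrative values: C = 1, alpha_1 = 0.75, alpha_2 = 0.5.
same_label = clip_bounds(0.75, 0.5, 1, 1, C=1.0)   # (0.25, 1.0)
diff_label = clip_bounds(0.75, 0.5, 1, -1, C=1.0)  # (0.0, 0.75)
```

With equal labels the sum $\alpha_1 + \alpha_2 = 1.25$ exceeds $C$, so the lower bound is raised to $0.25$; with different labels the upper bound is lowered to $0.75$.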

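12 Theorem. The maximum of the objective function for the optimization problem
$$\max_{\alpha} \; W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to
$$\sum_{i=1}^{l} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, 2, \dots, l,$$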

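13 when only $\alpha_1$ and $\alpha_2$ are allowed to change, is achieved by first computing the quantity
$$\alpha_2^{(new,unc)} = \alpha_2^{(old)} + \frac{y_2 \left( E_1^{(old)} - E_2^{(old)} \right)}{\kappa} \tag{9}$$
where $\kappa \equiv K_{11} + K_{22} - 2K_{12}$ and
$$E_i \equiv f(x_i) - y_i = \left( \sum_{j=1}^{l} \alpha_j y_j K_{ji} + b \right) - y_i, \quad i = 1, 2, \tag{10}$$
and then clipping it to enforce the constraint $U \le \alpha_2^{(new)} \le V$:
$$\alpha_2^{(new)} = \begin{cases} V, & \text{if } \alpha_2^{(new,unc)} > V \\ \alpha_2^{(new,unc)}, & \text{if } U \le \alpha_2^{(new,unc)} \le V \\ U, & \text{if } \alpha_2^{(new,unc)} < U \end{cases} \tag{11}$$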

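14 where $U$ and $V$ are defined by
if $y_1 \ne y_2$:
$$U = \max\{0, \alpha_2^{(old)} - \alpha_1^{(old)}\}, \qquad V = \min\{C, C - \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{12}$$
if $y_1 = y_2$:
$$U = \max\{0, \alpha_1^{(old)} + \alpha_2^{(old)} - C\}, \qquad V = \min\{C, \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{13}$$
and the value of $\alpha_1^{(new)}$ is obtained from $\alpha_2^{(new)}$ as
$$\alpha_1^{(new)} = \alpha_1^{(old)} + y_1 y_2 \left( \alpha_2^{(old)} - \alpha_2^{(new)} \right) \tag{14}$$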
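The theorem's update can be sketched as a single function. This is an illustrative implementation under stated assumptions, not Platt's reference code: it assumes NumPy and a precomputed kernel matrix `K`, the name `smo_pair_update` is hypothetical, and skipping the step when $\kappa$ is (numerically) zero is a choice of this sketch rather than something the slides specify.

```python
import numpy as np

def smo_pair_update(alpha, y, K, b, i, j, C, eps=1e-12):
    """One joint update of alpha[i] (as alpha_1) and alpha[j] (as alpha_2).

    Returns a new array; the input alpha is left unchanged.
    """
    kappa = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if kappa <= eps:
        return alpha.copy()          # degenerate pair: skipped in this sketch

    f = K @ (alpha * y) + b          # current outputs f(x_k) for all points
    E_i, E_j = f[i] - y[i], f[j] - y[j]

    # Feasible interval [U, V] for the new alpha_2, equations (12) and (13).
    if y[i] != y[j]:
        U = max(0.0, alpha[j] - alpha[i])
        V = min(C, C - alpha[i] + alpha[j])
    else:
        U = max(0.0, alpha[i] + alpha[j] - C)
        V = min(C, alpha[i] + alpha[j])

    a_j_unc = alpha[j] + y[j] * (E_i - E_j) / kappa       # equation (9)
    a_j = min(V, max(U, a_j_unc))                         # equation (11)
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)       # equation (14)

    new_alpha = alpha.copy()
    new_alpha[i], new_alpha[j] = a_i, a_j
    return new_alpha

# Toy problem: x_1 = (1, 0) with label +1, x_2 = (-1, 0) with label -1.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])
K = X @ X.T
alpha0 = np.zeros(2)

loose = smo_pair_update(alpha0, y, K, b=0.0, i=0, j=1, C=10.0)
tight = smo_pair_update(alpha0, y, K, b=0.0, i=0, j=1, C=0.25)
```

For this toy problem $E_1 = -1$, $E_2 = 1$ and $\kappa = 4$, so the unclipped value is $0 + (-1)(-2)/4 = 0.5$. With $C = 10$ no clipping occurs and both multipliers become $0.5$; with $C = 0.25$ the value is clipped to $V = 0.25$.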

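15 Proof: For simplicity of notation, let us define the following symbols:
$$K(x_i, x_j) \equiv K_{ij} \tag{15}$$
$$f(x_i) \equiv \sum_{j=1}^{l} \alpha_j y_j K_{ji} + b \tag{16}$$
$$v_i \equiv \sum_{j=3}^{l} \alpha_j y_j K_{ij} = f(x_i) - \sum_{j=1}^{2} \alpha_j y_j K_{ij} - b, \quad i = 1, 2 \tag{17}$$
$$E_i \equiv f(x_i) - y_i = \left( \sum_{j=1}^{l} \alpha_j y_j K_{ji} + b \right) - y_i, \quad i = 1, 2 \tag{18}$$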

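16 Consider the objective as a function of $\alpha_1$ and $\alpha_2$:
$$\begin{aligned} W(\alpha_1, \alpha_2) = {} & \alpha_1 + \alpha_2 - \frac{1}{2} K_{11} \alpha_1^2 - \frac{1}{2} K_{22} \alpha_2^2 - \alpha_1 \alpha_2 y_1 y_2 K_{12} \\ & - y_1 \alpha_1 \sum_{j=3}^{l} y_j \alpha_j K_{1j} - y_2 \alpha_2 \sum_{j=3}^{l} y_j \alpha_j K_{2j} \\ & + \sum_{i=3}^{l} \alpha_i - \frac{1}{2} \sum_{i=3}^{l} \sum_{j=3}^{l} \alpha_i \alpha_j y_i y_j K_{ij} \end{aligned} \tag{19}$$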

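17 Substituting (16), (17), and (18) into (19) yields
$$W(\alpha_1, \alpha_2) = \alpha_1 + \alpha_2 - \frac{1}{2} K_{11} \alpha_1^2 - \frac{1}{2} K_{22} \alpha_2^2 - \alpha_1 \alpha_2 y_1 y_2 K_{12} - y_1 \alpha_1 v_1 - y_2 \alpha_2 v_2 + \text{constant} \tag{20}$$
Note also that the condition $\sum_{i=1}^{l} \alpha_i^{(old)} y_i = \sum_{i=1}^{l} \alpha_i y_i = 0$ implies
$$\alpha_1 + s \alpha_2 = \text{constant} = \alpha_1^{(old)} + s \alpha_2^{(old)} = \gamma \tag{21}$$
where $s = y_1 y_2$. The above equation shows how $\alpha_1^{(new)}$ is computed from $\alpha_2^{(new)}$:
$$\alpha_1 = \gamma - s \alpha_2 \tag{22}$$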

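18 Eliminating $\alpha_1$ in (20), and using $s^2 = 1$ and $y_1 s = y_2$, we obtain the objective as a function of $\alpha_2$ alone:
$$\begin{aligned} W(\alpha_2) = {} & \gamma - s \alpha_2 + \alpha_2 - \frac{1}{2} K_{11} (\gamma - s \alpha_2)^2 - \frac{1}{2} K_{22} \alpha_2^2 - s K_{12} (\gamma - s \alpha_2) \alpha_2 \\ & - y_1 (\gamma - s \alpha_2) v_1 - y_2 \alpha_2 v_2 + \text{constant} \\ = {} & -\frac{1}{2} K_{11} \left( \gamma^2 - 2 \gamma s \alpha_2 + \alpha_2^2 \right) - \frac{1}{2} K_{22} \alpha_2^2 + s^2 K_{12} \alpha_2^2 \\ & + (1 - s - s K_{12} \gamma) \alpha_2 + y_2 v_1 \alpha_2 - y_2 v_2 \alpha_2 + \text{constant} \\ = {} & \frac{1}{2} (2 K_{12} - K_{11} - K_{22}) \alpha_2^2 + \left( 1 - s + K_{11} s \gamma - K_{12} s \gamma + y_2 v_1 - y_2 v_2 \right) \alpha_2 + \text{constant} \end{aligned} \tag{23}$$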

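19 The stationary point satisfies
$$\frac{dW(\alpha_2)}{d\alpha_2} = (2 K_{12} - K_{11} - K_{22}) \alpha_2 + \left( 1 - s + K_{11} s \gamma - K_{12} s \gamma + y_2 v_1 - y_2 v_2 \right) = 0 \tag{24}$$
This yields
$$\begin{aligned} \alpha_2^{(new,unc)} (K_{11} + K_{22} - 2 K_{12}) & = 1 - s + K_{11} s \gamma - K_{12} s \gamma + y_2 v_1 - y_2 v_2 \\ & = 1 - s + (K_{11} - K_{12}) s \gamma + y_2 (v_1 - v_2) \end{aligned} \tag{25}$$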

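20 Multiplying (25) by $y_2$, with $\kappa \equiv K_{11} + K_{22} - 2 K_{12}$, it is easy to see that
$$\begin{aligned} \alpha_2^{(new,unc)} \kappa y_2 & = y_2 - y_1 + (K_{11} - K_{12}) y_1 \gamma + v_1 - v_2 \\ & = y_2 - y_1 + (K_{11} - K_{12}) y_1 \gamma + f(x_1) - f(x_2) - \sum_{j=1}^{2} y_j \alpha_j K_{1j} + \sum_{j=1}^{2} y_j \alpha_j K_{2j} \end{aligned} \tag{26}$$
and
$$y_1 \gamma = y_1 (\alpha_1 + s \alpha_2) = y_1 \alpha_1 + y_2 \alpha_2 \tag{27}$$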

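21 Since
$$\sum_{j=1}^{2} y_j \alpha_j K_{2j} - \sum_{j=1}^{2} y_j \alpha_j K_{1j} = y_1 \alpha_1 K_{21} + y_2 \alpha_2 K_{22} - y_1 \alpha_1 K_{11} - y_2 \alpha_2 K_{12} \tag{28}$$
substituting (27) and (28) into (26), we find
$$\begin{aligned} \alpha_2^{(new,unc)} \kappa y_2 & = y_2 - y_1 + (K_{11} - K_{12}) (\alpha_1 y_1 + \alpha_2 y_2) + y_1 \alpha_1 K_{21} + y_2 \alpha_2 K_{22} - y_1 \alpha_1 K_{11} - y_2 \alpha_2 K_{12} + f(x_1) - f(x_2) \\ & = y_2 - y_1 + f(x_1) - f(x_2) + y_2 \alpha_2 K_{11} - y_2 \alpha_2 K_{12} + y_2 \alpha_2 K_{22} - y_2 \alpha_2 K_{12} \\ & = y_2 \alpha_2 \kappa + (f(x_1) - y_1) - (f(x_2) - y_2) \end{aligned} \tag{29}$$
where $\alpha_1$, $\alpha_2$, and $f$ on the right-hand side are evaluated at the old values.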

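22 Dividing (29) by $\kappa y_2$ and using $1 / y_2 = y_2$, we have
$$\alpha_2^{(new,unc)} = \alpha_2^{(old)} + \frac{y_2 \left( E_1^{(old)} - E_2^{(old)} \right)}{\kappa}$$
where
$$E_i \equiv f(x_i) - y_i = \left( \sum_{j=1}^{l} \alpha_j y_j K_{ji} + b \right) - y_i, \quad i = 1, 2.$$
Finally, we must clip $\alpha_2^{(new,unc)}$ if necessary to ensure that it remains in the interval $[U, V]$.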
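The closed-form result can be checked numerically. The sketch below assumes NumPy and randomly generated toy data (none of it from the slides). It evaluates $W$ along the constraint line $\alpha_1 = \gamma - s \alpha_2$, fits a parabola to those values, and compares the vertex of the parabola with the analytic $\alpha_2^{(new,unc)}$. The bias $b$ is omitted because it cancels in $E_1 - E_2$, and the random starting $\alpha$ is not required to satisfy the equality constraint, since the check only concerns movement along the line.

```python
import numpy as np

rng = np.random.default_rng(0)
l = 6
X = rng.normal(size=(l, 2))                      # toy inputs
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # toy labels
K = X @ X.T                                      # linear kernel matrix
alpha = rng.uniform(0.0, 1.0, size=l)            # arbitrary starting multipliers

def W(a):
    """Dual objective, equation (4)."""
    v = a * y
    return a.sum() - 0.5 * v @ K @ v

s = y[0] * y[1]
gamma = alpha[0] + s * alpha[1]

def W_on_line(a2):
    """W with alpha_2 = a2 and alpha_1 = gamma - s * a2, all others fixed."""
    a = alpha.copy()
    a[1] = a2
    a[0] = gamma - s * a2
    return W(a)

# Analytic unconstrained maximizer from the derivation.
E = K @ (alpha * y) - y                          # E_i without b (b cancels below)
kappa = K[0, 0] + K[1, 1] - 2.0 * K[0, 1]
a2_analytic = alpha[1] + y[1] * (E[0] - E[1]) / kappa

# Numeric check: W_on_line is exactly quadratic, so a degree-2 fit recovers it.
grid = np.linspace(-2.0, 2.0, 9)
c2, c1, c0 = np.polyfit(grid, [W_on_line(t) for t in grid], 2)
a2_numeric = -c1 / (2.0 * c2)                    # vertex of the fitted parabola
```

The fitted leading coefficient `c2` should equal $-\kappa / 2$, which matches the coefficient of $\alpha_2^2$ in (23).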

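23 Discussion: If $s = y_1 y_2 = 1$ (that is, $y_1 = y_2$), then $\alpha_1 + \alpha_2 = \gamma$.
If $\gamma > C$, then $\max \alpha_2 = C = V$ and $\min \alpha_2 = \gamma - C = \alpha_1 + \alpha_2 - C = U$; see panel (a) of the figure.
If $\gamma < C$, then $\max \alpha_2 = \gamma = \alpha_1 + \alpha_2 = V$ and $\min \alpha_2 = 0 = U$; see panel (b) of the figure.
[Figure: (a) $\gamma > C$; (b) $\gamma < C$. Caption: if $s = y_1 y_2 = 1$, then $\alpha_1 + \alpha_2 = \gamma$.]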

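24 If $s = y_1 y_2 = -1$ (that is, $y_1 \ne y_2$), then $\alpha_1 - \alpha_2 = \gamma$.
If $\gamma > 0$, then $\max \alpha_2 = C - \gamma = C - \alpha_1 + \alpha_2 = V$ and $\min \alpha_2 = 0 = U$; see panel (a) of the figure.
If $\gamma < 0$, then $\max \alpha_2 = C = V$ and $\min \alpha_2 = -\gamma = -\alpha_1 + \alpha_2 = U$; see panel (b) of the figure.
[Figure: (a) $\gamma > 0$; (b) $\gamma < 0$. Caption: if $s = y_1 y_2 = -1$, then $\alpha_1 - \alpha_2 = \gamma$.]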

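25 From the above discussion, $\alpha_2^{(new)}$ must lie in the following range, which is what the clipping step enforces:
$$\max \alpha_2 = V \tag{30}$$
$$\min \alpha_2 = U \tag{31}$$
where $U$ and $V$ are given by
if $y_1 \ne y_2$:
$$U = \max\{0, \alpha_2^{(old)} - \alpha_1^{(old)}\}, \qquad V = \min\{C, C - \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{32}$$
if $y_1 = y_2$:
$$U = \max\{0, \alpha_1^{(old)} + \alpha_2^{(old)} - C\}, \qquad V = \min\{C, \alpha_1^{(old)} + \alpha_2^{(old)}\} \tag{33}$$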

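26 and the value of $\alpha_1^{(new)}$ is obtained from $\alpha_2^{(new)}$ as
$$\alpha_1^{(new)} = \alpha_1^{(old)} + y_1 y_2 \left( \alpha_2^{(old)} - \alpha_2^{(new)} \right) \tag{34}$$
Here $\alpha_2^{(new)} = \alpha_2^{(new,clipped)}$ is obtained by clipping $\alpha_2^{(new,unc)}$ to enforce the constraint $U \le \alpha_2^{(new,clipped)} \le V$:
$$\alpha_2^{(new,clipped)} = \begin{cases} V, & \text{if } \alpha_2^{(new,unc)} > V \\ \alpha_2^{(new,unc)}, & \text{if } U \le \alpha_2^{(new,unc)} \le V \\ U, & \text{if } \alpha_2^{(new,unc)} < U \end{cases} \tag{35}$$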
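Repeating the clipped pair update gives a very small training loop. The sketch below is a simplified variant under loud assumptions: the second index is chosen at random rather than by Platt's selection heuristic, the bias $b$ is never computed (it cancels in $E_1 - E_2$, so the multipliers do not need it), the loop runs for a fixed number of sweeps instead of testing convergence, and the data and names (`update_pair`, `simplified_smo`) are invented for illustration.

```python
import numpy as np

def dual_objective(alpha, y, K):
    v = alpha * y
    return alpha.sum() - 0.5 * v @ K @ v

def update_pair(alpha, y, K, i, j, C):
    """In-place update of alpha[i] (as alpha_1) and alpha[j] (as alpha_2)."""
    kappa = K[i, i] + K[j, j] - 2.0 * K[i, j]
    if kappa <= 1e-12:               # skip degenerate pairs (choice of this sketch)
        return
    f = K @ (alpha * y)              # bias omitted: it cancels in E_i - E_j
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    if y[i] != y[j]:
        U, V = max(0.0, alpha[j] - alpha[i]), min(C, C - alpha[i] + alpha[j])
    else:
        U, V = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    a_j = min(V, max(U, alpha[j] + y[j] * (E_i - E_j) / kappa))  # (35)
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j)                   # (34)
    alpha[j] = a_j

def simplified_smo(X, y, C=1.0, sweeps=50, seed=0):
    """Sweep over i, pair each i with a random j != i, record W after each sweep."""
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                      # linear kernel
    alpha = np.zeros(n)
    history = [dual_objective(alpha, y, K)]
    for _ in range(sweeps):
        for i in range(n):
            j = int(rng.integers(n - 1))
            if j >= i:
                j += 1               # shift so that j is never equal to i
            update_pair(alpha, y, K, i, j, C)
        history.append(dual_objective(alpha, y, K))
    return alpha, history

X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [-2.0, -2.0], [-1.0, -2.5], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha, history = simplified_smo(X, y, C=1.0)
```

Because every step maximizes $W$ over a feasible segment that contains the current point, the recorded objective values should never decrease, and $\sum_i \alpha_i y_i = 0$ is preserved up to rounding.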

27 Figure: Two possible situations for the update of $\alpha_1$ and $\alpha_2$ in SMO, one for each of the cases $s = y_1 y_2 = \pm 1$ (panels (a) and (b)).

28 Remark.
$$E_i \equiv f(x_i) - y_i = \left( \sum_{j=1}^{l} \alpha_j y_j K_{ij} + b \right) - y_i, \quad i = 1, 2 \tag{36}$$
where $f(x)$ denotes the current hypothesis determined by the values of $\alpha$ and $b$ at a particular stage of learning, and $E_i$ is the difference between the function output and the target classification on the training point $x_1$ or $x_2$. Note that $E_i$ can be large even if a point is correctly classified. For example, if $y_1 = 1$ and the function output is $f(x_1) = 5$, the classification is correct, but $E_1 = 4$.

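29 Remark.
$$\frac{d^2 W(\alpha_2)}{d\alpha_2^2} = 2 K_{12} - K_{11} - K_{22} = -\kappa \le 0$$
Proof.
$$\begin{aligned} \kappa \equiv K_{11} + K_{22} - 2 K_{12} & = x_1 \cdot x_1 + x_2 \cdot x_2 - 2 \, x_1 \cdot x_2 \\ & = (x_2 - x_1) \cdot (x_2 - x_1) = \| x_2 - x_1 \|^2 \ge 0 \end{aligned} \tag{37}$$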
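The identity in the proof is easy to confirm for a concrete pair of points. This is a small check assuming NumPy and the linear kernel $K(x_i, x_j) = x_i \cdot x_j$; the two vectors are arbitrary illustrative values.

```python
import numpy as np

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

K11, K22, K12 = x1 @ x1, x2 @ x2, x1 @ x2   # 5, 10, 1
kappa = K11 + K22 - 2.0 * K12               # 5 + 10 - 2 = 13
squared_distance = np.sum((x2 - x1) ** 2)   # (2)^2 + (-3)^2 = 13
```

Both quantities equal 13, so the second derivative of $W$ along the constraint line is $-13$ for this pair, and the stationary point is a maximum.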

30 Reference: Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
