Solving the SVM Optimization Problem



Kernel-based Learning Methods
Christian Igel
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany
July 16, 2009

Outline
    Warm-up: Newton Step
    SVM Optimization: Primal and Dual
    Optimality Criteria
    Decomposition Algorithms
    Working Set Selection


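Warm-up: Recall Taylor expansion & Newton step
For an at least twice differentiable function $D : \mathbb{R}^n \to \mathbb{R}$:
$$D(\alpha + \lambda) = D(\alpha) + \left(\frac{\partial D(\alpha)}{\partial \alpha}\right)^T \lambda + \frac{1}{2} \lambda^T \frac{\partial^2 D(\alpha)}{\partial \alpha^2} \lambda + O(\|\lambda\|^3) \quad \text{as } \lambda \to 0$$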

Warm-up: Recall Taylor expansion & Newton step
For a convex quadratic function $D : \mathbb{R} \to \mathbb{R}$, the problem $\min_\lambda D(\alpha + \lambda)$ is solved by
$$\lambda^* = -\frac{\partial D(\alpha) / \partial \alpha}{\partial^2 D(\alpha) / \partial \alpha^2}$$
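The Newton step for a one-dimensional quadratic can be checked numerically. The following is a minimal sketch; the function `newton_step` and the example quadratic `D` are hypothetical and chosen only for illustration.

```python
def newton_step(first_derivative, second_derivative):
    """Return lambda* = -D'(alpha) / D''(alpha)."""
    if second_derivative == 0:
        raise ValueError("second derivative must be non-zero")
    return -first_derivative / second_derivative


# Hypothetical convex quadratic D(a) = 2 (a - 3)^2 + 1 with minimum at a = 3.
def D(a):
    return 2.0 * (a - 3.0) ** 2 + 1.0


def dD(a):  # first derivative of D
    return 4.0 * (a - 3.0)


def d2D(a):  # second derivative of D (constant for a quadratic)
    return 4.0


alpha = 0.5
step = newton_step(dD(alpha), d2D(alpha))
alpha_new = alpha + step  # one step reaches the minimizer of a quadratic
```

For a quadratic, the second-order Taylor expansion is exact, which is why a single step lands on the minimizer.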


1-Norm soft margin SVM: Primal
Training data $S = \{(x_1, y_1), \dots, (x_l, y_l)\}$ and kernel $k$.
1-Norm Soft Margin SVM primal optimization problem:
$$\min_{\xi, w, b} \; P(\xi, w, b) = \frac{1}{2} \langle w, w \rangle + C \sum_{i=1}^{l} \xi_i$$
subject to
$$y_i \left( \langle w, k(x_i, \cdot) \rangle + b \right) \ge 1 - \xi_i, \quad i = 1, \dots, l$$
$$\xi_i \ge 0, \quad i = 1, \dots, l$$
Topic today: How to implement an efficient algorithm that solves this problem?

1-Norm soft margin SVM: Dual
Training data $S = \{(x_1, y_1), \dots, (x_l, y_l)\}$ and kernel $k$.
1-Norm Soft Margin SVM dual optimization problem:
$$\max_{\alpha} \; D(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$$
subject to
$$\sum_{i=1}^{l} \alpha_i y_i = 0 \quad \text{and} \quad \forall i \in \{1, \dots, l\} : C \ge \alpha_i \ge 0$$
where the box constraint is equivalent to
$$y_i \alpha_i \in [a_i, b_i] = \begin{cases} [0, C] & \text{if } y_i = +1 \\ [-C, 0] & \text{if } y_i = -1 \end{cases}$$
Decision rule: $\operatorname{sign}(f(x))$ with
$$f(x) = \sum_{i=1}^{l} y_i \alpha_i k(x_i, x) + b$$
where $b$ is chosen so that $y_i f(x_i) = 1$ for any $i$ with $C > \alpha_i > 0$.
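To make the dual objective, its constraints, and the decision function concrete, here is a small sketch in plain Python. The helper names (`linear_kernel`, `dual_objective`, `is_feasible`, `decision_value`) and the two-point data set are assumptions for illustration and do not come from the slides.

```python
def linear_kernel(u, v):
    """Plain dot product, used here as an example kernel k."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y, kernel):
    """D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j)."""
    l = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C."""
    equality = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    box = all(-tol <= a <= C + tol for a in alpha)
    return equality and box


def decision_value(x, alpha, X, y, b, kernel):
    """f(x) = sum_i y_i alpha_i k(x_i, x) + b."""
    return sum(y[i] * alpha[i] * kernel(X[i], x) for i in range(len(alpha))) + b


# Hypothetical toy problem: two points on the real line, one per class.
X = [(1.0,), (-1.0,)]
y = [1, -1]
C = 10.0
alpha = [0.5, 0.5]  # optimal here: w = 1, margin boundaries at +1 and -1
b = 0.0             # gives y_i f(x_i) = 1 for both free support vectors

value = dual_objective(alpha, X, y, linear_kernel)
f_new = decision_value((2.0,), alpha, X, y, b, linear_kernel)
```

Computing the double sum directly costs $O(l^2)$ kernel evaluations, which is one reason practical solvers work with the gradient instead of the objective value.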

Restricting the quadratic program
1-Norm Soft Margin SVM dual optimization problem restricted to working set $B$:
$$\max_{\hat\alpha} \; D(\hat\alpha) = \sum_{i=1}^{l} \hat\alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \hat\alpha_i \hat\alpha_j y_i y_j k(x_i, x_j)$$
subject to
$$\sum_{i=1}^{l} \hat\alpha_i y_i = 0$$
$$\forall i \in \{1, \dots, l\} : C \ge \hat\alpha_i \ge 0$$
$$\forall i \notin B : \hat\alpha_i = \alpha_i$$

Two-dimensional subproblem
[Figure: the subproblem in the $(\hat\alpha_i, \hat\alpha_j)$ plane, both axes ranging from $0$ to $C$, with points labeled $\hat\alpha$ and $\hat g$]


Some definitions
Let's define the Gram matrix entry
$$K_{ij} = k(x_i, x_j)$$
the gradient of the dual
$$g_i = \frac{\partial D(\alpha)}{\partial \alpha_i} = 1 - y_i \sum_{j=1}^{l} y_j \alpha_j K_{ij}$$
and the index sets
$$I_{\text{up}} = \{ i \mid y_i \alpha_i < b_i \} \quad (y_i \alpha_i \text{ may increase})$$
$$I_{\text{down}} = \{ i \mid y_i \alpha_i > a_i \} \quad (y_i \alpha_i \text{ may decrease})$$
An SV $x_i$ is called bounded if $\alpha_i = C$; otherwise it is called free (and $i \in I_{\text{up}} \wedge i \in I_{\text{down}}$).
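These definitions translate almost line by line into code. The sketch below assumes the Gram matrix is given as a list of lists `K`; the function names and the four-point example are hypothetical.

```python
def dual_gradient(alpha, y, K):
    """g_i = 1 - y_i * sum_j y_j alpha_j K_ij for every i."""
    l = len(alpha)
    return [1.0 - y[i] * sum(y[j] * alpha[j] * K[i][j] for j in range(l))
            for i in range(l)]


def index_sets(alpha, y, C):
    """Return (I_up, I_down) as lists of indices."""
    I_up, I_down = [], []
    for i, (a_i, y_i) in enumerate(zip(alpha, y)):
        lower, upper = (0.0, C) if y_i > 0 else (-C, 0.0)  # the interval [a_i, b_i]
        if y_i * a_i < upper:   # y_i alpha_i may increase
            I_up.append(i)
        if y_i * a_i > lower:   # y_i alpha_i may decrease
            I_down.append(i)
    return I_up, I_down


# Hypothetical example: identity Gram matrix, feasible alpha (sum_i y_i alpha_i = 0).
y = [1, -1, 1, -1]
alpha = [0.0, 0.5, 1.0, 0.5]
C = 1.0
K = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

g = dual_gradient(alpha, y, K)
I_up, I_down = index_sets(alpha, y, C)
free = [i for i in I_up if i in I_down]  # indices with 0 < alpha_i < C
```

In this example index 0 sits at $\alpha_0 = 0$ and index 2 at $\alpha_2 = C$, so each belongs to only one of the two sets, while indices 1 and 3 are free and belong to both.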

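Optimality & stopping criterion is necessary
For $i \in I_{\text{up}}$ and $j \in I_{\text{down}}$ (w.l.o.g. $i < j$) we define $u_{ij}, \alpha^\epsilon \in \mathbb{R}^l$ by
$$\alpha^\epsilon = \alpha + \epsilon u_{ij} \quad \text{with} \quad u_{ij} = (0, \dots, y_i, 0, \dots, 0, -y_j, 0, \dots, 0)$$
for $\epsilon > 0$, i.e.,
$$\alpha^\epsilon_k = \alpha_k + \epsilon [u_{ij}]_k = \alpha_k + \begin{cases} +\epsilon y_k & \text{if } k = i \\ -\epsilon y_k & \text{if } k = j \\ 0 & \text{otherwise} \end{cases}$$
If $\alpha^*$ is optimal, we have
$$0 \ge D(\alpha^{*\epsilon}) - D(\alpha^*) = \epsilon (y_i g_i^* - y_j g_j^*) + O(\epsilon^2)$$
and thus the necessary optimality criterion
$$\max_{i \in I_{\text{up}}} y_i g_i^* - \min_{j \in I_{\text{down}}} y_j g_j^* \le 0$$
or
$$\exists r \in \mathbb{R} : \max_{i \in I_{\text{up}}} y_i g_i^* \le r \le \min_{j \in I_{\text{down}}} y_j g_j^*$$
or
$$\exists r \in \mathbb{R} \; \forall k : \begin{cases} \alpha_k^* = C & \text{if } g_k^* > y_k r \\ \alpha_k^* = 0 & \text{if } g_k^* < y_k r \end{cases}$$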

Optimality & stopping criterion is sufficient
Let's consider some feasible solution $\alpha^*$ of $D$ that satisfies the criterion for some $r$, and pick
$$w^* = \sum_{i=1}^{l} y_i \alpha_i^* k(x_i, \cdot), \quad b^* = r, \quad \xi_i^* = \max\{0, g_i^* - y_i r\}$$
We have
$$P(\xi^*, w^*, b^*) - D(\alpha^*) = C \sum_{i=1}^{l} \xi_i^* - \sum_{i=1}^{l} \alpha_i^* g_i^*$$
Noting that $C \xi_i^* - \alpha_i^* g_i^* = -y_i \alpha_i^* r$, we get
$$P(\xi^*, w^*, b^*) - D(\alpha^*) = \sum_{i=1}^{l} (C \xi_i^* - \alpha_i^* g_i^*) = -r \sum_{i=1}^{l} y_i \alpha_i^* = 0$$
Thus the duality gap vanishes and $\alpha^*$ is optimal.
$b^* = r$ is a good way to compute $b$.
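The vanishing duality gap can be verified numerically on a tiny problem. The sketch below is an illustration under stated assumptions: the function `duality_gap`, the two-point Gram matrix, and the choice $r = 0$ are all hypothetical.

```python
def duality_gap(alpha, y, K, C, r):
    """P(xi, w, b) - D(alpha) for w = sum_i y_i alpha_i k(x_i, .), b = r,
    and xi_i = max(0, g_i - y_i r)."""
    l = len(alpha)
    g = [1.0 - y[i] * sum(y[j] * alpha[j] * K[i][j] for j in range(l))
         for i in range(l)]
    xi = [max(0.0, g[i] - y[i] * r) for i in range(l)]
    # <w, w> equals the quadratic form of the dual in the kernel-induced space.
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(l) for j in range(l))
    primal = 0.5 * quad + C * sum(xi)
    dual = sum(alpha) - 0.5 * quad
    return primal - dual


# Hypothetical toy problem: x_1 = 1, x_2 = -1 with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
C = 10.0

gap_optimal = duality_gap([0.5, 0.5], y, K, C, r=0.0)  # criterion holds with r = 0
gap_start = duality_gap([0.0, 0.0], y, K, C, r=0.0)    # feasible but not optimal
```

At the starting point $\alpha = 0$ both slack variables equal 1, so the gap is $C \cdot 2 = 20$; at the optimum it is zero.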

How long does training an SVM take?
Intuitive bounds: assume an oracle tells us the unbounded SVs $F = \{x_i \mid 0 < \alpha_i < C\}$ and the bounded SVs. Then
    computing the $\alpha$s takes $O(|F|^3)$
    checking the optimality condition by computing the gradient from scratch takes $O(l \cdot \#\text{SV})$
SVM training scales between quadratically and cubically in the number of training points.


Decomposition algorithms
Strategy: Iteratively solve the dual optimization problem.
Decomposition Algorithm:
    $\alpha \leftarrow$ feasible starting point
    repeat
        select working set $B$
        solve QP restricted to $B$ resulting in $\hat\alpha$
        $\alpha \leftarrow \hat\alpha$
    until stopping criterion is met
$B = \{i, j\}$, $i < j$: Sequential Minimal Optimization (SMO)
Search directions are just $\pm(0, \dots, y_i, 0, \dots, 0, -y_j, 0, \dots, 0) = \pm u_{ij}$.

Decomposition and Sequential Minimal Optimization
repeat until target accuracy reached {
    select working set
    solve optimally
}
[Figure: optimization steps drawn along the coordinate axes $\alpha_i$, $\alpha_j$, $\alpha_k$]

Solving the two-dimensional subproblem
The Hessian of the dual problem has elements
$$\frac{\partial^2 D(\alpha)}{\partial \alpha_i \partial \alpha_j} = -y_i y_j K_{ij}$$
Maximizing w.r.t. $\lambda$ (ignoring box constraints)
$$D(\alpha + \lambda u_{ij}) - D(\alpha) = \lambda (y_i g_i - y_j g_j) - \frac{\lambda^2}{2} (K_{ii} + K_{jj} - 2K_{ij})$$
by Newton step gives the optimal $\lambda$:
$$\lambda^* = \frac{y_i g_i - y_j g_j}{K_{ii} + K_{jj} - 2K_{ij}}$$
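As a sanity check, the predicted change of the dual along $u_{ij}$ can be compared with a direct evaluation of $D$. This is a minimal sketch; the helper names and the 2x2 Gram matrix are hypothetical, and the code assumes strictly positive curvature $K_{ii} + K_{jj} - 2K_{ij}$.

```python
def dual(alpha, y, K):
    """D(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    l = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(l) for j in range(l))
    return sum(alpha) - 0.5 * quad


def unconstrained_step(i, j, y, g, K):
    """Newton step lambda* along u_ij, ignoring the box constraints."""
    curvature = K[i][i] + K[j][j] - 2.0 * K[i][j]
    if curvature <= 0.0:
        raise ValueError("this sketch assumes strictly positive curvature")
    return (y[i] * g[i] - y[j] * g[j]) / curvature


# Hypothetical two-point problem, starting from alpha = 0 where g = 1.
K = [[2.0, 0.5], [0.5, 1.0]]
y = [1, -1]
alpha = [0.0, 0.0]
g = [1.0, 1.0]
i, j = 0, 1

lam = unconstrained_step(i, j, y, g, K)
alpha_new = list(alpha)
alpha_new[i] += y[i] * lam   # move along u_ij = (y_i, -y_j)
alpha_new[j] -= y[j] * lam

curvature = K[i][i] + K[j][j] - 2.0 * K[i][j]
predicted = lam * (y[i] * g[i] - y[j] * g[j]) - 0.5 * lam ** 2 * curvature
actual = dual(alpha_new, y, K) - dual(alpha, y, K)
```

Because $D$ is quadratic, the predicted and the actual change agree exactly up to rounding.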

Recomputing gradient & stopping criterion
The gradient of the full problem can be adjusted after optimizing on $B$ by
$$\forall k \in \{1, \dots, l\} : \; g_k \leftarrow g_k - y_k \sum_{i \in B} y_i (\hat\alpha_i - \alpha_i) K_{ik}$$
The stopping criterion (which needs the gradient)
$$\max_{i \in I_{\text{up}}} y_i g_i - \min_{j \in I_{\text{down}}} y_j g_j \le 0$$
is in practice softened to
$$\max_{i \in I_{\text{up}}} y_i g_i - \min_{j \in I_{\text{down}}} y_j g_j \le \epsilon$$
for $\epsilon > 0$.
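The incremental gradient update and the softened stopping test might look as follows in code. This is a sketch under assumptions: the function names are hypothetical, the Gram matrix is a list of lists, and both index sets are assumed to be non-empty.

```python
def update_gradient(g, y, K, B, alpha_old, alpha_new):
    """Return g_k - y_k * sum_{i in B} y_i (alpha_new_i - alpha_old_i) K_ik."""
    return [g[k] - y[k] * sum(y[i] * (alpha_new[i] - alpha_old[i]) * K[i][k]
                              for i in B)
            for k in range(len(g))]


def kkt_violation(alpha, y, g, C):
    """max over I_up of y_i g_i minus min over I_down of y_j g_j."""
    l = len(alpha)
    up = [y[k] * g[k] for k in range(l)
          if y[k] * alpha[k] < (C if y[k] > 0 else 0.0)]
    down = [y[k] * g[k] for k in range(l)
            if y[k] * alpha[k] > (0.0 if y[k] > 0 else -C)]
    return max(up) - min(down)  # assumes both sets are non-empty


# Hypothetical toy problem: x_1 = 1, x_2 = -1 with a linear kernel.
K = [[1.0, -1.0], [-1.0, 1.0]]
y = [1, -1]
C = 10.0
alpha_old, g_old = [0.0, 0.0], [1.0, 1.0]
alpha_new = [0.5, 0.5]  # result of optimizing on B = {0, 1}

violation_before = kkt_violation(alpha_old, y, g_old, C)
g_new = update_gradient(g_old, y, K, [0, 1], alpha_old, alpha_new)
violation_after = kkt_violation(alpha_new, y, g_new, C)
converged = violation_after <= 1e-3  # softened criterion with epsilon = 1e-3
```

The update touches $|B|$ rows of the Gram matrix per iteration, which is much cheaper than recomputing the gradient from scratch.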

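Sequential minimal optimization
    $\alpha \leftarrow 0$, $g \leftarrow 1$
    repeat
        select indices $i \in I_{\text{up}}$ and $j \in I_{\text{down}}$
        $\lambda = \min \left\{ b_i - y_i \alpha_i, \; y_j \alpha_j - a_j, \; \frac{y_i g_i - y_j g_j}{K_{ii} + K_{jj} - 2K_{ij}} \right\}$
        $\forall k \in \{1, \dots, l\} : \; g_k \leftarrow g_k - \lambda y_k K_{ik} + \lambda y_k K_{jk}$
        $\alpha_i \leftarrow \alpha_i + y_i \lambda$
        $\alpha_j \leftarrow \alpha_j - y_j \lambda$
    until $\max_{i \in I_{\text{up}}} y_i g_i - \min_{j \in I_{\text{down}}} y_j g_j \le \epsilon$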
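The pseudocode can be turned into a short runnable sketch. Several details below are assumptions rather than part of the slide: the pair $(i, j)$ is chosen as the most violating pair, the stopping test is evaluated at the top of the loop, the curvature is floored at a small constant to avoid division by zero, and `max_iter` caps the number of iterations. The function name `smo` and the toy data are hypothetical.

```python
def smo(K, y, C, eps=1e-3, max_iter=10000):
    """Plain SMO on a precomputed Gram matrix K (list of lists)."""
    l = len(y)
    alpha = [0.0] * l                            # feasible starting point
    g = [1.0] * l                                # gradient of the dual at alpha = 0
    a = [0.0 if yi > 0 else -C for yi in y]      # lower bounds on y_i alpha_i
    b = [C if yi > 0 else 0.0 for yi in y]       # upper bounds on y_i alpha_i
    for _ in range(max_iter):
        I_up = [k for k in range(l) if y[k] * alpha[k] < b[k]]
        I_down = [k for k in range(l) if y[k] * alpha[k] > a[k]]
        if not I_up or not I_down:
            break
        i = max(I_up, key=lambda k: y[k] * g[k])     # assumed selection rule
        j = min(I_down, key=lambda k: y[k] * g[k])
        violation = y[i] * g[i] - y[j] * g[j]
        if violation <= eps:                         # softened stopping criterion
            break
        curvature = max(K[i][i] + K[j][j] - 2.0 * K[i][j], 1e-12)
        lam = min(b[i] - y[i] * alpha[i],            # clip to the box constraints
                  y[j] * alpha[j] - a[j],
                  violation / curvature)             # Newton step
        for k in range(l):
            g[k] += lam * y[k] * (K[j][k] - K[i][k])
        alpha[i] += y[i] * lam
        alpha[j] -= y[j] * lam
    return alpha, g


# Hypothetical 1-D data with a linear kernel: classes separated around 0.
xs = [-2.0, -1.0, 1.0, 2.0]
y = [-1, -1, 1, 1]
K = [[u * v for v in xs] for u in xs]

alpha, g = smo(K, y, C=10.0)
w = sum(y[k] * alpha[k] * xs[k] for k in range(len(xs)))  # linear weight
```

On this data only the two inner points become support vectors, and the recovered weight `w` is close to 1.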


Working set selection, Most violating pair
Problem: How to select the working set $B$ such that
    1. much progress is made / only few iterations are needed, and
    2. few kernel evaluations are required?
We ignore Gram matrix caching / chunking / shrinking in this course and just consider the selection of $B$.
Standard algorithm: Most violating pair working set selection
    1. first index $i = \arg\max_{k \in I_{\text{up}}} y_k g_k$
    2. second index $j = \arg\min_{k \in I_{\text{down}}} y_k g_k$
First order working set selection; recall that for $\lambda \to 0$
$$D(\alpha + \lambda u_{ij}) - D(\alpha) = \lambda (y_i g_i - y_j g_j) + O(\lambda^2)$$
Requires just $O(l)$ computations.
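Most violating pair selection amounts to one pass over each index set. The sketch below uses a hypothetical function name and example values, and assumes both index sets are non-empty.

```python
def most_violating_pair(alpha, y, g, C):
    """Return (i, j, violation) with i = argmax over I_up and j = argmin over I_down."""
    l = len(alpha)
    I_up = [k for k in range(l)
            if y[k] * alpha[k] < (C if y[k] > 0 else 0.0)]
    I_down = [k for k in range(l)
              if y[k] * alpha[k] > (0.0 if y[k] > 0 else -C)]
    i = max(I_up, key=lambda k: y[k] * g[k])
    j = min(I_down, key=lambda k: y[k] * g[k])
    return i, j, y[i] * g[i] - y[j] * g[j]


# Hypothetical state of the solver (gradient values chosen for illustration).
y = [1, -1, 1, -1]
alpha = [0.0, 0.5, 1.0, 0.5]
g = [1.0, 0.5, 0.0, 0.5]
C = 1.0

i, j, violation = most_violating_pair(alpha, y, g, C)
```

The returned violation is the left-hand side of the stopping criterion, so selection and convergence check can share one pass over the data.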

Maximum gain
Maximizing the gain of the subproblem in search direction $u_{ij}$, ignoring box constraints, corresponds to maximizing w.r.t. $\lambda$
$$D(\alpha + \lambda u_{ij}) - D(\alpha) = \lambda (y_i g_i - y_j g_j) - \frac{\lambda^2}{2} (K_{ii} + K_{jj} - 2K_{ij})$$
The Newton step gives the optimal $\lambda$:
$$\lambda^* = \frac{y_i g_i - y_j g_j}{K_{ii} + K_{jj} - 2K_{ij}}$$
yielding a gain $D(\alpha + \lambda^* u_{ij}) - D(\alpha)$ of
$$\frac{(y_i g_i - y_j g_j)^2}{2 (K_{ii} + K_{jj} - 2K_{ij})}$$

Maximum gain working set selection
Idea: select $i$ and $j$ such that the gain
$$D(\alpha + \lambda^* u_{ij}) - D(\alpha) = \frac{(y_i g_i - y_j g_j)^2}{2 (K_{ii} + K_{jj} - 2K_{ij})}$$
is maximized (ignoring box constraints).
Problem: checking all $l(l-1)/2$ index pairs is not feasible.
Solution:
    1. first index $i$ is picked according to the most violating pair heuristic
    2. second index $j$ is selected to maximize the gain
Second order working set selection requires just $O(l)$ computations (given a reasonable caching strategy).
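The two-step heuristic can be sketched as follows. Two details are assumptions added for the sketch: candidates $j$ are restricted to those with $y_i g_i - y_j g_j > 0$ so that $u_{ij}$ is an ascent direction, and the curvature is floored at a small constant `tiny`. The function name and the example Gram matrix are hypothetical.

```python
def max_gain_pair(alpha, y, g, K, C, tiny=1e-12):
    """Pick i by the most violating pair rule, then j maximizing the gain."""
    l = len(alpha)
    I_up = [k for k in range(l)
            if y[k] * alpha[k] < (C if y[k] > 0 else 0.0)]
    I_down = [k for k in range(l)
              if y[k] * alpha[k] > (0.0 if y[k] > 0 else -C)]
    i = max(I_up, key=lambda k: y[k] * g[k])
    best_j, best_gain = None, 0.0
    for j in I_down:
        diff = y[i] * g[i] - y[j] * g[j]
        if diff <= 0.0:          # no first-order progress possible with this j
            continue
        curvature = max(K[i][i] + K[j][j] - 2.0 * K[i][j], tiny)
        gain = diff * diff / (2.0 * curvature)
        if gain > best_gain:
            best_j, best_gain = j, gain
    return i, best_j, best_gain


# Hypothetical example: both candidates j violate equally, but differ in curvature.
K = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
y = [1, -1, -1]
alpha = [0.0, 0.0, 0.0]
g = [1.0, 1.0, 1.0]

i, j, gain = max_gain_pair(alpha, y, g, K, C=1.0)
```

Here the first-order rule cannot distinguish $j = 1$ from $j = 2$, whereas the second-order rule prefers $j = 2$ because the smaller curvature allows a larger gain. The loop needs row $i$ of the Gram matrix and its diagonal, which is where a caching strategy pays off.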

