Working Set Selection Using Second Order Information for Training SVM
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Joint work with Rong-En Fan and Pai-Hsuen Chen
Talk at NIPS 2005 Workshop on Large Scale Kernel Machines (p.1/19)

Outline
- Large dense quadratic programming in SVM
- Decomposition methods and working set selections
- A new selection based on second order information
- Results and analysis
This work appears in JMLR 2005.

SVM Dual Optimization Problem
- A large dense quadratic program:

    min_α       (1/2) αᵀQα − eᵀα
    subject to  0 ≤ α_i ≤ C, i = 1, ..., l,
                yᵀα = 0,

  where l is the number of training data, Q is an l-by-l fully dense matrix with Q_ij = y_i y_j K(x_i, x_j), y_i = ±1, and e = [1, ..., 1]ᵀ
- Difficult as Q is fully dense in general

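For concreteness, the dual objective can be sketched in a few lines of NumPy. This is a toy illustration only: the data, the RBF kernel choice, and all function names here are our own, not part of the talk.

```python
import numpy as np

# Hypothetical toy data: four 2-D points with labels +1/-1.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def rbf_kernel(X, gamma=1.0):
    """Full l-by-l RBF kernel matrix -- dense, which is what makes
    the dual problem hard when l is large."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

K = rbf_kernel(X)
Q = (y[:, None] * y[None, :]) * K      # Q_ij = y_i y_j K(x_i, x_j)

def dual_objective(alpha, Q):
    """f(alpha) = (1/2) alpha^T Q alpha - e^T alpha."""
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

# alpha = 0 is feasible (0 <= alpha_i <= C and y^T alpha = 0) with f = 0,
# so it is the usual starting point for the solvers discussed next.
alpha0 = np.zeros(len(y))
```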
Do we really need to solve the dual?
- Maybe not: sometimes the data are too large to do so
- One can instead approximate from either the primal or the dual side
- However, in certain situations we still hope to solve the dual
- This talk: a faster algorithm and implementation

Decomposition Methods
- Work on a few variables each time; similar to coordinate-wise minimization
- Working set B; N = {1, ..., l} \ B is fixed
- Size of B is usually ≤ 100
- Sub-problem in each iteration:

    min_{α_B}   (1/2) [α_Bᵀ (α_N^k)ᵀ] [Q_BB Q_BN; Q_NB Q_NN] [α_B; α_N^k] − [e_Bᵀ e_Nᵀ] [α_B; α_N^k]
    subject to  0 ≤ (α_B)_t ≤ C, t = 1, ..., q,
                y_Bᵀ α_B = −y_Nᵀ α_N^k

Sequential Minimal Optimization (SMO)
- Consider B = {i, j}; that is, |B| = 2 (Platt, 1998)
- The extreme case of decomposition methods
- Sub-problem analytically solved; no need to use optimization software:

    min_{α_i, α_j}  (1/2) [α_i α_j] [Q_ii Q_ij; Q_ij Q_jj] [α_i; α_j] + (Q_BN α_N^k − e_B)ᵀ [α_i; α_j]
    subject to      0 ≤ α_i, α_j ≤ C,
                    y_i α_i + y_j α_j = −y_Nᵀ α_N^k

- This work focuses on how to select these two elements

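The analytic solve can be sketched as follows. This is the standard two-variable SMO update written in our own notation (the function and variable names are hypothetical, not code from the talk); it assumes a precomputed kernel matrix K, labels y, and the maintained gradient grad = ∇f(α) = Qα − e.

```python
import numpy as np

def smo_step(i, j, alpha, grad, K, y, C):
    """One analytic solve of the B = {i, j} sub-problem (a sketch).

    Parametrize d_i = y_i * d and d_j = -y_j * d so that the equality
    constraint y_i d_i + y_j d_j = 0 holds automatically; the objective
    along this line is (1/2) a d^2 - b d with a, b below."""
    a = K[i, i] + K[j, j] - 2.0 * K[i, j]     # curvature along the line
    b = -y[i] * grad[i] + y[j] * grad[j]      # negative of the slope
    d = b / a                                 # unconstrained minimizer
    # Clip d so that alpha_i + y_i d and alpha_j - y_j d stay in [0, C].
    lo = max(-alpha[i] if y[i] > 0 else alpha[i] - C,
             alpha[j] - C if y[j] > 0 else -alpha[j])
    hi = min(C - alpha[i] if y[i] > 0 else alpha[i],
             alpha[j] if y[j] > 0 else C - alpha[j])
    d = min(max(d, lo), hi)
    alpha = alpha.copy()
    alpha[i] += y[i] * d
    alpha[j] -= y[j] * d
    # Maintain grad = Q alpha - e in O(l), using Q[:, t] = y * y_t * K[:, t].
    Q_i = y * y[i] * K[:, i]
    Q_j = y * y[j] * K[:, j]
    grad = grad + (y[i] * Q_i - y[j] * Q_j) * d
    return alpha, grad
```

On a two-point toy problem with K = I the single step already reaches the optimum: the clipped d keeps yᵀα = 0 and drives the gradient-based violation to zero.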
Existing Selection by Gradient
- Let d ≡ [d_B; 0_N]. Minimize the first order approximation
    f(α^k + d) ≈ f(α^k) + ∇f(α^k)ᵀ d = f(α^k) + ∇f(α^k)_Bᵀ d_B
- That is, solve
    min_{d_B}   ∇f(α^k)_Bᵀ d_B
    subject to  y_Bᵀ d_B = 0,
                d_t ≥ 0 if α_t^k = 0, t ∈ B,   (1a)
                d_t ≤ 0 if α_t^k = C, t ∈ B,   (1b)
                −1 ≤ d_t ≤ 1, t ∈ B,
                |B| = 2

- First considered in (Joachims, 1998)
- 0 ≤ α_t ≤ C leads to (1a) and (1b):
    0 ≤ α_t^k + d_t  ⇒  d_t ≥ 0 if α_t^k = 0,
    α_t^k + d_t ≤ C  ⇒  d_t ≤ 0 if α_t^k = C
- α + d may not be feasible, but that is OK for finding working sets
- −1 ≤ d_t ≤ 1, t ∈ B keeps the linear objective from being unbounded

- Rewritten as checking the first order approximation on all two-element sub-problems:
    {i, j} = arg min_{B: |B| = 2} Sub(B), where
    Sub(B) ≡ min_{d_B}   ∇f(α^k)_Bᵀ d_B
             subject to  y_Bᵀ d_B = 0,
                         d_t ≥ 0 if α_t^k = 0, t ∈ B,
                         d_t ≤ 0 if α_t^k = C, t ∈ B,
                         −1 ≤ d_t ≤ 1, t ∈ B
- Checking all (l choose 2) possible B's?

Solution Using Gradient Information
- An O(l) procedure:
    i ∈ arg max_{t ∈ I_up(α^k)}  −y_t ∇f(α^k)_t,
    j ∈ arg min_{t ∈ I_low(α^k)} −y_t ∇f(α^k)_t,
  where
    I_up(α) ≡ {t | α_t < C, y_t = 1 or α_t > 0, y_t = −1},
    I_low(α) ≡ {t | α_t < C, y_t = −1 or α_t > 0, y_t = 1}
- This (i, j) is usually called the maximal violating pair

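The O(l) procedure above translates directly into code. A sketch (the function name is ours; grad denotes the maintained gradient ∇f(α) = Qα − e):

```python
import numpy as np

def maximal_violating_pair(alpha, grad, y, C):
    """First order working set selection:
    i = argmax over I_up  of -y_t grad_t,
    j = argmin over I_low of -y_t grad_t."""
    yg = -y * grad
    I_up = np.where(((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))[0]
    I_low = np.where(((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))[0]
    i = I_up[np.argmax(yg[I_up])]
    j = I_low[np.argmin(yg[I_low])]
    return i, j
```

At the starting point α = 0 with grad = −e we have yg = y, so i comes from the positive class and j from the negative class, each pass costing a single O(l) sweep.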
Better Working Set Selection
- Difficult: the number of iterations may go down while the cost per iteration goes up, so fewer iterations may not imply shorter training time
- A selection by second order information (Fan et al., 2005)
- As f is quadratic,
    f(α^k + d) = f(α^k) + ∇f(α^k)ᵀ d + (1/2) dᵀ ∇²f(α^k) d
               = f(α^k) + ∇f(α^k)_Bᵀ d_B + (1/2) d_Bᵀ ∇²f(α^k)_BB d_B

Selection by Second-Order Information
- Using second order information:
    min_{B: |B| = 2} Sub(B), where
    Sub(B) ≡ min_{d_B}   (1/2) d_Bᵀ ∇²f(α^k)_BB d_B + ∇f(α^k)_Bᵀ d_B
             subject to  y_Bᵀ d_B = 0,
                         d_t ≥ 0 if α_t^k = 0, t ∈ B,
                         d_t ≤ 0 if α_t^k = C, t ∈ B
- The constraint −1 ≤ d_t ≤ 1, t ∈ B is not needed if Q_BB is positive definite
- Still too expensive to check all (l choose 2) sets

A heuristic
1. Select i ∈ arg max_t { −y_t ∇f(α^k)_t | t ∈ I_up(α^k) }
2. Select j ∈ arg min_t { Sub({i, t}) | t ∈ I_low(α^k), −y_t ∇f(α^k)_t < −y_i ∇f(α^k)_i }
3. Return B = {i, j}
- The same i as when using gradient information
- Only O(l) sets B are checked to decide j

Sub({i, t}) can be easily solved
- If K_ii + K_tt − 2K_it > 0,
    Sub({i, t}) = −(−y_i ∇f(α^k)_i + y_t ∇f(α^k)_t)² / (2(K_ii + K_tt − 2K_it))
- Convergence established in (Fan et al., 2005); details not shown here

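Putting the heuristic and the closed form of Sub({i, t}) together gives the following sketch, written in our own notation (the function name is ours; tau is a small safeguard like the one LIBSVM uses when the curvature term is not positive, and grad = ∇f(α) = Qα − e):

```python
import numpy as np

def select_working_set(alpha, grad, K, y, C, tau=1e-12):
    """Second order working set selection (a sketch of the heuristic of
    Fan et al., 2005): pick i by first order information, then j by
    minimizing Sub({i, t}) = -b_it^2 / (2 a_it) over violating t in I_low."""
    yg = -y * grad
    I_up = np.where(((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1)))[0]
    I_low = np.where(((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1)))[0]
    i = I_up[np.argmax(yg[I_up])]
    best_j, best_val = -1, 0.0
    for t in I_low:
        b = yg[i] - yg[t]                 # > 0 iff (i, t) is a violating pair
        if b > 0:
            a = K[i, i] + K[t, t] - 2.0 * K[i, t]
            if a <= 0:
                a = tau                   # safeguard if K_BB is not PD
            val = -(b * b) / (2.0 * a)    # Sub({i, t}): predicted decrease
            if val < best_val:
                best_j, best_val = t, val
    return i, best_j
```

Note the design point the slides make: unlike checking all (l choose 2) pairs, only the O(l) candidates t ∈ I_low are scored, and the pair with the larger predicted objective decrease (here, the one with smaller effective curvature a for the same violation b) wins.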
Comparison of Two Selections
[Figure: iteration and time ratios between using second order information and the maximal violating pair; y-axis "Ratio" from 0 to 1.4; three curves: total #iter, time (40M cache), time (100K cache); data sets: image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a, abalone, cadata, cpusmall, space_ga]

A complete comparison is not easy
- Need to try enough data sets
- Need to consider parameter selection
- Details not shown here

More about Second-Order Selection
- What if we check all (l choose 2) sets?
[Figure: iteration ratio between checking all (l choose 2) sets and checking O(l) sets; y-axis "Ratio" from 0.4 to 1.4; two settings: parameter selection and final training; data sets: image, splice, tree, a1a, australian, breast-cancer, diabetes, fourclass, german.numer, w1a]
- Fewer iterations, but the ratio (0.7 to 0.8) is not enough to justify the much higher cost per iteration

Why not keep feasibility?
- The sub-problem
    min_{d_B}  (1/2) d_Bᵀ ∇²f(α^k)_BB d_B + ∇f(α^k)_Bᵀ d_B
  can come with two types of constraints:
    (a) y_Bᵀ d_B = 0;  d_t ≥ 0 if α_t^k = 0, d_t ≤ 0 if α_t^k = C, t ∈ B
    (b) y_Bᵀ d_B = 0;  0 ≤ α_t^k + d_t ≤ C, t ∈ B
- Related work (Lai et al., 2003a,b): heuristically select some pairs and check the function reduction while keeping feasibility
- Keeping feasibility means a higher cost in selecting working sets
- We proved: at final iterations the two types are indeed the same

Conclusions
- Finding better working sets for SVM decomposition methods is difficult
- We proposed one based on second order information
- Results are better than with the commonly used first order selection
- Implemented in LIBSVM (after version 2.8): http://www.csie.ntu.edu.tw/~cjlin/libsvm
- It replaces the maximal violating pair selection