On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines


Proceedings of the 11th European Symposium on Artificial Neural Networks (ESANN 2003), pp. 215-222, Bruges, Belgium, 2003

On the Equality of Kernel AdaTron and Sequential Minimal Optimization in Classification and Regression Tasks and Alike Algorithms for Kernel Machines

Vojislav Kecman (1), Michael Vogt (2), Te Ming Huang (1)
(1) School of Engineering, The University of Auckland, Auckland, New Zealand
(2) Institute of Automatic Control, TU Darmstadt, Darmstadt, Germany
e-mail: v.kecman@auckland.ac.nz, mvogt@iat.tu-darmstadt.de

Abstract: The paper presents the equality of the kernel AdaTron (KA) method (originating from a gradient ascent learning approach) and the sequential minimal optimization (SMO) learning algorithm (based on an analytic quadratic programming step) in designing support vector machines (SVMs) having positive definite kernels. The conditions for the equality of the two methods are established. The equality is valid for both the nonlinear classification and the nonlinear regression tasks, and it sheds new light on these seemingly different learning approaches. The paper also introduces other learning techniques related to the two mentioned approaches, such as the nonnegative conjugate gradient, the classic Gauss-Seidel (GS) coordinate ascent procedure and its derivative known as the successive over-relaxation (SOR) algorithm, as viable and usually faster training algorithms for performing nonlinear classification and regression tasks. The convergence theorem for these related iterative algorithms is proven.

1. Introduction

One of the mainstream research fields in learning from empirical data by support vector machines, and solving both the classification and the regression problems, is the implementation of incremental learning schemes when the training data set is huge. Among the several candidates that avoid the use of standard quadratic programming (QP) solvers, the two learning approaches which have recently attracted attention are the KA (Anlauf, Biehl, 1989; Frieß, Cristianini, Campbell, 1998; Veropoulos, 2001) and the SMO (Platt, 1998, 1999; Vogt, 2002). Due to its analytic foundation the SMO approach is particularly popular and at the moment the most widely used, analyzed and still heavily developing algorithm. At the same time the KA, although providing similar results in solving classification problems (in terms of both the accuracy and the training computation time required), did not attract that many devotees. There are two basic reasons for that. First, until recently (Veropoulos, 2001), the KA seemed to be restricted to classification problems only, and second, it 'lacked' the fleur of the strong theory (despite its beautiful 'simplicity' and strong convergence proofs). The KA is based on a gradient ascent technique, and this fact might have also distracted some researchers aware of the problems that gradient ascent approaches face with a possibly ill-conditioned kernel matrix. Here we show when and why the recently developed algorithms for SMO using positive definite kernels, or models without a bias term (Vogt, 2002), and the KA for both classification (Frieß, Cristianini, Campbell, 1998) and regression (Veropoulos, 2001), are identical.

Both the KA and the SMO algorithm attempt to solve the following QP problem in the case of classification (Vapnik, 1995; Cherkassky and Mulier, 1998; Cristianini and Shawe-Taylor, 2000; Kecman, 2001; Schölkopf and Smola, 2002) - maximize the dual Lagrangian

L_d(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j),    (1)

subject to

\alpha_i \geq 0, \quad i = 1, \dots, l, \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,    (2)

where l is the number of training data pairs, α_i are the dual Lagrange variables, y_i are the class labels (±1), and K(x_i, x_j) are the kernel function values. Because of noise or generic class features, there will be an overlapping of training data points. Nothing but the constraints changes in solving (1), and they become

0 \leq \alpha_i \leq C, \quad i = 1, \dots, l, \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,    (3)

where 0 < C < ∞ is a penalty parameter trading off the size of the margin against the number of misclassifications. In the case of nonlinear regression, the learning problem is the maximization of the dual Lagrangian below:

L_d(\alpha, \alpha^*) = -\varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) y_i - \frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j),    (4)

s.t. \quad \sum_{i=1}^{l} \alpha_i = \sum_{i=1}^{l} \alpha_i^*,    (4a)

0 \leq \alpha_i \leq C, \quad 0 \leq \alpha_i^* \leq C, \quad i = 1, \dots, l,    (4b)

where ε is a prescribed size of the insensitivity zone, and α_i and α_i* (i = 1, ..., l) are the Lagrange multipliers for the points above and below the regression function, respectively. Learning results in l Lagrange multiplier pairs (α_i, α_i*). Because no training data can be on both sides of the tube, at most one of α_i and α_i* can be nonzero, i.e., α_i α_i* = 0.

2. The KA and SMO learning algorithms without a bias term

It is known that positive definite kernels (such as the most popular and widely used RBF Gaussian kernels, as well as the complete polynomial ones) do not require a bias term (Evgeniou, Pontil, Poggio, 2000). Below, the KA and the SMO algorithms will be presented for such a fixed (i.e., no-) bias design problem and compared for the classification and regression cases. The equality of the two learning schemes and of the resulting models will be established. Originally, in (Platt, 1998, 1999), the SMO classification algorithm was developed for solving the problem (1) including the constraints related to the bias b. In these early publications the case when the bias b is a fixed variable was also mentioned, but the detailed analysis of a fixed bias update was not accomplished.
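To make the dual (1)-(3) concrete for readers who prefer code, the short NumPy sketch below (our illustration, not part of the original paper; the toy data, kernel width and variable names are assumptions) evaluates L_d(α) and checks the box and equality constraints. The no-bias models discussed in Section 2 simply drop the equality constraint Σ_i α_i y_i = 0.

```python
import numpy as np

def dual_lagrangian(alpha, y, K):
    """Classification dual L_d(alpha) of Eq. (1)."""
    v = y * alpha                      # element-wise y_i * alpha_i
    return alpha.sum() - 0.5 * v @ K @ v

def feasible(alpha, y, C, tol=1e-9):
    """Constraints (2)/(3): box 0 <= alpha_i <= C and sum_i alpha_i y_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    balanced = abs(alpha @ y) <= tol
    return bool(in_box and balanced)

# Illustrative usage on random data with a Gaussian (RBF) kernel, sigma = 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(rng.normal(size=20) > 0, 1.0, -1.0)
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dist)
alpha = np.zeros(20)
print(dual_lagrangian(alpha, y, K), feasible(alpha, y, C=1.0))
```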

2.1 Incremental Learning in Classification

a) Kernel AdaTron in classification

The classic AdaTron algorithm as given in (Anlauf and Biehl, 1989) was developed for a linear classifier. The KA is a variant of the classic AdaTron algorithm in the feature space of SVMs (Frieß et al., 1998). The KA algorithm solves the maximization of the dual Lagrangian (1) by implementing a gradient ascent algorithm. The update of the dual variables α_i is given as

\Delta\alpha_i = \eta \frac{\partial L_d}{\partial \alpha_i} = \eta \left( 1 - y_i \sum_{j=1}^{l} \alpha_j y_j K(x_j, x_i) \right) = \eta (1 - y_i f_i),    (5a)

where f_i is the value of the decision function f at the point x_i, i.e., f_i = \sum_{j=1}^{l} \alpha_j y_j K(x_j, x_i), and y_i denotes the value of the desired target (or the class label), which is either +1 or -1. The update of the dual variables α_i is given as

\alpha_i \leftarrow \min(\max(0, \alpha_i + \Delta\alpha_i), C), \quad i = 1, \dots, l.    (5b)

In other words, the dual variables α_i are clipped to zero if (α_i + Δα_i) < 0. In the case of the soft nonlinear classifier (C < ∞) the α_i are clipped between zero and C (0 ≤ α_i ≤ C). The algorithm converges from any initial setting of the Lagrange multipliers α_i.

b) SMO without-bias-term in classification

Recently, (Vogt, 2002) derived the update rule for the multipliers α_i that includes a detailed analysis of the Karush-Kuhn-Tucker (KKT) conditions for checking the optimality of the solution. (As mentioned above, a fixed bias update was referred to in Platt's papers.) The following update rule for α_i for a no-bias SMO algorithm was proposed:

\Delta\alpha_i = -\frac{y_i E_i}{K(x_i, x_i)} = \frac{y_i (y_i - f_i)}{K(x_i, x_i)} = \frac{1 - y_i f_i}{K(x_i, x_i)},    (6)

where E_i = f_i - y_i denotes the difference between the value of the decision function f at the point x_i and the desired target (label) y_i. Note the equality of (5a) and (6) when the learning rate in (5a) is chosen to be η = 1/K(x_i, x_i). An important part of the SMO algorithm is to check the KKT conditions with precision τ (e.g., τ = 10^{-3}) in each step. An update is performed only if

\alpha_i < C \;\wedge\; y_i E_i < -\tau, \quad \text{or} \quad \alpha_i > 0 \;\wedge\; y_i E_i > \tau.    (6a)

After an update, the same clipping operation as in (5b) is performed:

\alpha_i \leftarrow \min(\max(0, \alpha_i + \Delta\alpha_i), C), \quad i = 1, \dots, l.    (6b)

It is the nonlinear clipping operation in (5b) and in (6b) that makes the KA and the SMO without-bias-term algorithm strictly equal in solving nonlinear classification problems. This fact sheds new light on both algorithms. This equality is not that obvious in the case of a 'classic' SMO algorithm with a bias term, due to the heuristics involved in the selection of active points, which should ensure the largest increase of the dual Lagrangian L_d during the iterative optimization steps.
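The whole classification procedure (5a)-(6b) fits in a few lines. The sketch below is our own illustration of the no-bias setting (function and variable names are assumptions, not the authors' code): it sweeps over the training points, applies the step (6), which equals (5a) with η_i = 1/K(x_i, x_i), only where the KKT check (6a) fails, and clips as in (5b)/(6b).

```python
import numpy as np

def ka_smo_classification(K, y, C=1.0, tau=1e-3, max_sweeps=1000):
    """No-bias KA / SMO coordinate ascent for the dual (1) with 0 <= alpha_i <= C.

    K : (l, l) positive definite kernel matrix, K[i, j] = K(x_i, x_j)
    y : array of +/-1 class labels
    """
    l = len(y)
    alpha = np.zeros(l)
    for _ in range(max_sweeps):
        changed = False
        for i in range(l):
            f_i = (alpha * y) @ K[:, i]          # decision function value f_i
            E_i = f_i - y[i]                     # error E_i = f_i - y_i
            # KKT check (6a): update only if violated by more than tau
            if (alpha[i] < C and y[i] * E_i < -tau) or \
               (alpha[i] > 0 and y[i] * E_i > tau):
                delta = -y[i] * E_i / K[i, i]    # step (6), i.e. (5a) with eta = 1/K_ii
                alpha[i] = min(max(0.0, alpha[i] + delta), C)   # clipping (5b)/(6b)
                changed = True
        if not changed:                          # all KKT conditions met within tau
            break
    return alpha
```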

2.2 Incremental Learning in Regression

Similarly to the case of classification, there is a strict equality between the KA and the SMO algorithm when positive definite kernels are used for nonlinear regression.

a) Kernel AdaTron in regression

The first extension of the Kernel AdaTron algorithm to regression is presented in (Veropoulos, 2001) as the following gradient ascent update rules for α_i and α_i*:

\Delta\alpha_i = \eta \frac{\partial L_d}{\partial \alpha_i} = \eta \left( y_i - \varepsilon - \sum_{j=1}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) \right) = \eta (y_i - \varepsilon - f_i) = -\eta (E_i + \varepsilon),    (7a)

\Delta\alpha_i^* = \eta \frac{\partial L_d}{\partial \alpha_i^*} = \eta \left( -y_i - \varepsilon + \sum_{j=1}^{l} (\alpha_j - \alpha_j^*) K(x_j, x_i) \right) = \eta (f_i - y_i - \varepsilon) = \eta (E_i - \varepsilon),    (7b)

where y_i is the measured value for the input x_i, ε is the prescribed insensitivity zone, and E_i = f_i - y_i stands for the difference between the regression function f at the point x_i and the desired target value y_i at this point. The calculation of the gradient above does not take into account the geometric reality that no training data can be on both sides of the tube. In other words, it does not use the fact that at most one of α_i and α_i* can be nonzero, i.e., that α_i α_i* = 0 must be fulfilled in each iteration step. Below we derive the gradients of the dual Lagrangian L_d accounting for this geometry. This new formulation of the KA algorithm strictly equals the SMO method, and it is given as

\frac{\partial L_d}{\partial \alpha_i} = -K(x_i, x_i)\,\alpha_i - \sum_{j=1, j \neq i}^{l} (\alpha_j - \alpha_j^*) K(x_i, x_j) + y_i - \varepsilon + K(x_i, x_i)\,\alpha_i^* - K(x_i, x_i)\,\alpha_i^*
  = -K(x_i, x_i)\,\alpha_i^* - (\alpha_i - \alpha_i^*) K(x_i, x_i) - \sum_{j=1, j \neq i}^{l} (\alpha_j - \alpha_j^*) K(x_i, x_j) + y_i - \varepsilon
  = -K(x_i, x_i)\,\alpha_i^* + y_i - \varepsilon - f_i = -\left( K(x_i, x_i)\,\alpha_i^* + E_i + \varepsilon \right).    (8a)

For the α_i* multipliers, the value of the gradient is

\frac{\partial L_d}{\partial \alpha_i^*} = -K(x_i, x_i)\,\alpha_i + E_i - \varepsilon.    (8b)

The update value for α_i is now

\Delta\alpha_i = \eta \frac{\partial L_d}{\partial \alpha_i} = -\eta \left( K(x_i, x_i)\,\alpha_i^* + E_i + \varepsilon \right),    (9a)

\alpha_i \leftarrow \alpha_i + \Delta\alpha_i = \alpha_i + \eta \frac{\partial L_d}{\partial \alpha_i} = \alpha_i - \eta \left( K(x_i, x_i)\,\alpha_i^* + E_i + \varepsilon \right).    (9b)

For the learning rate η = 1/K(x_i, x_i) the gradient ascent learning KA is defined as

\alpha_i \leftarrow \alpha_i - \alpha_i^* - \frac{E_i + \varepsilon}{K(x_i, x_i)}.    (10a)

Similarly, the update rule for α_i* is

\alpha_i^* \leftarrow \alpha_i^* - \alpha_i + \frac{E_i - \varepsilon}{K(x_i, x_i)}.    (10b)

Same as in classification, α_i and α_i* are clipped between zero and C:

\alpha_i \leftarrow \min(\max(0, \alpha_i), C), \quad i = 1, \dots, l,    (11a)

\alpha_i^* \leftarrow \min(\max(0, \alpha_i^*), C), \quad i = 1, \dots, l.    (11b)
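As a concrete reading of (10a)-(11b), the sketch below (our illustration; the names and the sweep scheme are assumptions) updates both multipliers of each point from their current values and then clips them. Note that the two raw updates in (10a) and (10b) sum to -2ε/K(x_i, x_i) ≤ 0, so after clipping at zero at most one of α_i, α_i* stays positive, which is exactly the tube condition α_i α_i* = 0.

```python
import numpy as np

def ka_regression(K, y, C=1.0, eps=0.1, max_sweeps=1000):
    """Kernel AdaTron regression sweep with the geometry-aware updates (10a), (10b)
    followed by the clipping (11a), (11b).

    K : (l, l) positive definite kernel matrix
    y : measured target values
    """
    l = len(y)
    alpha = np.zeros(l)        # multipliers for points above the tube
    alpha_s = np.zeros(l)      # starred multipliers, points below the tube
    for _ in range(max_sweeps):
        for i in range(l):
            f_i = (alpha - alpha_s) @ K[:, i]     # regression function value at x_i
            E_i = f_i - y[i]
            a_new = alpha[i] - alpha_s[i] - (E_i + eps) / K[i, i]    # (10a)
            s_new = alpha_s[i] - alpha[i] + (E_i - eps) / K[i, i]    # (10b)
            alpha[i] = min(max(0.0, a_new), C)                       # (11a)
            alpha_s[i] = min(max(0.0, s_new), C)                     # (11b)
    return alpha, alpha_s
```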

b) SMO without-bias-term in regression

The first algorithm for the SMO without-bias-term in regression (together with a detailed analysis of the KKT conditions for checking the optimality of the solution) is derived in (Vogt, 2002). The following learning rules for the updates of the Lagrange multipliers α_i and α_i* were proposed:

\alpha_i \leftarrow \alpha_i - \alpha_i^* - \frac{E_i + \varepsilon}{K(x_i, x_i)},    (12a)

\alpha_i^* \leftarrow \alpha_i^* - \alpha_i + \frac{E_i - \varepsilon}{K(x_i, x_i)}.    (12b)

The equality of equations (10a, b) and (12a, b) is obvious when the learning rate, as presented above in (10a, b), is chosen to be η = 1/K(x_i, x_i). Thus, in both classification and regression, the optimal learning rate is not necessarily equal for all training data pairs. For a Gaussian kernel, η = 1 is the same for all data points, while for a complete n-th order polynomial kernel each data point has a different learning rate η_i = 1/(x_i^T x_i + 1)^n. Similar to classification, a joint update of α_i and α_i* is performed only if the KKT conditions are violated by at least τ, i.e., if

\alpha_i < C \;\wedge\; \varepsilon + E_i < -\tau, \quad \text{or}
\alpha_i > 0 \;\wedge\; \varepsilon + E_i > \tau, \quad \text{or}
\alpha_i^* < C \;\wedge\; \varepsilon - E_i < -\tau, \quad \text{or}
\alpha_i^* > 0 \;\wedge\; \varepsilon - E_i > \tau.    (13)

After the changes, the same clipping operations as defined in (11) are performed:

\alpha_i \leftarrow \min(\max(0, \alpha_i), C), \quad i = 1, \dots, l,    (14a)

\alpha_i^* \leftarrow \min(\max(0, \alpha_i^*), C), \quad i = 1, \dots, l.    (14b)

The KA learning as formulated in this paper and the SMO algorithm without-bias-term for solving regression tasks are strictly equal in terms of both the number of iterations required and the final values of the Lagrange multipliers. The equality is strict despite the fact that the implementation is slightly different. Namely, in every iteration step the KA algorithm updates both weights α_i and α_i* without any check of whether the KKT conditions are fulfilled, while the SMO performs an update according to equations (13).

3. The Coordinate Ascent Based Learning for Nonlinear Classification and Regression Tasks

When positive definite kernels are used, the learning problem is the same for both tasks. In vector-matrix notation, in a dual space, the learning is represented as:

maximize \quad L_d(\alpha) = -0.5\, \alpha^T K \alpha + f^T \alpha    (15)

s.t. \quad 0 \leq \alpha_i \leq C, \quad i = 1, \dots, n,    (16)

where in classification n = l and the matrix K is an (l, l) symmetric positive definite matrix, while in regression n = 2l and K is a (2l, 2l) symmetric positive semidefinite one.
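To make the matrix form (15)-(16) concrete, the sketch below assembles K and f for both tasks. For classification, K_ij = y_i y_j K(x_i, x_j) and f = 1, as used later in the convergence proof. For regression, the paper only states the size (2l, 2l) and the semidefiniteness of K; the particular block arrangement below is our assumption of one standard way to stack [α; α*] so that (15) reproduces the dual (4).

```python
import numpy as np

def classification_system(Kxx, y):
    """Classification instance of (15)-(16): K_ij = y_i y_j K(x_i, x_j), f_i = 1, n = l."""
    K = np.outer(y, y) * Kxx
    f = np.ones(len(y))
    return K, f

def regression_system(Kxx, y, eps):
    """Regression instance of (15)-(16) in the stacked variable a = [alpha; alpha*], n = 2l.

    Assumed block form (consistent with the dual (4), but not spelled out in the paper):
        K = [[ Kxx, -Kxx],        f = [ y - eps,
             [-Kxx,  Kxx]],            -y - eps],
    so that -0.5 a^T K a + f^T a equals L_d(alpha, alpha*) of Eq. (4).
    """
    K = np.block([[Kxx, -Kxx], [-Kxx, Kxx]])   # symmetric, positive semidefinite
    f = np.concatenate([y - eps, -y - eps])
    return K, f
```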

Note that the constraints (16) define a convex subspace over which the concave dual Lagrangian should be maximized. It is very well known that the vector α may be looked at as the solution of a system of linear equations

K \alpha = f    (17)

subject to the same constraints as given by (16). Thus, it may seem natural to solve (17), subject to (16), by applying some of the well known and established techniques for solving a general linear system of equations. The size of the training data set and the constraints (16) eliminate direct techniques. Hence, one has to resort to iterative approaches in solving the problems above. There are three possible iterative avenues that can be followed: the use of the Non-Negative Least Squares (NNLS) technique (Lawson and Hanson, 1974), the application of the Non-Negative Conjugate Gradient (NNCG) method (Hestenes, 1980), and the implementation of Gauss-Seidel (GS), i.e., of the related Successive Over-Relaxation technique (SOR). The first two methods handle the non-negativity constraints only. Thus, they are not suitable for solving 'soft' tasks, when a penalty parameter C < ∞ is used, i.e., when there is an upper bound on the maximal value of α_i. Nevertheless, in the case of nonlinear regression, one can apply NNLS and NNCG by taking C = ∞ and compensating (i.e., smoothing or 'softening' the solution) by increasing the insensitivity zone ε. However, the two methods (namely NNLS and NNCG) are not suitable for solving soft margin (C < ∞) classification problems in their present form, because there is no other parameter that can be used for 'softening' the margin. Here we show how to extend the application of GS and SOR to both the nonlinear classification and the nonlinear regression tasks.

The Gauss-Seidel method solves (17) by using the i-th equation to update the i-th unknown, doing it iteratively, i.e., starting in the k-th step with the first equation to compute α_1^{k+1}, then using the second equation to calculate α_2^{k+1} with the new α_1^{k+1} and the α_j^k (j > 2), and so on. The iterative learning takes the following form:

\alpha_i^{k+1} = \frac{1}{K_{ii}} \left( f_i - \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} - \sum_{j=i+1}^{n} K_{ij} \alpha_j^{k} \right)
  = \alpha_i^{k} + \frac{1}{K_{ii}} \left( f_i - \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} - \sum_{j=i}^{n} K_{ij} \alpha_j^{k} \right)
  = \alpha_i^{k} + \frac{1}{K_{ii}} \left. \frac{\partial L_d}{\partial \alpha_i} \right|^{k+1},    (18)

where we use the fact that the term within the second bracket (called the residual r_i in the mathematical references) is the i-th element of the gradient of the dual Lagrangian L_d given in (15) at the (k+1)-th iteration step. Equation (18) above shows that the GS method is a coordinate gradient ascent procedure, just as the KA and the SMO are. The KA and SMO for positive definite kernels equal the GS! Note that the optimal learning rate used in both the KA algorithm and in the SMO without-bias-term approach is exactly equal to the coefficient 1/K_{ii} in the GS method.
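A minimal sketch of the constrained GS sweep (18) with the clipping used throughout the paper (our illustration; the stopping rule and tolerances are assumptions):

```python
import numpy as np

def gauss_seidel_box(K, f, C=1.0, max_sweeps=1000, tol=1e-6):
    """Gauss-Seidel coordinate ascent (18) for max -0.5 a^T K a + f^T a,
    with the box constraints 0 <= a_i <= C of (16) enforced by clipping."""
    n = len(f)
    a = np.zeros(n)
    for _ in range(max_sweeps):
        largest_step = 0.0
        for i in range(n):
            r_i = f[i] - K[i, :] @ a                 # residual = dL_d/da_i at current a
            a_new = min(max(0.0, a[i] + r_i / K[i, i]), C)   # GS step (18) + clipping
            largest_step = max(largest_step, abs(a_new - a[i]))
            a[i] = a_new
        if largest_step < tol:                       # no coordinate moved noticeably
            break
    return a
```

With K and f built as for the classification case (K_ij = y_i y_j K(x_i, x_j), f = 1), the per-coordinate step r_i/K_ii equals (1 - y_i f_i)/K(x_i, x_i), i.e., exactly the KA/SMO step (6) of Section 2 written in GS form.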

Based on this equality, the convergence theorem for the KA, SMO and GS (i.e., SOR) in solving (15) subject to the constraints (16) can be stated and proved as follows.

Theorem: For SVMs with positive definite kernels, the iterative learning algorithms KA, i.e., SMO, i.e., GS, i.e., SOR, in solving the nonlinear classification and regression tasks (15) subject to the constraints (16), converge starting from any initial choice of α^0.

Proof: The proof is based on the very well known theorem of convergence of the GS method for symmetric positive definite matrices in solving (17) without constraints (Ostrowski, 1966). First note that for positive definite kernels, the matrix K created by the terms y_i y_j K(x_i, x_j) in the second sum in (1), and involved in solving the classification problem, is also positive definite. In regression tasks K is a symmetric positive semidefinite (meaning still convex) matrix, which after a mild regularization given as (K ← K + λI, λ ~ 1e-2) becomes a positive definite one. (Note that the proof in the case of regression does not need regularization at all, but there is no space here to go into these details.) Hence, the learning without the constraints (16) converges, starting from any initial point α^0, and each point in an n-dimensional search space for the multipliers α_i is a viable starting point ensuring convergence of the algorithm to the maximum of the dual Lagrangian L_d. This, naturally, includes all the (starting) points within, or on the boundary of, any convex subspace of the search space, ensuring the convergence of the algorithm to the maximum of the dual Lagrangian L_d over the given subspace. The constraints imposed by (16), preventing the variables α_i from being negative or bigger than C, and implemented by the clipping operators above, define such a convex subspace. Thus, each 'clipped' multiplier value α_i defines a new starting point of the algorithm, guaranteeing the convergence to the maximum of L_d over the subspace defined by (16). For a convex constraining subspace such a constrained maximum is unique. Q.E.D.

Due to the lack of space, we do not go into a discussion of the convergence rate here and leave it for another occasion. It should only be mentioned that both KA and SMO (i.e., GS and SOR) for positive definite kernels have been successfully applied to many problems (see the references given here, as well as many others, benchmarking the mentioned methods on various data sets). Finally, let us just mention that the standard extension of the GS method is the method of successive over-relaxation, which can significantly reduce the number of iterations required by a proper choice of the relaxation parameter ω. The SOR method uses the following updating rule:

\alpha_i^{k+1} = \alpha_i^{k} + \frac{\omega}{K_{ii}} \left( f_i - \sum_{j=1}^{i-1} K_{ij} \alpha_j^{k+1} - \sum_{j=i}^{n} K_{ij} \alpha_j^{k} \right) = \alpha_i^{k} + \frac{\omega}{K_{ii}} \left. \frac{\partial L_d}{\partial \alpha_i} \right|^{k+1},    (19)

and similarly to the KA, SMO, and GS, its convergence is guaranteed.
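Relative to the GS sweep, the only change in (19) is the relaxation factor ω. A minimal sketch under the same assumptions as before (the default ω, the stopping rule and the names are ours, not the paper's):

```python
import numpy as np

def sor_box(K, f, C=1.0, omega=1.5, max_sweeps=1000, tol=1e-6):
    """SOR coordinate ascent (19): the GS step scaled by 0 < omega < 2,
    with the same clipping to the box 0 <= a_i <= C of (16)."""
    n = len(f)
    a = np.zeros(n)
    for _ in range(max_sweeps):
        largest_step = 0.0
        for i in range(n):
            r_i = f[i] - K[i, :] @ a                               # residual, as in (18)
            a_new = min(max(0.0, a[i] + omega * r_i / K[i, i]), C) # (19) + clipping
            largest_step = max(largest_step, abs(a_new - a[i]))
            a[i] = a_new
        if largest_step < tol:
            break
    return a
```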

4. Conclusions

Both the KA and the SMO algorithms were recently developed and introduced as alternatives for solving the quadratic programming problem that arises while training support vector machines on huge data sets. It was shown that, when positive definite kernels are used, the two algorithms are identical in their analytic form and numerical implementation. In addition, for positive definite kernels both algorithms are strictly identical with a classic iterative GS (optimal coordinate ascent) learning and its extension SOR. Until now, these facts were blurred mainly due to the different settings in which the learning problems were posed and due to the 'heavy' heuristics involved in an SMO implementation, which shadowed an insight into the possible identity of the methods. It is shown that in the so-called no-bias SVMs, both the KA and the SMO procedure are coordinate ascent based methods. Finally, due to the many ways in which all three algorithms (KA, SMO and GS, i.e., SOR) can be implemented, there may be some differences in their overall behaviour. The introduction of the relaxation parameter 0 < ω < 2 will speed up the algorithm. The exact optimal value ω_opt is problem dependent.

Acknowledgment: The results presented were initiated during the stay of the first author at Prof. Rolf Isermann's Institute and sponsored by the Deutsche Forschungsgemeinschaft (DFG). He is thankful to both Prof. Rolf Isermann and the DFG for all the support during this stay.

5. References

1. Anlauf, J. K., Biehl, M., The AdaTron - an adaptive perceptron algorithm. Europhysics Letters, 10(7), pp. 687-692, 1989
2. Cherkassky, V., Mulier, F., Learning From Data: Concepts, Theory and Methods, John Wiley & Sons, New York, NY, 1998
3. Cristianini, N., Shawe-Taylor, J., An Introduction to Support Vector Machines and other kernel-based learning methods, Cambridge University Press, Cambridge, UK, 2000
4. Evgeniou, T., Pontil, M., Poggio, T., Regularization networks and support vector machines, Advances in Computational Mathematics, 13, pp. 1-50, 2000
5. Frieß, T.-T., Cristianini, N., Campbell, I. C. G., The Kernel-Adatron: a Fast and Simple Learning Procedure for Support Vector Machines. In Shavlik, J., editor, Proceedings of the 15th International Conference on Machine Learning, Morgan Kaufmann, pp. 188-196, San Francisco, CA, 1998
6. Kecman, V., Learning and Soft Computing, Support Vector Machines, Neural Networks, and Fuzzy Logic Models, The MIT Press, Cambridge, MA, (http://www.support-vector.ws), 2001
7. Lawson, C. L., Hanson, R. J., Solving Least Squares Problems, Prentice-Hall, Englewood Cliffs, NJ, 1974
8. Ostrowski, A. M., Solutions of Equations and Systems of Equations, 2nd ed., Academic Press, New York, 1966
9. Platt, J. C., Sequential minimal optimization: A fast algorithm for training support vector machines. TR MSR-TR-98-14, Microsoft Research, 1998
10. Platt, J. C., Fast Training of Support Vector Machines using Sequential Minimal Optimization. Ch. 12 in Advances in Kernel Methods - Support Vector Learning, edited by B. Schölkopf, C. Burges, A. Smola, The MIT Press, Cambridge, MA, 1999
11. Schölkopf, B., Smola, A., Learning with Kernels - Support Vector Machines, Regularization, Optimization, and Beyond, The MIT Press, Cambridge, MA, 2002
12. Veropoulos, K., Machine Learning Approaches to Medical Decision Making, PhD Thesis, The University of Bristol, Bristol, UK, 2001
13. Vapnik, V. N., The Nature of Statistical Learning Theory, Springer Verlag, New York, NY, 1995
14. Vogt, M., SMO Algorithms for Support Vector Machines without Bias, Institute Report, Institute of Automatic Control, TU Darmstadt, Darmstadt, Germany, (http://w3.rt.e-technik.tu-darmstadt.de/~vogt/), 2002