Speeding up the IRWLS convergence to the SVM solution


Fernando Pérez-Cruz
Gatsby Computational Neuroscience Unit, University College London
Alexandra House, 17 Queen Square, London WC1N 3AR, United Kingdom
E-mail: fernando@gatsby.ucl.ac.uk

Antonio Artés-Rodríguez
Signal Theory and Communications Department, University Carlos III in Madrid
Avda. Universidad, Leganés (Madrid), Spain
E-mail: antonio@ieee.org

Abstract

We present the proof of convergence of the Iterative Re-Weighted Least Squares (IRWLS) procedure to the SVM solution and use it to propose two modifications which significantly reduce the runtime complexity of the IRWLS. We show by means of computer experiments that the convergence can be sped up between two and eight times compared to the standard IRWLS procedure.

I. INTRODUCTION

Support vector machines (SVMs) are state-of-the-art tools for linear and nonlinear input-output knowledge discovery [1], [2]. The SVM relies on the minimization of a quadratic problem, which is frequently solved using Quadratic Programming (QP) [3]. The Iterative Re-Weighted Least Squares (IRWLS) procedure for solving SVMs for classification was first introduced in [4], [5], and it was used in [6] to construct the fastest SVM solver of its time. It solves a sequence of weighted least squares problems that, unlike other least squares procedures such as Lagrangian SVMs [7] or Least Squares SVMs [8], leads to the true SVM solution, as we have already proven in [9], where we needed a slight modification of the formulation that appears in [4], [5].

In this paper, we use the proposed proof of convergence to modify the IRWLS procedure and speed up its convergence. This modification is plausible because the IRWLS is based on an approximation to the SVM loss function which is not very accurate. The proposed approximations are quadratic as well, so the nature of the IRWLS algorithm is not significantly modified.

The rest of the paper is organized as follows. We show the standard IRWLS procedure for solving the SVM in Section II, together with the outline of the proof of convergence. In Section III, we propose two modifications to the loss function of the IRWLS procedure. We demonstrate in Section IV, by means of computer experiments, the advantages of the proposed modifications compared to the standard IRWLS procedure. We conclude the paper with some final remarks in Section V.

II. IRWLS ALGORITHM FOR SUPPORT VECTOR CLASSIFIERS

The support vector classifier (SVC) seeks to compute the dependency between a set of patterns x_i in R^d (i = 1,...,n) and their corresponding labels y_i in {+1,-1}, given a transformation to a feature space, φ(.): R^d -> R^H with d <= H. The SVC solves

  min_{w, ξ_i, b}  (1/2)||w||^2 + C Σ_{i=1}^n ξ_i
  subject to:  y_i(φ^T(x_i) w + b) >= 1 - ξ_i,   ξ_i >= 0,

where w and b define the linear classifier in the feature space (nonlinear in the input space, unless φ(x) = x) and C is the penalty applied over training errors. This problem is equivalent to the following unconstrained problem, in which we need to minimize

  L_P(w,b) = (1/2)||w||^2 + C Σ_{i=1}^n L(u_i)

with respect to w and b, where u_i = 1 - y_i(φ^T(x_i) w + b) and L(u) = max(u, 0). To prove the convergence of the algorithm, we need L_P(w,b) to be not only continuous but also differentiable; therefore we replace L by a convex approximation:

  L(u) = 0,            u < 0
         K u^2 / 2,    0 <= u < 1/K                                  (1)
         u - 1/(2K),   u >= 1/K

which tends to max(u,0) as K approaches infinity (lim_{K->inf} L(u) = max(u,0)).
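To make the loss in (1) concrete, the following short sketch (in Python with NumPy; the function names are ours, not the paper's) evaluates the smoothed loss and its derivative for a given K:

import numpy as np

def smoothed_hinge(u, K):
    # Smoothed SVM loss of Eq. (1): 0 for u < 0, K*u^2/2 on [0, 1/K), u - 1/(2K) beyond.
    u = np.asarray(u, dtype=float)
    return np.where(u < 0.0, 0.0,
                    np.where(u < 1.0 / K, 0.5 * K * u ** 2, u - 0.5 / K))

def smoothed_hinge_deriv(u, K):
    # Its derivative dL/du: 0 for u < 0, K*u on [0, 1/K), 1 beyond.
    u = np.asarray(u, dtype=float)
    return np.where(u < 0.0, 0.0, np.where(u < 1.0 / K, K * u, 1.0))

For large K the first function is numerically indistinguishable from max(u, 0), which is the limit used in the text.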

As the problem is convex, the SVM solution is achieved at the w* and b* that make the gradient vanish, i.e.

  w* - C Σ_{i=1}^n (dL/du|_{u_i*}) y_i φ(x_i) = 0
  -C Σ_{i=1}^n (dL/du|_{u_i*}) y_i = 0                               (2)

where u_i* = 1 - y_i(φ^T(x_i) w* + b*). Optimization problems are solved using iterative procedures that, in each iteration, rely on the previous solution (w^k and b^k, in our case) to obtain the following one, until the optimal solution has been reached. To construct the IRWLS procedure, we modify L_P using a first-order Taylor expansion of L over the previous solution, leading to:

  L_P'(w,b) = (1/2)||w||^2 + C Σ_{i=1}^n [ L(u_i^k) + (dL/du|_{u_i^k}) (u_i - u_i^k) ]

where u_i^k = 1 - y_i(φ^T(x_i) w^k + b^k), L_P'(w^k,b^k) = L_P(w^k,b^k) and ∇L_P'(w^k,b^k) = ∇L_P(w^k,b^k). Now, we construct a quadratic approximation imposing that L_P''(w^k,b^k) = L_P'(w^k,b^k) and ∇L_P''(w^k,b^k) = ∇L_P'(w^k,b^k), leading to:

  L_P''(w,b) = (1/2)||w||^2 + C Σ_{i=1}^n [ L(u_i^k) + (dL/du|_{u_i^k}) (u_i^2 - (u_i^k)^2)/(2 u_i^k) ]
             = (1/2)||w||^2 + Σ_{i=1}^n [ (a_i/2) u_i^2 + d_i ]
             = (1/2)||w||^2 + Σ_{i=1}^n L_i''(u_i)                   (3)

where

  a_i = (C/u_i^k) dL/du|_{u_i^k} = 0,        u_i^k < 0
                                   KC,       0 <= u_i^k < 1/K
                                   C/u_i^k,  u_i^k >= 1/K

  d_i = 0,                     u_i^k < 1/K
        C(K u_i^k - 1)/(2K),   u_i^k >= 1/K

and L_i'' is a quadratic approximation to L in (1). The IRWLS procedure consists in: minimizing (3), which is a regularized least squares functional; recomputing a_i with the obtained solution; and continuing to iterate until the SVM solution has been reached. The solution to (3) can be readily obtained by equating to zero its partial derivatives with respect to w and b:

  ∂L_P''(w,b)/∂w = w - Σ_{i=1}^n φ(x_i) y_i a_i (1 - y_i(φ^T(x_i) w + b)) = 0
  ∂L_P''(w,b)/∂b = - Σ_{i=1}^n y_i a_i (1 - y_i(φ^T(x_i) w + b)) = 0            (4)

This can be written, more conveniently, in matrix form:

  [ Φ^T D_a Φ + I    Φ^T a ] [ w ]   [ Φ^T D_a y ]
  [ a^T Φ            a^T 1 ] [ b ] = [ a^T y     ]                   (5)

where Φ = [φ(x_1), φ(x_2), ..., φ(x_n)]^T, y = [y_1,...,y_n]^T, a = [a_1,...,a_n]^T, (D_a)_{ij} = a_i δ_{ij} (i,j = 1,...,n), I is the identity matrix and 1 is a column vector of n ones. This system can be solved using kernels, as the regular SVM is, by imposing that w = Σ_i φ(x_i) y_i α_i and Σ_i α_i y_i = 0. These conditions can be obtained from the regular SVM solution (KKT conditions); see [2] for further details. The system in (5) becomes

  [ H + D_a^{-1}   y ] [ α ]   [ 1 ]
  [ y^T            0 ] [ b ] = [ 0 ]                                 (6)

where (H)_{ij} = y_i y_j k(x_i, x_j) and k(x_i, x_j) = φ^T(x_i) φ(x_j) is the kernel of the nonlinear transformation φ(.) [2]. The steps to derive (6) from (5) can be found in [5]. Once we have solved (6), we can compute u_i = 1 - y_i(Σ_{j=1}^n y_j α_j k(x_j, x_i) + b) and recalculate the weights a_i, iterating until the algorithm has converged.
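For illustration, a minimal sketch of one weighted least-squares step is given below. It assumes a precomputed kernel matrix and, as is common in practice, builds the system (6) only over the samples with a_i > 0; the names (irwls_weights, irwls_step, K_mat) and this restriction are our own choices rather than prescriptions taken from the paper:

import numpy as np

def irwls_weights(u, C, K):
    # Weights a_i = (C/u_i) dL/du at u_i: 0 for u_i < 0, C*K on [0, 1/K), C/u_i beyond.
    u = np.asarray(u, dtype=float)
    a = np.zeros_like(u)
    a[(u >= 0.0) & (u < 1.0 / K)] = C * K
    a[u >= 1.0 / K] = C / u[u >= 1.0 / K]
    return a

def irwls_step(K_mat, y, a):
    # Solve the linear system (6) for (alpha, b), keeping only the samples with a_i > 0.
    sv = a > 0.0
    n_sv = int(sv.sum())
    H = (y[sv, None] * y[None, sv]) * K_mat[np.ix_(sv, sv)]   # label-signed kernel block
    A = np.zeros((n_sv + 1, n_sv + 1))
    A[:n_sv, :n_sv] = H + np.diag(1.0 / a[sv])
    A[:n_sv, -1] = y[sv]
    A[-1, :n_sv] = y[sv]
    rhs = np.concatenate([np.ones(n_sv), [0.0]])
    sol = np.linalg.solve(A, rhs)
    alpha = np.zeros_like(a)
    alpha[sv] = sol[:n_sv]
    b = sol[-1]
    u = 1.0 - y * (K_mat @ (y * alpha) + b)                   # new margins u_i
    return alpha, b, u

The weights would then be recomputed from the returned u and the step repeated until the solution stops changing.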
A. Convergence of the IRWLS to the SVC solution

To prove that the IRWLS actually delivers the SVC solution when it stops, we need to demonstrate the following items: the sequence (w^1,b^1),...,(w^k,b^k),... converges to (w_op, b_op); and w_op = w* and b_op = b*.

First, we need to show that the sequence of intermediate solutions has the optimal solution as its limit point. Line-search algorithms advance towards the optimum by looking, in the minimized functional, for a descending direction p_k and modifying the previous solution z_k by an amount η_k to obtain the following one, z_{k+1} = z_k + η_k p_k. The Wolfe conditions [10] ensure that line-search methods make sufficient progress in each iteration, so that the limit point is reached with any required precision:

  L_P(z_k + η_k p_k) <= L_P(z_k) + c_1 η_k ∇L_P(z_k)^T p_k
  ∇L_P(z_k + η_k p_k)^T p_k >= c_2 ∇L_P(z_k)^T p_k

for 0 < c_1 < c_2 < 1. The Wolfe conditions can be applied to the IRWLS procedure because we can describe it as a line-search method, with z_k = [(w^k)^T, b^k]^T and p_k = [(w^s - w^k)^T, (b^s - b^k)]^T, where w^s and b^s represent the minimum at each step of the weighted least squares problem in (3), i.e. the solution to the linear system of equations in (5). We now outline the most relevant steps of the proof of convergence; the full proof can be found in [9].

The first condition can be rewritten as L_P(z_{k+1}) < L_P(z_k), which is known as the strictly decreasing property. We can demonstrate that the IRWLS procedure fulfills this property by noting that:

  L_P(z_k) = L_P''(z_k) >= L_P''(z_{k+1}) >= L_P(z_{k+1})

The equality holds by construction of L_P'', and the first inequality holds for η_k in [0,1] because z_{k+1} is a convex combination of the actual value z_k and the minimum of L_P'', which is a convex functional. It holds strictly if η_k > 0 and z^s is not equal to z_k (if z^s = z_k, then we have attained the SVM solution, as we will show at the end of this section).
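The second inequality in this chain, treated next, rests on L_i''(u) being an upper bound of C L(u) whenever u_i^k >= 0. A quick numeric check of that bound, using the coefficients a_i and d_i defined above (the script and its parameter values are illustrative assumptions of ours, not taken from the paper), is:

import numpy as np

def smoothed_hinge(u, K):
    # Smoothed loss of Eq. (1).
    return np.where(u < 0.0, 0.0, np.where(u < 1.0 / K, 0.5 * K * u ** 2, u - 0.5 / K))

def quad_approx(u, u_k, C, K):
    # Standard IRWLS approximation L''_i(u) = a_i u^2/2 + d_i built at a point u_k >= 0.
    a = C * K if u_k < 1.0 / K else C / u_k
    d = 0.0 if u_k < 1.0 / K else C * (K * u_k - 1.0) / (2.0 * K)
    return 0.5 * a * u ** 2 + d

C, K, u_k = 1.0, 100.0, 0.7          # illustrative values only
u = np.linspace(-2.0, 3.0, 1001)
assert np.all(quad_approx(u, u_k, C, K) >= C * smoothed_hinge(u, K) - 1e-12)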

To show that the second inequality holds, it is sufficient to show that C L(u_i^{k+1}) <= L_i''(u_i^{k+1}) for all i = 1,...,n.

Fig. 1. The solid line represents the actual SVM loss function L. The dash-dotted and dashed lines represent, respectively, L_i'' for a sample with u_i^k >= 0 and for one with u_i^k < 0.

In Figure 1 we have plotted L and L_i'' for u_i^k >= 0 and for u_i^k < 0. From this plot one can easily see that C L(u) <= L_i''(u) for any u if u_i^k >= 0, whereas L_i''(u) = 0 <= C L(u) if u_i^k < 0. Therefore, a sufficient (although not necessary) condition to ensure the strictly decreasing property is that u_i^{k+1} <= 0 whenever the corresponding u_i^k was less than 0. As u_i depends linearly on w and b, we can find the largest η_k which ensures C L(u_i^{k+1}) <= L_i''(u_i^{k+1}) for all i = 1,...,n by setting it equal to η_k = min_{i in S} u_i^k/(u_i^k - u_i^s), where S = {i | u_i^k < 0 and u_i^s > 0}. If S is empty, then η_k = 1. It can be seen that, to ensure the convergence of the IRWLS, η_k cannot always be equal to one and, in some iterations, it needs to be restricted to enforce the strictly decreasing property. This is the modification needed in the original IRWLS procedure to ensure the convergence of the algorithm.

The second condition can be rewritten as ∇L_P(z_{k+1})^T p_k > ∇L_P(z_k)^T p_k and is known as the sufficient decreasing property, because it ensures that the optimum can be found with any required precision in a finite number of steps. After some nontrivial algebraic manipulations, detailed in [9], we can rewrite:

  ∇L_P(z_{k+1})^T p_k = ( ||w^{k+1} - w^k||^2/2 + \bar{L}_P(w^{k+1},b^{k+1}) - \bar{L}_P(w^k,b^k) ) / η_k

  ∇L_P(z_k)^T p_k = ( -||w^{k+1} - w^k||^2/2 - L_P'(w^k,b^k) + L_P'(w^{k+1},b^{k+1}) ) / η_k

where we have defined

  \bar{L}_P(w,b) = (1/2)||w||^2 + C Σ_{i=1}^n [ L(u_i^{k+1}) + (dL/du|_{u_i^{k+1}}) (u_i - u_i^{k+1}) ]

which is equivalent to L_P'(w,b) but defined instead over the actual solution. L_P(w,b) being convex, it can be readily seen that \bar{L}_P(w,b) <= L_P(w,b) and L_P'(w,b) <= L_P(w,b) for all w in R^H and b in R. As η_k is positive in every iteration, we need to show that

  ||w^{k+1} - w^k||^2 + [L_P(w^{k+1},b^{k+1}) - L_P'(w^{k+1},b^{k+1})] + [L_P(w^k,b^k) - \bar{L}_P(w^k,b^k)] > 0.

The terms L_P(w^{k+1},b^{k+1}) - L_P'(w^{k+1},b^{k+1}) and L_P(w^k,b^k) - \bar{L}_P(w^k,b^k) are equal to or greater than zero by construction. Moreover, ||w^{k+1} - w^k||^2 >= 0 and it is only zero if w^{k+1} = w^k; therefore, if we are not at the solution, ∇L_P(z_{k+1})^T p_k > ∇L_P(z_k)^T p_k.

Finally, we need to prove that the limit solution reached by the IRWLS procedure corresponds to the SVM solution. The IRWLS procedure stops when w^s = w^k and b^s = b^k. If we replace them in (4), and note that in that case a_i (1 - y_i(φ^T(x_i) w^s + b^s)) = a_i u_i^s = C dL/du|_{u_i^s}, we are led to:

  w^s - C Σ_{i=1}^n (dL/du|_{u_i^s}) y_i φ(x_i) = 0
  -C Σ_{i=1}^n (dL/du|_{u_i^s}) y_i = 0                              (7)

which is equal to (2); consequently, the IRWLS algorithm stops when it has reached the SVM solution. To prove the sufficient condition, we need to show that if w^k = w* and b^k = b* then the IRWLS has stopped. Suppose it has not; then we can find w^s different from w* and b^s different from b* such that L_P''(w*,b*) > L_P''(w^s,b^s), and the strictly decreasing property leads to L_P(w*,b*) > L_P(w^s,b^s), which is a contradiction because w* and b* give the minimum of L_P(w,b). We have just proven that if the IRWLS has stopped we are at the SVM solution, and that if we are at the SVM solution the IRWLS has stopped, which ends the proof of convergence.
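In practice, the step-size restriction used in the proof, η_k = min_{i in S} u_i^k/(u_i^k - u_i^s), can be computed directly from the old and new margins. A minimal sketch under the update convention z_{k+1} = z_k + η_k (z^s - z_k) used above (the function name is ours):

import numpy as np

def eta_max(u_old, u_new):
    # Largest eta in (0, 1] keeping u_i^{k+1} <= 0 for every sample with u_i^k < 0 and u_i^s > 0.
    u_old = np.asarray(u_old, dtype=float)
    u_new = np.asarray(u_new, dtype=float)
    S = (u_old < 0.0) & (u_new > 0.0)
    if not np.any(S):
        return 1.0
    return float(np.min(u_old[S] / (u_old[S] - u_new[S])))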
III. NOVEL QUADRATIC APPROXIMATIONS

In the light of the proof of convergence, two modifications can be proposed to speed up the convergence of the IRWLS. The convergence speed depends on how accurate the approximation to the SVM loss function is; therefore, if we are able to propose tighter approximations, we will converge faster to the SVM solution than the regular IRWLS procedure does.

Our first proposal, L_1'', is built to be always greater than or equal to C L, with L as in (1), so we can always take a full step of the IRWLS procedure (η = 1) unless a non-support-vector sample (u_i^k < 0) gets a u_i^s > 0. We use a quadratic approximation (L_1'' = a u^2/2 + t u + d) to still rely on the IRWLS procedure to obtain the SVM solution. To ensure that the IRWLS procedure stops when it has reached the SVM solution, we need to enforce that L_1''(u_i^k) = C L(u_i^k) and that dL_1''/du|_{u_i^k} = C dL/du|_{u_i^k}. (To ensure the equality between (2) and (7) we need L to be differentiable, which justifies the modification introduced in (1).)

Matching the value and the slope of C L at u_i^k gives, for u_i^k >= 1/K,

  (a_i/2)(u_i^k)^2 + t_i u_i^k + d_i = C u_i^k - C/(2K)
  a_i u_i^k + t_i = C                                                (8)

and, for 0 <= u_i^k < 1/K,

  (a_i/2)(u_i^k)^2 + t_i u_i^k + d_i = C K (u_i^k)^2 / 2
  a_i u_i^k + t_i = C K u_i^k                                        (9)

For the case in which u_i^k is in [0, 1/K), we take a_i = CK, t_i = 0 and d_i = 0; for u_i^k >= 1/K, (8) gives t_i = C - a_i u_i^k and d_i = a_i (u_i^k)^2/2 - C/(2K). Now we need to find a value of a_i that ensures L_1'' >= C L for u_i^k >= 1/K (for u_i^k in [0, 1/K) the previous conditions are already sufficient). This can easily be done by finding a u_0 such that L_1''(u_0) = 0 and dL_1''/du|_{u_0} = 0; then L_1'' is greater than or equal to C L for any u. To find u_0 and a_i, we need to solve:

  (a_i/2) u_0^2 + (C - a_i u_i^k) u_0 + a_i (u_i^k)^2/2 - C/(2K) = 0    (10)
  a_i u_0 + (C - a_i u_i^k) = 0                                         (11)

giving a_i = CK/(2K u_i^k - 1) and u_0 = u_i^k - C/a_i = 1/K - u_i^k. It can be readily seen that u_0 is less than or equal to zero for u_i^k >= 1/K. We can now define the coefficients of the L_1'' approximation as follows:

  a_i = 0,                      u_i^k < 0
        CK,                     0 <= u_i^k < 1/K
        CK/(2K u_i^k - 1),      u_i^k >= 1/K

  t_i = 0,                                  u_i^k < 1/K
        C(K u_i^k - 1)/(2K u_i^k - 1),      u_i^k >= 1/K

  d_i = 0,                                        u_i^k < 1/K
        C(K u_i^k - 1)^2/(2K(2K u_i^k - 1)),      u_i^k >= 1/K

where we have also included the case in which u_i^k < 0, and we indicate with the subscript i that each sample has its own approximation.

Now we construct a more accurate quadratic approximation, L_2'' = a u^2/2 + t u + d, which allows an even faster convergence. To get a better approximation around the actual value u_i^k, we do not force L_2'' to be equal to or greater than C L for every u. In this case, we might have to select an η < 1 in every iteration, but if that guarantees a faster convergence, it might be worth paying the computational price. The conditions in (8) and (9) still need to hold to ensure the stopping conditions. To get a tighter approximation, we allow L_2'' to become negative for u < 0 when u_i^k > 1/K. Using the previous approximations (L'' or L_1''), we only need to test for an η less than one if there is some non-support vector that presents a positive error (u_i^k < 0 and u_i^s > 0); with this loss function we also need to check the other way around, i.e. u_i^k > 1/K and u_i^s < 0. The condition we use to set a_i is L_2''(0) = 0 for u_i^k > 1/K, therefore d_i = 0. In this case the coefficients of L_2'' are:

  a_i = 0,                    u_i^k < 0
        CK,                   0 <= u_i^k < 1/K
        C/(K (u_i^k)^2),      u_i^k >= 1/K

  t_i = 0,                    u_i^k < 1/K
        C - C/(K u_i^k),      u_i^k >= 1/K

Fig. 2. Four curves: the solid one represents the SVM loss function in (1); the dash-dotted line represents L''; the dashed and dotted lines represent, respectively, L_1'' and L_2''. The curves have been computed for a small value of K; usually K would be much higher and, in that case, L_2'' would be indistinguishable from a straight line.

We have plotted in Figure 2 the approximations to the SVM loss function obtained using L_1'' and L_2'', together with the approximation proposed by the standard IRWLS procedure, L''. One can notice that the proposed approximation L_1'' is tighter, and that there is no other quadratic approximation greater than C L that is more accurate. The approximation provided by L_2'' is very tight around the value u_i^k. Its major drawback is that we need to compute η in almost every iteration, whereas with the other approximations it is very rare that a sample that was discarded as a support vector becomes one again.

The IRWLS procedure works as the one presented in the previous section. The only needed modification is to consider a nonzero t_i, which is added to the independent term in the linear system of equations. For the three approximations, we need to solve

  L_P''(w,b) = (1/2)||w||^2 + Σ_{i=1}^n L_l''(u_i),    l = 0, 1 or 2,

to get w^s and b^s. This can be solved by equating to zero its partial derivatives:

  [ Φ^T D_a Φ + I    Φ^T a ] [ w ]   [ Φ^T D_{a+t} y ]
  [ a^T Φ            a^T 1 ] [ b ] = [ (a + t)^T y   ]

where we have done the same algebraic transformations we did to transform (4) into (5), D_{a+t} is the diagonal matrix with entries a_i + t_i, and we have defined t = [t_1,...,t_n]^T.
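As a summary of the three approximations, the per-sample coefficients (a_i, t_i) can be gathered in one helper. This is an illustrative sketch (the name and the vectorized layout are ours); the constant terms d_i are omitted because they do not enter the linear system:

import numpy as np

def irwls_coefficients(u_k, C, K, variant="standard"):
    # Per-sample (a_i, t_i) for the standard approximation L'', for L''_1 or for L''_2.
    u_k = np.asarray(u_k, dtype=float)
    a = np.zeros_like(u_k)
    t = np.zeros_like(u_k)
    mid = (u_k >= 0.0) & (u_k < 1.0 / K)
    hi = u_k >= 1.0 / K
    a[mid] = C * K                                   # quadratic zone: common to the three variants
    if variant == "standard":                        # Section II: a_i = C/u_i^k on the linear zone
        a[hi] = C / u_k[hi]
    elif variant == "L1":                            # upper bound of the loss: a_i = CK/(2K u_i^k - 1)
        a[hi] = C * K / (2.0 * K * u_k[hi] - 1.0)
        t[hi] = C - a[hi] * u_k[hi]
    elif variant == "L2":                            # tight around u_i^k, with L''_2(0) = 0
        a[hi] = C / (K * u_k[hi] ** 2)
        t[hi] = C - a[hi] * u_k[hi]
    return a, t                                      # a_i = t_i = 0 for u_i^k < 0 in every variant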

This system can be solved as well using kernels, leading to

  [ H + D_a^{-1}   y ] [ α ]   [ 1 + t_a ]
  [ y^T            0 ] [ b ] = [ 0       ]                           (12)

where t_a = [t_1/a_1, ..., t_n/a_n]^T. The Iterative Re-Weighted Least Squares (IRWLS) procedure with the proposed loss functions can be summarized in the following steps:

1) Initialization: set k = 0, α_i^0 = 0, b^0 = 0 and u_i^0 = 1, i = 1,...,n.
2) Solve (12) to obtain α^s and b^s.
3) Compute u_i^s and construct S. If S is empty, set α^{k+1} = α^s and b^{k+1} = b^s and go to 5.
4) Compute η_k = arg min_{η in S_η} L_P((1-η)α^k + η α^s, (1-η)b^k + η b^s), and set α^{k+1} = (1-η_k)α^k + η_k α^s and b^{k+1} = (1-η_k)b^k + η_k b^s.
5) Set k = k + 1 and go to 2 until convergence.

This algorithm can be used for the three proposed approximations with minor modifications. For the standard IRWLS approximation, introduced in Section II, in the second step we need to solve (6) instead of (12), i.e. set t_a = 0. The set S = {i | u_i^k < 0 and u_i^s > 0} for L'' and L_1'', while for L_2'' it is equal to S = {i | (u_i^k < 0 and u_i^s > 0) or (u_i^k > 1/K and u_i^s < 0)}. Finally, the set of candidate step sizes is

  S_η = { u_i^k/(u_i^k - u_i^s) | i in S }.

The minimization in the fourth step can be carried out very easily, because we are minimizing a convex functional over a convex combination indexed by a finite set: we only need to test the value of L_P for the different values of η in S_η. Furthermore, we do not need to evaluate every value of η; we just need to start from the smallest (largest) η and continue evaluating L_P until a minimum is found, which is optimal due to the convexity of L_P.
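Because S_η is finite, the fourth step reduces to evaluating L_P at a few breakpoints. A minimal sketch of that search, assuming S is non-empty and that L_P is supplied as a callable of η (the names are ours):

import numpy as np

def line_search(u_old, u_new, S, objective):
    # Step 4: evaluate the convex functional L_P at the breakpoints u_i^k/(u_i^k - u_i^s), i in S,
    # from the smallest upwards, and stop as soon as the value starts to increase.
    etas = np.sort(u_old[S] / (u_old[S] - u_new[S]))
    best_eta, best_val = etas[0], objective(etas[0])
    for eta in etas[1:]:
        val = objective(eta)
        if val >= best_val:          # the first rise marks the minimum of a convex function
            break
        best_eta, best_val = eta, val
    return best_eta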
IV. EXPERIMENTS

In this section we test the proposed new loss functions for the IRWLS procedure against the standard loss function used by it, over 6 different binary classification problems. We have taken the data sets Ringnorm, Banana, Twonorm, Breast-Cancer, German and Thyroid from G. Rätsch's web page (http://mlg.anu.edu.au/~raetsch/); they are normalized to present zero mean and unit standard deviation. In Table I we have summarized the data sets' most relevant features and the training parameters, which have been chosen to minimize the test error over a validation set. We have solved the IRWLS with the 3 loss functions over the proposed data sets, carrying out 20 different simulations using the first 20 training/testing files provided by G. Rätsch (he has created training and testing files from the available samples for each problem). We report the training time and the number of iterations of the IRWLS procedure, respectively, in Tables II and III.

TABLE I. Features of the used data sets (number of patterns and input dimension) together with the parameters C and σ used for training the SVM with an RBF kernel, for Banana, Breast-Cancer, German, Ringnorm, Thyroid and Twonorm.

TABLE II. Mean computing time and standard deviation over the 20 trials, for each data set and each loss function approximation (L'', L_1'', L_2'').

TABLE III. Mean number of iterations and standard deviation over the 20 trials, for each data set and each loss function approximation (L'', L_1'', L_2'').

The first result that strikes from these tables is that the two proposed approximations are better in every case than the standard IRWLS-SVM. In some experiments using L_1'' is best and in others L_2'' provides the lowest runtime complexity. One can also notice that, for a similar number of iterations, using L_1'' is better than using L_2''; this is due to the minimization in the fourth step of the IRWLS procedure, which has to be carried out more frequently when we use L_2''. Also, we can notice that using L_2'' is significantly better for the two data sets with the lowest input dimension. This result can be justified because we then have more data points per dimension and the gradient of the SVM is more accurate; as the second novel approximation is basically descending along the SVM gradient, it is able to converge in fewer steps. When there are more dimensions this gradient is not so reliable, and using L_2'' will not provide a great improvement, or even using L_1'' can provide faster convergence. We have finally plotted in Figure 3 the value of L_P(w^k, b^k) - L_P(w*, b*) at each iteration of the IRWLS for one of the trials of the banana data set.

In this plot we can see how the change of the loss function increases the speed of convergence towards the SVM solution. We have also plotted in Figure 4 the value of η_k in each iteration. It can be seen that the loss function L_2'' needs to compute η in almost all the iterations, while the other two approximations seldom need to compute the value of η, which explains why, for a similar number of iterations, using L_1'' provides a faster convergence than using L_2''.

Fig. 3. L_P(w^k, b^k) - L_P(w*, b*) versus the number of iterations for the used approximations, for the tenth trial of the banana data set.

Fig. 4. The value of η_k versus the number of iterations for L'' (solid), L_1'' (dashed), and L_2'' (dash-dotted), for the tenth trial of the banana data set.

V. DISCUSSION

In this paper, we have exploited the proof of convergence of a known algorithm to improve its speed of convergence, providing two new approximations that are better than the previous one. Neither of the proposed approximations seems superior to the other. Probably the best option would be a mixed strategy: use L_1'' for the first iterations, so that we seldom need to compute η, and, once the values of the u_i are no longer changing significantly, change to L_2'', which will give a faster convergence because it is a tighter approximation to the SVM loss function around u_i^k. The validity of this combination is left as future work.

Another relevant issue we have not addressed in this paper is SVM training when the kernel matrix cannot be stored in memory. In that case, one needs to resort to a chunking scheme, as proposed in [11], of which the most widely used implementations are SVM-light [12] and SMO [13]. We have already compared the standard IRWLS with SVM-light in [6] and shown that the IRWLS-based chunking scheme was significantly faster; therefore, we can expect even larger improvements if these two approximations were used. But it is also important to notice that, when solving large-scale SVMs, the major computational burden is due to the computation of the kernel matrix. In that case, it is more relevant to decide which samples should be used in each iteration of the chunking scheme, to reduce the number of kernel computations, than to improve the actual solver. Therefore, the proposed modifications will significantly improve the runtime complexity when the kernel matrix can be computed and stored in memory, or for medium-scale problems, in which the solver for each chunk takes most of the computational burden of the whole learning procedure.

ACKNOWLEDGEMENTS

This work has been partially supported by grants CAM 7T/6/23 and CICYT TIC. Fernando Pérez-Cruz is supported by a Spanish Ministry of Education postdoctoral fellowship.

REFERENCES

[1] V. N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[2] B. Schölkopf and A. Smola, Learning with Kernels, M.I.T. Press, 2002.
[3] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.
[4] F. Pérez-Cruz, A. Navia-Vázquez, J. L. Rojo-Álvarez, and A. Artés-Rodríguez, "A new training algorithm for support vector machines," in Proceedings of the Fifth Bayona Workshop on Emerging Technologies in Telecommunications, Baiona, Spain, Sept. 1999.
[5] F. Pérez-Cruz, A. Navia-Vázquez, P. L. Alarcón-Diana, and A. Artés-Rodríguez, "SVC-based equalizer for burst TDMA transmissions," Signal Processing, vol. 81, no. 8, Aug. 2001.
[6] F. Pérez-Cruz, P. L. Alarcón-Diana, A. Navia-Vázquez, and A. Artés-Rodríguez, "Fast training of support vector classifiers," in Advances in Neural Information Processing Systems 13, M.I.T. Press, 2001.
[7] O. L. Mangasarian and D. R. Musicant, "Lagrangian support vector machines," Journal of Machine Learning Research, vol. 1, pp. 161-177, 2001.
[8] J. A. K. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Processing Letters, vol. 9, no. 3, 1999.
[9] F. Pérez-Cruz, C. Bousoño-Calzón, and A. Artés-Rodríguez, "Convergence of the IRWLS procedure to the support vector machine solution," Neural Computation, submitted.
[10] J. Nocedal and S. J. Wright, Numerical Optimization, Springer, 1999.
[11] E. Osuna and F. Girosi, "Reducing run-time complexity in SVMs," in Proceedings of the 14th International Conference on Pattern Recognition, Brisbane, Australia, Aug. 1998.
[12] T. Joachims, "Making large scale SVM learning practical," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., M.I.T. Press, 1998.
[13] J. C. Platt, "Sequential minimal optimization: A fast algorithm for training support vector machines," in Advances in Kernel Methods - Support Vector Learning, B. Schölkopf, C. J. C. Burges, and A. J. Smola, Eds., M.I.T. Press, 1999.
