Sequential Minimal Optimization for SVM with Pinball Loss


Xiaolin Huang (a,*), Lei Shi (b), Johan A.K. Suykens (a)

(a) KU Leuven, Department of Electrical Engineering (ESAT-STADIUS), B-3001 Leuven, Belgium
(b) School of Mathematical Sciences, Fudan University, Shanghai, P.R. China

Abstract

To pursue insensitivity to feature noise and stability under re-sampling, a new type of support vector machine (SVM) has been established by replacing the hinge loss in the classical SVM with the pinball loss; it is hence called pin-SVM. Although a different loss function is used, pin-SVM has a structure similar to that of the classical SVM. Specifically, the dual problem of pin-SVM is a quadratic programming problem with box constraints, for which the sequential minimal optimization (SMO) technique is applicable. In this paper, we establish SMO algorithms for pin-SVM and its sparse version. Numerical experiments on real-life data sets illustrate both the good performance of pin-SVMs and the effectiveness of the established SMO methods.

Keywords: support vector machine, pinball loss, sequential minimal optimization

This work was supported by: EU: the research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC AdG A-DATADRIVE-B (290923); this paper reflects only the authors' views, and the Union is not liable for any use that may be made of the contained information. Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants. Flemish Government: FWO projects G.0377.12 (Structured systems) and G.0884.11N (Tensor based data similarity); PhD/Postdoc grants. IWT: SBO POM (100031); PhD/Postdoc grants. iMinds Medical Information Technologies SBO 2014. Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, Dynamical systems, control and optimization, 2012-2017). L. Shi is also supported by the National Natural Science Foundation of China (11201079) and the Fundamental Research Funds for the Central Universities of China. Johan Suykens is a professor at KU Leuven, Belgium. (*) Corresponding author. Email addresses: huangxl06@mails.tsinghua.edu.cn (Xiaolin Huang), leishi@fudan.edu.cn (Lei Shi), johan.suykens@esat.kuleuven.be (Johan A.K. Suykens).

1. Introduction

Since being proposed in [1], [2], the support vector machine (SVM) has been widely applied and well studied, because of its fundamental statistical properties and good generalization capability. The basic idea of SVM is to maximize the margin between two classes by minimizing the regularization term. The margin is classically related to the closest points of the two sets, since the hinge loss is minimized. For a given sample set z = {x_i, y_i}_{i=1}^{m}, where x_i ∈ R^n and y_i ∈ {-1, +1}, the SVM with the hinge loss (C-SVM) in the primal space takes the following form,

    min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^{m} L_hinge(1 - y_i(w^T φ(x_i) + b)),    (1)

where φ(x) is a feature mapping, L_hinge(u) = max{0, u} is the hinge loss, and C is the trade-off parameter between the margin width and the misclassification loss. Since the distance between the closest points is easily affected by noise on the features x, the classifier trained by C-SVM (1) is sensitive to feature noise and unstable under re-sampling. This phenomenon has been observed by many researchers, and several techniques have been designed to address it; see, e.g., [3]-[7]. An attractive method for enhancing the stability to feature noise is to replace the closest-distance measurement by a quantile distance. However, maximizing the quantile distance is non-convex. The well-known ν-support vector machine (ν-SVM, [8]) can be regarded as a convex approach to maximizing the quantile distance and has been successfully applied. In ν-SVM, the margin between the surfaces {x : y f(x) = ρ} is maximized.
Minimizing the hinge loss together with an additional term -νρ pushes ρ to the quantile value of y_i f(x_i), and the quantile level is controlled by ν. Recently, we established a new convex method in [9] by extending the hinge loss in C-SVM to the pinball loss. The pinball loss L_τ(u) is defined as

    L_τ(u) = u,     if u ≥ 0,
           = -τu,   if u < 0,

which can be regarded as a generalized l1 loss. In particular, when τ = 0, the pinball loss L_τ(u) reduces to the hinge loss. When a positive τ is used, minimizing the pinball loss yields a quantile value. This link has been well studied in quantile regression; see, e.g., [10], [11]. Motivated by this link, the pinball loss with a positive τ value was applied to classification tasks, and the related classification method can be formulated as

    min_{w,b} (1/2)||w||^2 + C Σ_{i=1}^{m} L_τ(1 - y_i(w^T φ(x_i) + b)),    (2)

which is called a support vector machine with the pinball loss (pin-SVM, [9]).
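As a small illustration of the loss functions above (not from the paper; the function names are ours), a NumPy sketch of the hinge loss and the pinball loss evaluated at u = 1 - y f(x):

```python
import numpy as np

def hinge_loss(u):
    """Hinge loss L_hinge(u) = max{0, u}, used in C-SVM (1)."""
    return np.maximum(0.0, u)

def pinball_loss(u, tau):
    """Pinball loss L_tau(u) = u for u >= 0 and -tau*u for u < 0, used in pin-SVM (2).

    With tau = 0 it reduces to the hinge loss; a positive tau also
    penalizes correctly classified points (u < 0)."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0.0, u, -tau * u)

# Both losses are evaluated at u_i = 1 - y_i * f(x_i), e.g.
# pinball_loss(1 - y * f_x, tau=0.5).
```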

Unlike ν-SVM, pin-SVM pushes the surfaces that define the margin to quantile positions by also penalizing correctly classified sampling points. In classification tasks, the pinball loss L_τ has been proved to be calibrated, i.e., the minimizer of the pinball loss has the same sign as Prob{y = +1 | x} - Prob{y = -1 | x}. The preliminary experiments reported in [9] illustrate the stability of pin-SVM to feature noise. A model called sparse pin-SVM has been established for enhancing sparseness. The sparsity is obtained by introducing an ε-zone into the pinball loss, which results in the pinball loss with an ε-insensitive zone, denoted by L_τ^ε(u):

    L_τ^ε(u) = u - ε,          if u > ε,
             = 0,              if -ε/τ ≤ u ≤ ε,    (3)
             = -τ(u + ε/τ),    if u < -ε/τ.

When a training point falls into the interval [-ε/τ, ε], the corresponding dual variable is zero. In Fig. 1, we plot L_τ^ε(u) for several τ and ε values. When ε = 0, L_τ^ε(u) reduces to the pinball loss. Furthermore, if τ = 0, it reduces to the hinge loss.

Figure 1: Plots of the pinball loss with an ε-insensitive zone. τ = 0, ε = 0 corresponds to the hinge loss and is displayed by the solid line. When ε = 0, L_τ^ε(u) reduces to the pinball loss, as shown by the dashed lines. The dotted line gives the case τ = 0.3, ε = 0.2.

With properly selected parameters, pin-SVMs can perform better than C-SVM. However, pin-SVMs currently lack fast training algorithms, which is the target of this paper. Generally, we will train pin-SVMs in the dual space by sequential minimal optimization (SMO). SMO is one of the most popular methods for solving SVMs in the dual space. It is a decomposition method that always uses the smallest possible working set, which contains two dual variables and can be updated very efficiently. For C-SVM, the corresponding SMO algorithms can be found in [12]-[17]. The convergence behavior of SMO has also been well studied in [18]-[22]. In the following, we first investigate the dual problem of pin-SVM and establish an SMO method in Section 2. Section 3 gives the SMO algorithm for sparse pin-SVM. After that, we use the established SMO algorithms to train pin-SVMs on some real-life problems in Section 4. The numerical experiments confirm the good properties of pin-SVM trained with the proposed methods, which are promising tools in many applications, as summarized in Section 5.

2. Sequential Minimal Optimization for pin-SVM

2.1. Dual problem of pin-SVM

The dual problem of pin-SVM has been discussed in [9]. In the following, we first introduce the dual problem and then investigate the problem structure. In the primal space, pin-SVM (2) can be written as the following constrained quadratic programming (QP) problem,

    min_{w,b,ξ} (1/2) w^T w + Σ_{i=1}^{m} C_i ξ_i
    s.t. y_i [w^T φ(x_i) + b] ≥ 1 - ξ_i,          i = 1,...,m,    (4)
         y_i [w^T φ(x_i) + b] ≤ 1 + (1/τ) ξ_i,    i = 1,...,m,

where C_i could be different for different observations. The value of C_i is the weight on the loss related to (x_i, y_i), and one can take many considerations into account when setting it. For example, if (x_i, y_i) is an outlier or is heavily noise-polluted, one should choose a small C_i. One noticeable situation is that of unbalanced problems, for which the numbers of positive and negative labels differ. In this case, we prefer the following typical setting,

    C_i = C_0                                         if y_i = +1,
    C_i = (#{j : y_j = +1} / #{j : y_j = -1}) C_0     if y_i = -1,

where C_0 > 0 is a user-defined constant. In this paper, we always use this setting, which gives equal weights to both classes. The algorithms proposed in the rest of the paper also work for other parameter settings, and one can choose suitable C_i according to the application and prior knowledge.
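To illustrate the class-balancing choice of C_i described above, here is a minimal sketch (the helper name is ours) that assigns the per-sample weights from a label vector y ∈ {-1, +1}^m:

```python
import numpy as np

def per_sample_weights(y, C0):
    """Class-balanced weights: C_i = C0 for y_i = +1 and
    C_i = (#{y_j = +1} / #{y_j = -1}) * C0 for y_i = -1,
    so that both classes carry the same total weight."""
    y = np.asarray(y)
    n_pos = np.count_nonzero(y == 1)
    n_neg = np.count_nonzero(y == -1)
    C = np.full(y.shape, float(C0))
    C[y == -1] = (n_pos / n_neg) * C0
    return C
```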
We introduce the Lagrange multipliers α_i, β_i ≥ 0, which correspond to the constraints in (4). These variables should satisfy the following complementary slackness conditions,

    α_i (1 - ξ_i - y_i [w^T φ(x_i) + b]) = 0,          i = 1, 2,...,m,    (5)
    β_i (y_i [w^T φ(x_i) + b] - 1 - (1/τ) ξ_i) = 0,    i = 1, 2,...,m.

Considering the Lagrangian of (4) and the KKT conditions, we get the following dual problem for pin-SVM,

    min_{α,β} (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} (α_i - β_i) y_i K_ij y_j (α_j - β_j) - Σ_{i=1}^{m} (α_i - β_i)
    s.t. Σ_{i=1}^{m} y_i (α_i - β_i) = 0,                       (6)
         α_i + (1/τ) β_i = C_i,       i = 1, 2,...,m,
         α_i ≥ 0, β_i ≥ 0,            i = 1, 2,...,m,

where K corresponds to a positive definite kernel with K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j).
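For concreteness, a small sketch of the kernel matrix K and of the matrix Q_ij = y_i K_ij y_j that appears in the dual objectives; the RBF parameterization below is one common choice and, like the helper names, is our assumption rather than something fixed by the paper:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)), a positive definite RBF kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def dual_hessian(K, y):
    """Q_ij = y_i * K_ij * y_j, the Hessian of the dual objectives (6), (7) and (14)."""
    y = np.asarray(y, dtype=float)
    return (y[:, None] * K) * y[None, :]
```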

After obtaining the solution of (6), we use the sign of the following function for classification:

    f(x) = Σ_{i=1}^{m} y_i (α_i - β_i) K(x, x_i) + b,

where b is computed according to the complementary slackness conditions, i.e., y_j f(x_j) = 1 for j ∈ {j : α_j ≠ 0, β_j ≠ 0}. We further introduce λ_i = α_i - β_i and eliminate the equality constraint α_i + (1/τ) β_i = C_i. Then an equivalent formulation of (6) can be posed as

    min_λ F(λ) = (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} λ_i y_i K_ij y_j λ_j - Σ_{i=1}^{m} λ_i
    s.t. Σ_{i=1}^{m} y_i λ_i = 0,                          (7)
         -τ C_i ≤ λ_i ≤ C_i,       i = 1, 2,...,m.

We again observe the relationship between pin-SVM and C-SVM in the dual space: pin-SVM with τ = 0 reduces to C-SVM. The optimization problem (7) is a quadratic program with box constraints. Therefore, we can update a part of the dual variables and keep the others unchanged, i.e., sequential minimal optimization (SMO, [12]-[17]) is applicable to train pin-SVM (7). The constraint -τ C_i ≤ λ_i ≤ C_i can be equivalently transformed into A_i ≤ y_i λ_i ≤ B_i, where

    A_i = -τ C_i if y_i = +1,   A_i = -C_i if y_i = -1,
    B_i = C_i    if y_i = +1,   B_i = τ C_i if y_i = -1.

For a given λ, the indices are divided into the following two sets,

    I_up(λ) = {i : y_i λ_i < B_i}   and   I_low(λ) = {i : y_i λ_i > A_i}.

The subscripts of the two sets imply that for a pair of observations i ∈ I_up(λ), j ∈ I_low(λ), one can always find a small positive scalar t such that increasing y_i λ_i by t and decreasing y_j λ_j by t keeps the solution feasible. Therefore, if λ is a minimizer, the following inequality should be met,

    -y_i g_i ≤ -y_j g_j,   where   g_i = y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1

stands for the derivative of the objective function of (7) with respect to λ_i. Otherwise, if -y_i g_i > -y_j g_j, we can update λ_i and λ_j to obtain a strict decrease of the objective value of (7). Since the above inequality holds for any i ∈ I_up(λ) and j ∈ I_low(λ), a necessary condition for λ to be optimal for (7) can be written as: Σ_{i=1}^{m} y_i λ_i = 0 and

    there exists ρ ∈ R such that   max_{i ∈ I_up(λ)} -y_i g_i ≤ ρ ≤ min_{j ∈ I_low(λ)} -y_j g_j.    (8)

The corresponding condition for C-SVM has been widely applied in the SMO technique, see, e.g., [20] and [14]. When τ varies, I_up and I_low are different.

2.2. Dual variable update

Sequential minimal optimization starts from an initial feasible solution of (7) and updates λ until (8) is satisfied. The basic idea of SMO is that we only update the dual variables in a working set and leave the other variables unchanged. The extreme case is that only two variables are involved in each iteration, for which an explicit update formula exists. Denote the current solution by λ^old. Without loss of generality, we assume that i ∈ I_up(λ^old), j ∈ I_low(λ^old) are the variables in the working set and that they violate the optimality condition (8), i.e.,

    -y_i g_i^old > -y_j g_j^old.    (9)

Denote by u_ij the vector whose i-th component is y_i, whose j-th component is -y_j, and whose other components are zero. Then searching along u_ij brings an improvement for (7). Specifically, λ^old + ζ u_ij with a sufficiently small ζ > 0 is still feasible for (7). Moreover,

    F(λ^old + ζ u_ij) - F(λ^old) = -ζ (-y_i g_i^old + y_j g_j^old) + (ζ^2 / 2)(K_ii + K_jj - 2 K_ij).    (10)

From this formulation and (9), we know that the objective function of (7) can be decreased strictly. The best ζ, which gives the largest decrease of the objective function, is the minimizer of the following problem,

    min_{ζ ≥ 0}  -ζ (-y_i g_i^old + y_j g_j^old) + (ζ^2 / 2)(K_ii + K_jj - 2 K_ij)
    s.t.  y_i λ_i^old + ζ ≤ B_i,   y_j λ_j^old - ζ ≥ A_j.

For this one-dimensional QP, the optimal solution can be given explicitly by

    ζ = min{ B_i - y_i λ_i^old,  y_j λ_j^old - A_j,  (-y_i g_i^old + y_j g_j^old) / (K_ii + K_jj - 2 K_ij) }.

Correspondingly, the dual variables are updated to λ_i^new = λ_i^old + ζ y_i and λ_j^new = λ_j^old - ζ y_j. At the same time, the gradient vector is updated to

    g_l^new = g_l^old + ζ y_l (K_il - K_jl),   l = 1, 2,...,m.
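The following sketch implements this two-variable update for a chosen pair (i, j). The array names (lam for λ, g for the gradient, A and B for the bounds) are ours, and the step is the clipped ζ derived above:

```python
def smo_pair_update(i, j, lam, g, y, K, A, B):
    """One SMO step for the pin-SVM dual (7): move lam along u_ij
    (i-th entry y_i, j-th entry -y_j) by the clipped optimal step zeta.
    lam, g, y, A, B are NumPy vectors; K is the m-by-m kernel matrix."""
    denom = K[i, i] + K[j, j] - 2.0 * K[i, j]
    denom = max(denom, 1e-12)                      # guard against a degenerate pair
    zeta = (-y[i] * g[i] + y[j] * g[j]) / denom    # unconstrained minimizer along u_ij
    zeta = min(zeta, B[i] - y[i] * lam[i],         # keep A_i <= y_i lam_i <= B_i
               y[j] * lam[j] - A[j])               # keep A_j <= y_j lam_j <= B_j
    lam[i] += zeta * y[i]
    lam[j] -= zeta * y[j]
    g += zeta * y * (K[:, i] - K[:, j])            # g_l += zeta * y_l * (K_il - K_jl)
    return zeta
```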

2.3. Working set selection and initial solution

Above we discussed the update process for pin-SVM when i ∈ I_up(λ^old), j ∈ I_low(λ^old) are chosen for the working set. Before establishing the SMO algorithm for pin-SVM, we first consider the working set selection and the generation of the initial solution. The objective function of pin-SVM (7) is the same as that of C-SVM. Thus, the strategies for selecting two dual variables for C-SVM are applicable to pin-SVM. The simplest selection is the maximal violating pair, which has been discussed in [20]. For the current solution λ^old, we choose i and j as

    i = arg max_{l ∈ I_up(λ^old)} -y_l g_l^old   and   j = arg min_{l ∈ I_low(λ^old)} -y_l g_l^old.    (11)

This strategy is essentially the greedy choice based on the first-order approximation of F(λ^old + ζ u_ij) - F(λ^old). One can also consider the second-order working set selection proposed by [13]. That method is based on the second-order expansion (10): the quadratic gain should be maximized under the linear constraints. To quickly and heuristically find a good direction, we ignore the constraints, which gives the maximal gain

    (-y_i g_i^old + y_j g_j^old)^2 / (2 (K_ii + K_jj - 2 K_ij)).    (12)

One can choose i, j by maximizing (12), but this needs a pairwise comparison. Instead, we first use (11) to find i and then choose only j according to (12), which simply requires an element-wise comparison. This is also the strategy utilized for C-SVM in LIBSVM [17].

For the initialization, we use λ_i = -τ C_i. Recalling the setting of C_i in Section 2.1, one can verify that λ_i = -τ C_i gives a feasible solution of (7). When τ = 0, the initial solution is λ_i = 0, which is commonly used for C-SVM. If we know the optimal solution for pin-SVM with τ_1, denoted by λ(τ_1), then we can make a good guess for pin-SVM with τ_2. To observe the link between λ(τ_1) and λ(τ_2), we illustrate a simple classification task (two moons) in Fig. 2, where the red crosses and the green stars correspond to observations in class +1 and class -1, respectively. We use pin-SVM (7) to train the classifier. In this example, the same radial basis function (RBF) kernel and the same regularization parameter, but different τ values, are used. The surfaces {x : f(x) = -1, +1} are displayed in Fig. 2. According to the complementary slackness conditions, we know that

    y_i f(x_i) > 1  ⟹  i ∈ S_- = {j : λ_j = -τ C_j},
    i ∈ S_0 = {j : -τ C_j < λ_j < C_j}  ⟹  y_i f(x_i) = 1,
    y_i f(x_i) < 1  ⟹  i ∈ S_+ = {j : λ_j = C_j}.

Figure 2: Sampling points and classification results of pin-SVM. Points in class +1 and -1 are shown by green stars and red crosses, respectively. The surfaces {x : f(x) = 1} (blue lines) and {x : f(x) = -1} (black lines) for τ = 0, 0.05, 0.1 are displayed by solid, dash-dotted, and dotted lines, respectively.

In other words, the surfaces {x : f(x) = ±1} partition the training set into three parts. Most of the dual variables take the value -τ C_i or C_i; the remaining data are located on {x : f(x) = +1} or {x : f(x) = -1}. From Fig. 2, we observe that many points are located in the same part for the different τ values. Fig. 2 also illustrates that, with increasing τ, the surfaces f(x) = ±1 move towards the decision boundary. This can be observed as well from the primal form (2), whose optimality condition can be written as the existence of η_i ∈ [-τ, 1] such that

    w_j - Σ_{i ∈ S_+} C_i y_i φ_j(x_i) + τ Σ_{i ∈ S_-} C_i y_i φ_j(x_i) - Σ_{i ∈ S_0} η_i C_i y_i φ_j(x_i) = 0,   for all j.

This condition implies that generally a larger τ results in more data falling into S_-. Therefore, if τ_1 > τ_2 and the difference is not big, it is with high probability that λ_i(τ_2) = -τ_2 C_i if λ_i(τ_1) = -τ_1 C_i. Following this discussion, we suggest Algorithm 1 for generating the initial solution. By the proposed procedure, we find a new feasible solution, which is heuristically suitable for τ_2.
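Combining the selection rules (11)-(12), the initialization λ_i = -τ C_i, and the two-variable step derived above (repeated inline here), a compact and unoptimized training loop might look as follows. This is only a sketch under our own naming; Algorithm 2 below is the paper's own statement of the procedure.

```python
import numpy as np

def train_pin_svm(K, y, C, tau, tol=1e-6, max_iter=100000):
    """Minimal SMO loop for the pin-SVM dual (7) with box -tau*C_i <= lam_i <= C_i.

    K: (m, m) kernel matrix; y: labels in {-1, +1}; C: per-sample weights C_i."""
    y = np.asarray(y)
    C = np.asarray(C, dtype=float)
    lam = -tau * C                                    # feasible initial point (see text)
    A = np.where(y == 1, -tau * C, -C)                # A_i <= y_i*lam_i <= B_i
    B = np.where(y == 1, C, tau * C)
    g = y * (K @ (y * lam)) - 1.0                     # g_i = y_i * sum_j y_j lam_j K_ij - 1
    for _ in range(max_iter):
        I_up = np.where(y * lam < B)[0]
        I_low = np.where(y * lam > A)[0]
        i = I_up[np.argmax(-y[I_up] * g[I_up])]       # maximal violating index, rule (11)
        m_up = -y[i] * g[i]
        M_low = np.min(-y[I_low] * g[I_low])
        if m_up - M_low < tol:                        # stopping test from condition (8)
            break
        viol = I_low[-y[I_low] * g[I_low] < m_up]     # partners that violate (8) with i
        d = K[i, i] + K[viol, viol] - 2.0 * K[i, viol] + 1e-12
        j = viol[np.argmax((y[i] * g[i] - y[viol] * g[viol]) ** 2 / (2.0 * d))]  # rule (12)
        # explicit two-variable update along u_ij, clipped to the box
        denom = K[i, i] + K[j, j] - 2.0 * K[i, j] + 1e-12
        zeta = min((-y[i] * g[i] + y[j] * g[j]) / denom,
                   B[i] - y[i] * lam[i], y[j] * lam[j] - A[j])
        lam[i] += zeta * y[i]
        lam[j] -= zeta * y[j]
        g += zeta * y * (K[:, i] - K[:, j])           # g_l += zeta * y_l * (K_il - K_jl)
    b = 0.5 * (m_up + M_low)                          # offset, as in Algorithm 2 below
    return lam, b
```

The decision value for a new point x is then Σ_i y_i λ_i K(x, x_i) + b, and its sign gives the predicted class.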
When tuning the parameter τ, we need to train pin-SVM for a series of τ values, for which the above procedure can be applied. We now give the SMO algorithm for pin-SVM (7) in Algorithm 2, where e is a pre-defined accuracy and is set to 10^-6 in this paper.

3. SMO for Sparse pin-SVM

Pin-SVM can be regarded as an extension of C-SVM via introducing flexibility on τ. Since quantile distances are considered, pin-SVM is insensitive to feature noise and has shown better classification accuracy than C-SVM. In pin-SVM (7), the dual variables are categorized into three types: lower bounded support vectors (λ_i = -τ C_i), free support vectors (-τ C_i < λ_i < C_i), and upper bounded support vectors (λ_i = C_i). When τ = 0, pin-SVM reduces to C-SVM. Correspondingly, the lower bounded support vectors are zero and C-SVM enjoys sparseness.

Algorithm 1: Initialization for pin-SVM with τ_2 from λ(τ_1)
  Set S_-(τ_1) := {i : λ_i(τ_1) = -τ_1 C_i}, S_+(τ_1) := {i : λ_i(τ_1) = C_i};
  Let λ_i := -τ_2 C_i for i ∈ S_-(τ_1) and λ_i := C_i for i ∈ S_+(τ_1);
  Calculate the violation v := Σ_{i=1}^{m} y_i λ_i;
  if τ_2 > τ_1 then
    repeat
      select i from {i : y_i = sgn(v)} ∩ S_+(τ_1);
      set λ_i := max{C_i - v, -τ_2 C_i};
      update v := max{0, v - (1 + τ_2) C_i};
    until v = 0;
  else
    repeat
      select i from {i : y_i = -sgn(v)} ∩ S_-(τ_1);
      set λ_i := min{-τ_2 C_i + v, C_i};
      update v := max{0, v - (1 + τ_2) C_i};
    until v = 0;
  end
  Return λ as the initial solution for pin-SVM with τ_2.

Algorithm 2: SMO for pin-SVM
  Set λ_i := -τ C_i or use Algorithm 1 to generate λ;
  Calculate g_i := y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1 and set
    A_i := -τ C_i if y_i = +1, -C_i if y_i = -1;  B_i := C_i if y_i = +1, τ C_i if y_i = -1;
  repeat
    I_up := {i : y_i λ_i < B_i}, I_low := {i : y_i λ_i > A_i};
    select i := arg max_{l ∈ I_up} -y_l g_l;
    select j := arg max_{l ∈ I_low} (y_i g_i - y_l g_l)^2 / (2 (K_ii + K_ll - 2 K_il));
    calculate the update length ζ := min{ B_i - y_i λ_i, y_j λ_j - A_j, (-y_i g_i + y_j g_j) / (K_ii + K_jj - 2 K_ij) };
    update λ_i := λ_i + y_i ζ and λ_j := λ_j - y_j ζ, and g_l := g_l + ζ y_l (K_il - K_jl), l = 1,...,m;
  until max_{i ∈ I_up} -y_i g_i - min_{j ∈ I_low} -y_j g_j < e;
  Calculate b := (1/2)(max_{i ∈ I_up} -y_i g_i + min_{j ∈ I_low} -y_j g_j).

To pursue sparseness for pin-SVM with a nonzero τ value, a loss function with an ε-insensitive zone is applied, and a sparse pin-SVM has been established in [9]. In the primal space, sparse pin-SVM can be posed as

    min_{w,b} (1/2)||w||^2 + Σ_{i=1}^{m} C_i L_τ^ε(1 - y_i(w^T φ(x_i) + b)),    (13)

where the pinball loss with an ε-insensitive zone, L_τ^ε(u), is defined in (3). The dual problem of (13) has been deduced in [9] and takes the following form,

    min_{λ,γ} (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} λ_i y_i K_ij y_j λ_j - Σ_{i=1}^{m} λ_i - ε Σ_{i=1}^{m} γ_i
    s.t. Σ_{i=1}^{m} y_i λ_i = 0,                                   (14)
         γ_i ≥ 0,                                  i = 1,...,m,
         -τ(C_i - γ_i) ≤ λ_i ≤ C_i - γ_i,          i = 1,...,m.

The possible range of the dual variable γ_i is 0 ≤ γ_i ≤ C_i. When γ_i takes the value C_i, the corresponding λ_i will be zero, which brings sparsity to pin-SVM. From the objective function of (14), one can see that a large ε will push γ_i close to C_i, i.e., more λ_i take the value zero. The last constraint in (14) can be viewed as a box constraint on λ_i, where the box depends on the other dual variable γ_i. Similarly to the discussion on pin-SVM (7), we can write -τ(C_i - γ_i) ≤ λ_i ≤ C_i - γ_i as A_i^γ ≤ y_i λ_i ≤ B_i^γ, where

    A_i^γ = -τ(C_i - γ_i) if y_i = +1,   A_i^γ = -(C_i - γ_i) if y_i = -1,
    B_i^γ = C_i - γ_i     if y_i = +1,   B_i^γ = τ(C_i - γ_i) if y_i = -1.

Then, for given γ and λ, we can define the following two sets,

    I_up^{λ,γ} = {i : y_i λ_i < B_i^γ or γ_i > 0}   and   I_low^{λ,γ} = {i : y_i λ_i > A_i^γ or γ_i > 0}.

Here γ_i > 0 guarantees that λ_i ± ζ is feasible for a sufficiently small scalar ζ. Then, necessary conditions for λ, γ to be optimal for (14) can be presented as follows: for a given γ, λ should satisfy

    max_{i ∈ I_up^{λ,γ}} -y_i g_i ≤ min_{j ∈ I_low^{λ,γ}} -y_j g_j   and   Σ_{i=1}^{m} y_i λ_i = 0;

for a given λ, γ should satisfy

    γ_i = min{ C_i + λ_i / τ, C_i - λ_i }.
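As a concrete rendering of the ε-insensitive loss (3) and of the closed-form γ above, a short sketch (our names; it assumes τ > 0, as required for the ε-zone):

```python
import numpy as np

def eps_pinball_loss(u, tau, eps):
    """L^eps_tau(u) from (3): u - eps for u > eps, 0 on [-eps/tau, eps],
    and -tau*(u + eps/tau) = -tau*u - eps for u < -eps/tau (tau > 0 assumed).
    With eps = 0 it reduces to the pinball loss."""
    u = np.asarray(u, dtype=float)
    return np.where(u > eps, u - eps,
                    np.where(u < -eps / tau, -tau * u - eps, 0.0))

def gamma_given_lambda(lam, C, tau):
    """Optimal gamma for a fixed lambda in the sparse dual (14):
    gamma_i = min{C_i + lam_i/tau, C_i - lam_i}.
    gamma_i = C_i forces lam_i = 0, which is the source of sparsity."""
    return np.minimum(C + lam / tau, C - lam)
```

For the initial point λ_i = -τ C_i used later, this formula gives γ_i = 0.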

Notice that in sparse pin-SVM (14) the gradient g_i is different from that in pin-SVM (7), since there is one additional degree of freedom on γ. Specifically, there are three situations. If λ_i = C_i - γ_i, then

    g_i = y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1 + ε.

If λ_i = -τ(C_i - γ_i), then we have

    g_i = y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1 - ε/τ.

Otherwise, i.e., for -τ(C_i - γ_i) < λ_i < C_i - γ_i, we have

    g_i = y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1.

The above conditions are given separately for λ and γ. In sparse pin-SVM (14), λ and γ are coupled in the constraints, hence these conditions are necessary but not sufficient. However, to pursue an efficient solution method for (14), we apply the above necessary conditions to choose two data points for a working set. The selected dual variables are then modified and the others are left unchanged. Similarly to pin-SVM, the working set for sparse pin-SVM (14) contains at least two data points. Suppose that i, j are selected. Then, to update λ_i, λ_j, γ_i, γ_j, we solve the following QP problem,

    min_{λ_i, λ_j, γ_i, γ_j} (1/2) K_ii λ_i^2 + λ_i y_i K_ij y_j λ_j + (1/2) K_jj λ_j^2
                             + λ_i y_i Σ_{l ≠ i,j} y_l λ_l K_il + λ_j y_j Σ_{l ≠ i,j} y_l λ_l K_jl
                             - λ_i - λ_j - ε γ_i - ε γ_j
    s.t. y_i λ_i + y_j λ_j = - Σ_{l ≠ i,j} y_l λ_l,                      (15)
         γ_i ≥ 0, γ_j ≥ 0,
         -τ(C_i - γ_i) ≤ λ_i ≤ C_i - γ_i,   -τ(C_j - γ_j) ≤ λ_j ≤ C_j - γ_j.

When γ_i, γ_j are fixed, (15) reduces to a two-dimensional QP with one equality constraint, which has an explicit solution; this is the case for pin-SVM (7). However, in sparse pin-SVM, γ_i, γ_j and λ_i, λ_j are coupled and there is no explicit solution. Hence, we have to solve (15) to update λ_i, λ_j, γ_i, γ_j at each iteration. Solving (15) decreases the objective of (14). We should choose a reasonable working set according to the gain obtained by solving (15). This gain is at least as large as the gain obtained when keeping γ_i, γ_j unchanged. For fixed γ_i, γ_j, the gain is given by (10), from which we can estimate the gain for (15) and then select the working set by the following rule:

    i = arg max_{l ∈ I_up^{λ,γ}} -y_l g_l,   j = arg max_{l ∈ I_low^{λ,γ}} (y_i g_i - y_l g_l)^2 / (2 (K_ii + K_ll - 2 K_il)).

This selection strategy is similar to that for pin-SVM, but now it depends on γ. The initial solution for pin-SVM, λ_i = -τ C_i, is also feasible for sparse pin-SVM (14). Correspondingly, the initial γ_i is set to γ_i = min{C_i + λ_i/τ, C_i - λ_i}, according to the necessary optimality condition. The sequential minimal optimization for sparse pin-SVM (14) is summarized in Algorithm 3.

Algorithm 3: SMO for sparse pin-SVM
  Set λ_i := -τ C_i and γ_i := min{C_i + λ_i/τ, C_i - λ_i};
  Calculate g_i := y_i Σ_{j=1}^{m} y_j λ_j K_ij - 1;
  A_i^γ := -τ(C_i - γ_i) if y_i = +1, -(C_i - γ_i) if y_i = -1;
  B_i^γ := C_i - γ_i if y_i = +1, τ(C_i - γ_i) if y_i = -1;
  repeat
    I_up^{λ,γ} := {i : y_i λ_i < B_i^γ or γ_i > 0};
    I_low^{λ,γ} := {i : y_i λ_i > A_i^γ or γ_i > 0};
    select i := arg max_{l ∈ I_up^{λ,γ}} -y_l g_l;
    select j := arg max_{l ∈ I_low^{λ,γ}} (y_i g_i - y_l g_l)^2 / (2 (K_ii + K_ll - 2 K_il));
    solve (15) to update λ_i, λ_j, γ_i, γ_j;
    update A^γ, B^γ, and g_l, l = 1,...,m;
  until max_{i ∈ I_up^{λ,γ}} -y_i g_i - min_{j ∈ I_low^{λ,γ}} -y_j g_j < e;
  Calculate b := (1/2)(max_{i ∈ I_up^{λ,γ}} -y_i g_i + min_{j ∈ I_low^{λ,γ}} -y_j g_j).

4. Numerical Experiments

In the sections above, we gave the SMO algorithms for training pin-SVM (7) and sparse pin-SVM (14). In the following, we evaluate their performance on real-life data sets. There are two aspects of interest. First, we test whether SMO is effective for training pin-SVMs. Second, with an effective training method, we can consider more experiments and support the theoretical analysis in [9]. The sparsity of sparse pin-SVM is also considered. The data in these experiments are loaded from the UCI Repository of Machine Learning Datasets [23] and the LIBSVM data sets [17]. For some of these data, the training and test sets are provided. Otherwise, we randomly select m observations to train the classifier and use the remaining ones for testing. The problem dimension n, the number of training data m, and the number of test data T are summarized in Table 1. For pin-SVM (7), we use the RBF kernel and apply Algorithm 2 to train classifiers with different τ values. As the data size grows, caching the kernel matrix becomes more demanding.
In our experiments, when m ≥ 5000, we calculate the element K_ij only when it is needed, which reduces the caching but costs more time.

Table 1: Dimension, Training Data and Test Data Size. For each data set, the table lists the input dimension n, the number of training data m, and the number of test data T. Data sets: Spect, Monk1, Monk2, Haberman, Statlog, Monk3, Ionosphere, Transfusion, Pima, Breast, Splice, Spambase, Guide, Magic, IJCNN, Cod-RNA.

To make a fair comparison, we use λ_i = -τ C_i as the initial solution. If the number of training data is less than 10000, 10-fold cross-validation is utilized to tune the regularization coefficient C_0 and the bandwidth σ of the RBF kernel. Otherwise, we set C_0 = 1 and tune σ only. The training and test process is repeated 10 times. The average accuracy on the test sets, the standard deviation, and the average computing time are then reported in Table 2.

Table 2: Test Accuracy and Average Training Time. For each data set of Table 1, the table reports the average test accuracy (with standard deviation) and the average training time of pin-SVM for τ = 0, 0.1, 0.3, and 0.5.

We also illustrate the scalability of the proposed SMO algorithm by plotting the training time for different training data sizes. In Fig. 3 we plot the training time for the data set IJCNN. Notice that there is a sudden change at m = 5000, due to the different kernel computation strategies.

Figure 3: Training time of Algorithm 2 (τ = 0.1) for IJCNN for different training data sizes m: (a) m < 5000; (b) m ≥ 5000.

Both Table 2 and Fig. 3 illustrate that the proposed SMO method can train pin-SVM effectively. For different τ values, the computational time is similar and is not monotonic with respect to τ. In our method, pin-SVM is trained in the dual space, which corresponds to a QP with box constraints -τ C_i ≤ λ_i ≤ C_i. One can observe that τ controls the size of the feasible set. In the two extreme cases, i.e., when the box is large enough or very small, optimal solutions can be obtained easily. Therefore, although a larger τ is generally related to more training time, the difference is not significant; in some applications, a larger τ even corresponds to less training time. Generally, the proposed SMO for pin-SVM is effective and takes training time similar to SMO for C-SVM.

With a properly selected τ, pin-SVM provides better classification accuracy than C-SVM, but the sparseness is lost. If the problem size is not too large and sparseness is not the main target, then finding a suitable τ is meaningful for improving the classification accuracy. Moreover, we can use sparse pin-SVM (14) to enhance the sparsity. In the following, we set τ = 0.1 and apply Algorithm 3 for several different ε values. The training and test process is similar to the previous experiment, except that the parameters for sparse pin-SVM are tuned based on pin-SVM, since Algorithm 3 costs more time than Algorithm 2. In practice, if the time budget is not strict, one can tune the parameters based on sparse pin-SVM and improve the performance further. The average classification accuracy, the number of support vectors (in brackets), and the training time are reported in Table 3, where the results of C-SVM are given as well for reference. Compared with pin-SVM (7), sparse pin-SVM (14) enhances the sparsity but takes more training time. In Algorithm 3, the update formulation involves a four-dimensional QP problem. Although it can be solved effectively, its computation time is larger than that of the explicit update formulation in Algorithm 2. Roughly, Algorithm 3 needs 10 times more training time than Algorithm 2.
In C-SVM, the points with y_i f(x_i) > 1 are related to zero dual variables, and so are the points with -ε/τ < 1 - y_i f(x_i) < ε in sparse pin-SVM. Thus, the results of C-SVM are generally more sparse. But when the feature noise is heavy, it is worth considering Algorithm 3 to train sparse pin-SVM.

Table 3: Test Accuracy, Number of Nonzero Dual Variables, and Training Time for Sparse pin-SVM (τ = 0.1). For each data set considered, the table reports the test accuracy, the number of support vectors (in brackets), and the training time of C-SVM and of sparse pin-SVM with ε = 0.05, 0.10, and 0.20.

5. Conclusion

In this paper, sequential minimal optimization has been established for the support vector machine with the pinball loss. Since pin-SVM has the same problem structure as C-SVM, the corresponding SMO is related to that for C-SVM. We investigated the details and implemented SMO for pin-SVM. The SMO algorithm for training sparse pin-SVM was given as well. The proposed algorithms were then evaluated in numerical experiments, showing the effectiveness of training pin-SVMs. The proposed SMO algorithms make pin-SVMs promising tools in real-life applications, especially when the data are corrupted by feature noise.

Acknowledgment

The authors would like to thank Prof. Chih-Jen Lin of National Taiwan University for encouraging us to establish the SMO algorithm for pin-SVM. The authors are grateful to the anonymous reviewers for helpful comments.

References

[1] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[2] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[3] X. Zhang. Using class-center vectors to build support vector machines. In Proceedings of the IEEE Signal Processing Society Workshop, pages 3-11. IEEE, 1999.
[4] J. Bi and T. Zhang. Support vector classification with input data uncertainty. In Advances in Neural Information Processing Systems, volume 17, page 161. MIT Press, 2004.
[5] G.R.G. Lanckriet, L.E. Ghaoui, C. Bhattacharyya, and M.I. Jordan. A robust minimax approach to classification. The Journal of Machine Learning Research, 3:555-582, 2002.
[6] P.K. Shivaswamy, C. Bhattacharyya, and A.J. Smola. Second order cone programming approaches for handling missing and uncertain data. The Journal of Machine Learning Research, 7:1283-1314, 2006.
[7] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. The Journal of Machine Learning Research, 10:1485-1510, 2009.
[8] B. Schölkopf, A.J. Smola, R.C. Williamson, and P.L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207-1245, 2000.
[9] X. Huang, L. Shi, and J.A.K. Suykens. Support vector machine classifier with pinball loss. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(5):984-997, 2014.
[10] R. Koenker. Quantile Regression. Cambridge University Press, 2005.
[11] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211-225, 2011.
[12] J.C. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, 1999.
[13] R.E. Fan, P.H. Chen, and C.J. Lin. Working set selection using second order information for training support vector machines. The Journal of Machine Learning Research, 6:1889-1918, 2005.
[14] L. Bottou and C.-J. Lin. Support vector machine solvers. In Large Scale Kernel Machines. MIT Press, 2007.
[15] Y. Torii and S. Abe. Decomposition techniques for training linear programming support vector machines. Neurocomputing, 72(4), 2009.
[16] J. Shawe-Taylor and S. Sun. A review of optimization methodologies in support vector machines. Neurocomputing, 74(17), 2011.
[17] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.
[18] C.C. Chang, C.W. Hsu, and C.J. Lin. The analysis of decomposition methods for support vector machines. IEEE Transactions on Neural Networks, 11(4):1003-1008, 2000.
[19] C.J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(6):1288-1298, 2001.
[20] S.S. Keerthi and E.G. Gilbert. Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46(1-3):351-360, 2002.
[21] D. Hush, P. Kelly, C. Scovel, and I. Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. The Journal of Machine Learning Research, 7:733-769, 2006.
[22] J. López and J.R. Dorronsoro. Simple proof of convergence of the SMO algorithm for different SVM variants. IEEE Transactions on Neural Networks and Learning Systems, 23(7):1142-1147, 2012.
[23] A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.


More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

CSE 252C: Computer Vision III

CSE 252C: Computer Vision III CSE 252C: Computer Vson III Lecturer: Serge Belonge Scrbe: Catherne Wah LECTURE 15 Kernel Machnes 15.1. Kernels We wll study two methods based on a specal knd of functon k(x, y) called a kernel: Kernel

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

17 Support Vector Machines

17 Support Vector Machines 17 We now dscuss an nfluental and effectve classfcaton algorthm called (SVMs). In addton to ther successes n many classfcaton problems, SVMs are responsble for ntroducng and/or popularzng several mportant

More information

Canonical transformations

Canonical transformations Canoncal transformatons November 23, 2014 Recall that we have defned a symplectc transformaton to be any lnear transformaton M A B leavng the symplectc form nvarant, Ω AB M A CM B DΩ CD Coordnate transformatons,

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )

C4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z ) C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

Modified parallel multisplitting iterative methods for non-hermitian positive definite systems

Modified parallel multisplitting iterative methods for non-hermitian positive definite systems Adv Coput ath DOI 0.007/s0444-0-9262-8 odfed parallel ultsplttng teratve ethods for non-hertan postve defnte systes Chuan-Long Wang Guo-Yan eng Xue-Rong Yong Receved: Septeber 20 / Accepted: 4 Noveber

More information

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction ECONOMICS 5* -- NOTE (Summary) ECON 5* -- NOTE The Multple Classcal Lnear Regresson Model (CLRM): Specfcaton and Assumptons. Introducton CLRM stands for the Classcal Lnear Regresson Model. The CLRM s also

More information

Scattering by a perfectly conducting infinite cylinder

Scattering by a perfectly conducting infinite cylinder Scatterng by a perfectly conductng nfnte cylnder Reeber that ths s the full soluton everywhere. We are actually nterested n the scatterng n the far feld lt. We agan use the asyptotc relatonshp exp exp

More information