A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM


Ching-Pei Lee and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan

Keywords: Support vector machines, Multi-class classification, Squared hinge loss, L2 loss.

Abstract

Crammer and Singer's method is one of the most popular multi-class SVMs. It considers L1 loss (hinge loss) in a complicated optimization problem. In SVM, squared hinge loss (L2 loss) is a common alternative to L1 loss, but surprisingly we have not seen any paper studying details of Crammer and Singer's method using L2 loss. In this note, we conduct a thorough investigation. We show that the derivation is not trivial and has some subtle differences from the L1 case. Details provided in this work can be a useful reference for those who intend to use Crammer and Singer's method with L2 loss. They do not need a tedious process to derive everything by themselves. Further, we present some new results/discussion for both L1- and L2-loss formulations.

1 Introduction

Support Vector Machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995) were originally designed for binary classification. In recent years, many approaches have been proposed to extend SVM to handle multi-class classification problems; see, for example, a detailed comparison in Hsu and Lin (2002). Among these works, the method proposed by Crammer and Singer (2001, 2002) has been widely used. They extend the optimization problem of L1-loss (hinge-loss) SVM to a multi-class formulation. In binary classification, squared hinge loss (L2 loss) is a common alternative to L1 loss, but surprisingly we have not found any paper studying details of Crammer and Singer's method with L2 loss. This is inconvenient because, for example, we do not know what the dual problem is.¹ Although the dual problem of two-class SVM using L2 loss is well known, it cannot be directly extended to the multi-class case. In fact, the derivation is non-trivial and has some subtle differences from the L1 case. Also, the algorithm to solve the dual problem (for both kernel and linear situations) must be modified. We think there is a need to give all the details for future reference.

¹ Indeed, the dual problem has been provided in some works of structured SVM, where Crammer and Singer's multi-class SVM is a special case. However, their form can be simplified for multi-class SVM; see a detailed discussion in Section 2.2.

Then those who intend to use Crammer and Singer's method with L2 loss do not need a tedious procedure to derive everything by themselves. In addition to the main contribution of investigating L2-loss multi-class SVM, we present some new results for both L1- and L2-loss cases. First, we discuss the differences between the dual problem of Crammer and Singer's multi-class SVM and that of structured SVM, although we focus more on the comparison of the L2-loss formulation. Second, we give a simpler derivation for solving the sub-problem in the decomposition method to minimize the dual problem.

This paper is organized as follows. In Section 2, we introduce L2-loss multi-class SVM and derive its dual problem. We discuss the connection to structured SVM, which is a generalization of Crammer and Singer's multi-class SVM. Then in Section 3, we extend a decomposition method to solve the optimization problem. In particular, we obtain the sub-problem to be solved at each iteration. A procedure to find the solution of the sub-problem is given in Section 4. Our derivation and proof are simpler than Crammer and Singer's. In Section 5, we discuss some implementation issues and extensions. Experiments in Section 6 compare the performance of L1-loss and L2-loss multi-class SVM using both linear and nonlinear kernels. Section 7 then concludes this paper.

2 Formulation

Given a set of instance-label pairs (x_i, y_i), x_i ∈ R^n, y_i ∈ {1,...,k}, i = 1,...,l, Crammer and Singer (2002) proposed a multi-class SVM approach by solving the following optimization problem.

  min_{w_1,...,w_k, ξ}   (1/2) Σ_{m=1}^k w_m^T w_m + C Σ_{i=1}^l ξ_i
  subject to   w_{y_i}^T x_i − w_m^T x_i ≥ e_i^m − ξ_i,   i = 1,...,l, m = 1,...,k,   (1)

where

  e_i^m = 0 if y_i = m,  and  e_i^m = 1 if y_i ≠ m.   (2)

Note that if y_i = m, the constraint is the same as ξ_i ≥ 0. The decision function for predicting the label of an instance x is

  arg max_{m=1,...,k} w_m^T x.
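The decision function above is simply a per-class linear score followed by an arg max. The following minimal sketch of the prediction step is an illustration only; the array layout (rows of W holding w_1,...,w_k) is an assumption made here and not part of the paper.

    import numpy as np

    def predict(W, X):
        """Predict labels by arg max_m w_m^T x.

        W : array of shape (k, n); row m-1 holds w_m.
        X : array of shape (l, n); one instance per row.
        Returns predicted labels in {1, ..., k}.
        """
        scores = X @ W.T                 # scores[i, m-1] = w_m^T x_i
        return np.argmax(scores, axis=1) + 1

Training produces the rows of W, for example via Equation (7) below after solving the dual problem.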

The dual problem of (1) is

  min_α   (1/2) Σ_{i=1}^l Σ_{j=1}^l K_{i,j} Σ_{m=1}^k α_i^m α_j^m + Σ_{i=1}^l Σ_{m=1}^k α_i^m e_i^m
  subject to   Σ_{m=1}^k α_i^m = 0,   i = 1,...,l,   (3)
               α_i^m ≤ 0,   i = 1,...,l, m = 1,...,k, m ≠ y_i,   (4)
               α_i^{y_i} ≤ C,   i = 1,...,l,   (5)

where

  α = [α_1^1, ..., α_1^k, ..., α_l^1, ..., α_l^k]^T  and  K_{i,j} = x_i^T x_j.   (6)

Constraints (4) and (5) are often combined as

  α_i^m ≤ C_{y_i}^m,  where  C_{y_i}^m = 0 if y_i ≠ m,  and  C_{y_i}^m = C if y_i = m.

We separate them in order to compare with the dual problem of using L2 loss. After solving (3), one can compute the optimal w_m by

  w_m = Σ_{i=1}^l α_i^m x_i,   m = 1,...,k.   (7)

In this paper, we extend problem (1) to use L2 loss. By changing the loss term from Σ_i ξ_i to Σ_i ξ_i², the primal problem becomes

  min_{w_1,...,w_k, ξ}   (1/2) Σ_{m=1}^k w_m^T w_m + C Σ_{i=1}^l ξ_i²
  subject to   w_{y_i}^T x_i − w_m^T x_i ≥ e_i^m − ξ_i,   i = 1,...,l, m = 1,...,k.   (8)

The constraint w_{y_i}^T x_i − w_m^T x_i ≥ e_i^m − ξ_i when m = y_i can be removed because for L2 loss, ξ_i ≥ 0 holds at an optimum without this constraint. We keep it here in order to compare with the formulation of using L1 loss. We will derive the following dual problem.

  min_α   f(α)
  subject to   Σ_{m=1}^k α_i^m = 0,   i = 1,...,l,   (9)
               α_i^m ≤ 0,   i = 1,...,l, m = 1,...,k, m ≠ y_i,

where

  f(α) = (1/2) Σ_{i=1}^l Σ_{j=1}^l K_{i,j} Σ_{m=1}^k α_i^m α_j^m + Σ_{i=1}^l Σ_{m=1}^k α_i^m e_i^m + Σ_{i=1}^l (α_i^{y_i})² / (4C).
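To make the relation between the dual variables and the primal solution concrete, the sketch below evaluates (7) and the L2-loss dual objective f(α) of (9). The dense arrays alpha (l × k) and K (l × l) and the 0-based labels are assumptions introduced only for illustration.

    import numpy as np

    def primal_w(alpha, X):
        """w_m = sum_i alpha_i^m x_i, Equation (7); returns an array of shape (k, n)."""
        return alpha.T @ X

    def dual_objective_L2(alpha, K, y, C):
        """f(alpha) of problem (9).

        alpha : (l, k) array, alpha[i, m] = alpha_i^m.
        K     : (l, l) kernel matrix, K[i, j] = x_i^T x_j.
        y     : (l,) array of labels in {0, ..., k-1} (the paper uses 1, ..., k).
        """
        l, k = alpha.shape
        quad = 0.5 * np.sum(K * (alpha @ alpha.T))   # (1/2) sum_{i,j} K_{i,j} sum_m alpha_i^m alpha_j^m
        e = np.ones((l, k))
        e[np.arange(l), y] = 0.0                     # e_i^m = 0 iff y_i = m
        linear = np.sum(alpha * e)                   # sum_{i,m} alpha_i^m e_i^m
        quad_y = np.sum(alpha[np.arange(l), y] ** 2) / (4.0 * C)
        return quad + linear + quad_y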

Problem (9) is similar to (3), but it possesses an additional quadratic term in the objective function. Further, the constraint on α_i^{y_i} is different. In (5), α_i^{y_i} ≤ C, but in (9), α_i^{y_i} is unconstrained.² We discuss two methods to derive the dual problem. The first is from a direct calculation, while the second follows from the derivation of structured SVM.

2.1 A Direct Calculation to Obtain the Dual Problem

The Lagrange function of (8) is

  L(w_1,...,w_k, ξ, α̂) = (1/2) Σ_{m=1}^k w_m^T w_m + C Σ_{i=1}^l ξ_i² − Σ_{i=1}^l Σ_{m=1}^k α̂_i^m (w_{y_i}^T x_i − w_m^T x_i − e_i^m + ξ_i),

where α̂_i^m ≥ 0, m = 1,...,k, i = 1,...,l, are Lagrange multipliers. The Lagrange dual problem is

  max_{α̂: α̂_i^m ≥ 0, ∀i,m}  ( inf_{w_1,...,w_k, ξ} L(w_1,...,w_k, ξ, α̂) ).

To minimize L under fixed α̂, we rewrite the following term in the Lagrange function

  Σ_{i=1}^l Σ_{m=1}^k α̂_i^m w_{y_i}^T x_i = Σ_{m=1}^k Σ_{i: y_i = m} (Σ_{s=1}^k α̂_i^s) w_m^T x_i = Σ_{m=1}^k w_m^T Σ_{i=1}^l (1 − e_i^m) (Σ_{s=1}^k α̂_i^s) x_i,   (10)

and have

  ∇_{w_m} L = 0  ⇒  w_m − Σ_{i=1}^l ((1 − e_i^m) Σ_{s=1}^k α̂_i^s − α̂_i^m) x_i = 0,   m = 1,...,k,   (11)
  ∂L/∂ξ_i = 2Cξ_i − Σ_{m=1}^k α̂_i^m = 0  ⇒  ξ_i = Σ_{m=1}^k α̂_i^m / (2C),   i = 1,...,l.   (12)

We simplify (11) by defining

  α_i^m ≡ (1 − e_i^m) Σ_{s=1}^k α̂_i^s − α̂_i^m,   i = 1,...,l, m = 1,...,k.   (13)

This definition is equivalent to

  α_i^m = −α̂_i^m,   m ≠ y_i,   (14)
  α_i^{y_i} = Σ_{m: m≠y_i} α̂_i^m = −Σ_{m: m≠y_i} α_i^m.   (15)

² Indeed, using Σ_{m=1}^k α_i^m = 0 and α_i^m ≤ 0, m ≠ y_i, we have α_i^{y_i} ≥ 0 for both dual problems of the L1 and L2 cases.

Therefore, we can rewrite the solution of minimizing L under fixed α̂ as

  w_m = Σ_{i=1}^l α_i^m x_i,   m = 1,...,k,   (16)
  ξ_i = (α̂_i^{y_i} + α_i^{y_i}) / (2C),   i = 1,...,l.   (17)

By (2), (10), (11), and (14)-(17), the Lagrange dual function is

  L(w_1,...,w_k, ξ, α̂)
  = (1/2) Σ_{m=1}^k w_m^T w_m − C Σ_{i=1}^l ξ_i² − Σ_{i=1}^l Σ_{m=1}^k α̂_i^m (w_{y_i}^T x_i − w_m^T x_i) + Σ_{i=1}^l Σ_{m=1}^k α̂_i^m e_i^m   (18)
  = (1/2) Σ_{m=1}^k w_m^T w_m − Σ_{m=1}^k w_m^T Σ_{i=1}^l ((1 − e_i^m) Σ_{s=1}^k α̂_i^s − α̂_i^m) x_i − C Σ_{i=1}^l ξ_i² + Σ_{i=1}^l Σ_{m=1}^k α̂_i^m e_i^m
  = −(1/2) Σ_{m=1}^k w_m^T w_m − C Σ_{i=1}^l ξ_i² + Σ_{i=1}^l Σ_{m=1}^k α̂_i^m e_i^m
  = −(1/2) Σ_{m=1}^k || Σ_{i=1}^l α_i^m x_i ||² − Σ_{i=1}^l (α̂_i^{y_i} + α_i^{y_i})² / (4C) − Σ_{i=1}^l Σ_{m=1}^k α_i^m e_i^m
  = −(1/2) Σ_{i=1}^l Σ_{j=1}^l K_{i,j} Σ_{m=1}^k α_i^m α_j^m − Σ_{i=1}^l Σ_{m=1}^k α_i^m e_i^m − Σ_{i=1}^l (α̂_i^{y_i} + α_i^{y_i})² / (4C),   (19)

where w_m and ξ_i are the minimizers in (16)-(17). Because α̂_i^m, m ≠ y_i, do not explicitly appear in (19) after applying (14) and (15), the dual problem is

  min_{α, α̂}   (1/2) Σ_{i=1}^l Σ_{j=1}^l K_{i,j} Σ_{m=1}^k α_i^m α_j^m + Σ_{i=1}^l Σ_{m=1}^k α_i^m e_i^m + Σ_{i=1}^l (α̂_i^{y_i} + α_i^{y_i})² / (4C)
  subject to   α̂_i^{y_i} ≥ 0,   i = 1,...,l,
               Σ_{m=1}^k α_i^m = 0,   i = 1,...,l,   (20)
               α_i^m ≤ 0,   i = 1,...,l, m = 1,...,k, m ≠ y_i.   (21)

Because (20) and (21) imply α_i^{y_i} ≥ 0 and α̂_i^{y_i} appears only in the term (α̂_i^{y_i} + α_i^{y_i})², the optimal α̂_i^{y_i} must be zero. After removing α̂_i^{y_i}, the derivation of the dual problem is complete.

We discuss the difference from L1 loss. If L1 loss is used, (12) becomes

  Σ_{m=1}^k α̂_i^m = C,   (22)

and the term C Σ_{i=1}^l ξ_i² in (18) disappears. Equation (22) and the fact α̂_i^{y_i} ≥ 0 lead to the constraint (5) as follows.

  α_i^{y_i} = Σ_{m: m≠y_i} α̂_i^m = C − α̂_i^{y_i} ≤ C.

For L2 loss, without the condition in (22), α_i^{y_i} is unconstrained.

2.2 Using Structured SVM Formulation to Obtain the Dual Problem

It is well known that Crammer and Singer's multi-class SVM is a special case of structured SVM (Tsochantaridis et al., 2005). By defining

  w ≡ [w_1; ...; w_k] ∈ R^{kn×1}  and  δ(i,m) ∈ R^{kn×1},

with

  δ(i,m) ≡ [0; ...; 0; x_i; 0; ...; 0; −x_i; 0; ...; 0]  (x_i in the y_i-th block and −x_i in the m-th block)  if y_i ≠ m,  and  δ(i,m) ≡ 0  if y_i = m,

problem (8) can be written as

  min_{w, ξ}   (1/2) ||w||² + C Σ_{i=1}^l ξ_i²
  subject to   w^T δ(i,m) ≥ e_i^m − ξ_i,   i = 1,...,l, m = 1,...,k.   (23)
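A small sketch of the stacked representation may help: it builds the kn-dimensional vector δ(i, m) from x_i, y_i, and m, so that w^T δ(i, m) equals w_{y_i}^T x_i − w_m^T x_i. The function and array names are ours for illustration, and class indices are taken as 0-based.

    import numpy as np

    def stack_w(W):
        """Stack w_1, ..., w_k (rows of the (k, n) array W) into one kn-vector w."""
        return W.reshape(-1)

    def delta(x_i, y_i, m, k):
        """delta(i, m): x_i in the y_i-th block, -x_i in the m-th block, zero elsewhere;
        the zero vector when m == y_i."""
        n = x_i.shape[0]
        d = np.zeros(k * n)
        if m != y_i:
            d[y_i * n:(y_i + 1) * n] = x_i
            d[m * n:(m + 1) * n] = -x_i
        return d

    # With these definitions, stack_w(W) @ delta(x_i, y_i, m, k)
    # equals W[y_i] @ x_i - W[m] @ x_i, the left-hand side of (23).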

This problem is in a similar form to L2-loss binary SVM, so the derivation of the dual problem is straightforward. Following Tsochantaridis et al. (2005), the dual problem is³

  min_{α̂}   (1/2) Σ_{i=1}^l Σ_{j=1}^l Σ_{m=1}^k Σ_{s=1}^k δ(i,m)^T δ(j,s) α̂_i^m α̂_j^s − Σ_{i=1}^l Σ_{m=1}^k α̂_i^m e_i^m + Σ_{i=1}^l (Σ_{m=1}^k α̂_i^m)² / (4C)
  subject to   α̂_i^m ≥ 0,   i = 1,...,l, m = 1,...,k.   (24)

Also, at an optimal solution, we have

  w = Σ_{i=1}^l Σ_{m=1}^k α̂_i^m δ(i,m)  and  ξ_i = Σ_{m=1}^k α̂_i^m / (2C).   (25)

Problem (24) seems to be very different from problem (9) obtained in Section 2.1. In fact, problem (24) is an intermediate result in our derivation. A careful check shows

1. α̂ is the same as the Lagrange multiplier used in Section 2.1.
2. w in (25) is the same as that in (7); see Equation (11).

In Section 2.1, we introduce a new variable α and simplify the two terms

  Σ_{i=1}^l Σ_{j=1}^l Σ_{m=1}^k Σ_{s=1}^k δ(i,m)^T δ(j,s) α̂_i^m α̂_j^s   and   Σ_{i=1}^l (Σ_{m=1}^k α̂_i^m)² / (4C)

to

  Σ_{i=1}^l Σ_{j=1}^l K_{i,j} Σ_{m=1}^k α_i^m α_j^m   and   Σ_{i=1}^l (α_i^{y_i})² / (4C),

respectively. An advantage of problem (9) is that K_{i,j} = x_i^T x_j explicitly appears in the objective function. In contrast, δ(i,m)^T δ(j,s) does not reveal details of the inner product between instances. However, a caveat of (9) is that it contains some linear constraints. An interesting question is whether the simplification from (24) to (9) allows us to apply a simpler or more efficient optimization algorithm. This issue already occurs for using L1 loss because we can either solve problem (3) or a form similar to (24). However, the dual problem of L1-loss structured SVM contains a linear constraint, but problem (24) does not.⁴ Therefore, for the L1 case, it is easy to see that the simplified form (3) should be used. However, for L2 loss, problem (24) possesses an advantage of being a bound-constrained problem. We will give some discussion about solving (9) or (24) in Section 5.5. In all remaining places we focus on problem (9) because existing implementations for the L1-loss formulation all solve the corresponding problem (3).

³ Problem (24) is slightly different from that in Tsochantaridis et al. (2005) because they remove the constraints ξ_i ≥ 0 by setting m ≠ y_i in (23).
⁴ See Proposition 5 in Tsochantaridis et al. (2005).

3 Decomposition Method and Sub-problem

Decomposition methods are currently the major approach to solve the dual problem (3) of the L1 case (Crammer and Singer, 2002; Keerthi et al., 2008). At each iteration, the k variables α_i^1, ..., α_i^k associated with an instance x_i are selected for updating, while other variables are fixed. For (3), the following sub-problem is solved.

  min_{α_i^1,...,α_i^k}   Σ_{m=1}^k ( (1/2) A (α_i^m)² + B_m α_i^m )
  subject to   Σ_{m=1}^k α_i^m = 0,   (26)
               α_i^m ≤ C_{y_i}^m,   m = 1,...,k,

where

  A = K_{i,i}  and  B_m = Σ_{j=1}^l K_{j,i} ᾱ_j^m + e_i^m − A ᾱ_i^m.   (27)

In (27), ᾱ is the solution obtained in the previous iteration. We defer the discussion on the selection of the index i to Section 5. For problem (9), we show that the sub-problem is:

  min_{α_i^1,...,α_i^k}   Σ_{m=1}^k ( (1/2) A (α_i^m)² + B_m α_i^m ) + (α_i^{y_i})² / (4C)
  subject to   Σ_{m=1}^k α_i^m = 0,   (28)
               α_i^m ≤ 0,   m = 1,...,k, m ≠ y_i,

where A and B_m are the same as in (27). The derivation of (28) is as follows. Because all elements except α_i^1, ..., α_i^k are fixed, the objective function of (9) becomes

  (1/2) ( K_{i,i} Σ_{m=1}^k (α_i^m)² + 2 Σ_{j: j≠i} K_{i,j} Σ_{m=1}^k ᾱ_j^m α_i^m ) + Σ_{m=1}^k α_i^m e_i^m + (α_i^{y_i})² / (4C) + constants
  = Σ_{m=1}^k ( (1/2) K_{i,i} (α_i^m)² + ( Σ_{j=1}^l K_{j,i} ᾱ_j^m + e_i^m − K_{i,i} ᾱ_i^m ) α_i^m ) + (α_i^{y_i})² / (4C) + constants.   (29)

Equation (29) then leads to the objective function of (28), while the constraints are directly obtained from (9). Note that

  B_m = w_m^T x_i + e_i^m − K_{i,i} ᾱ_i^m   (30)

if w_m = Σ_{i=1}^l ᾱ_i^m x_i, m = 1,...,k, are maintained.⁵

⁵ See details of solving linear Crammer and Singer's multi-class SVM in Keerthi et al. (2008).
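For concreteness, the sketch below assembles the sub-problem data A and (B_1, ..., B_k) of (27) for a selected index i in the kernel setting. The array names (K, alpha_bar) and the 0-based labels are assumptions for illustration only.

    import numpy as np

    def subproblem_coefficients(i, K, alpha_bar, y):
        """Return A and the vector (B_1, ..., B_k) of (27) for the selected instance i.

        K         : (l, l) kernel matrix.
        alpha_bar : (l, k) array holding the current dual solution.
        y         : (l,) array of 0-based labels.
        """
        l, k = alpha_bar.shape
        A = K[i, i]
        e_i = np.ones(k)
        e_i[y[i]] = 0.0                                      # e_i^m = 0 iff m = y_i
        # B_m = sum_j K_{j,i} abar_j^m + e_i^m - A * abar_i^m
        B = K[:, i] @ alpha_bar + e_i - A * alpha_bar[i]
        return A, B

In the linear case, B_m would instead be obtained from the maintained vectors w_m via (30), which avoids the O(lk) cost of the matrix-vector product above.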

4 Solving the Sub-problem

We discuss how to solve the sub-problem when A > 0. If A = 0, then x_i = 0. Thus this instance gives a constant value ξ_i = 1 to the primal objective function (8), and the values of α_i^m, m = 1,...,k, have no effect on w_m defined in (16), so we can skip solving the sub-problem.

We follow the approach by Crammer and Singer to solve the sub-problem, although there are some interesting differences. Their method first computes

  D_m = B_m + A C_{y_i}^m,   m = 1,...,k.

Then it starts with a set Φ = ∅ and sequentially adds one index m to Φ by the decreasing order of D_m until the following inequality is satisfied.

  β = ( −AC + Σ_{m∈Φ} D_m ) / |Φ|  ≥  max_{m∉Φ} D_m.   (31)

The optimal solution of (26) is computed by:

  α_i^m = min( C_{y_i}^m, (β − B_m) / A ),   m = 1,...,k.   (32)

Crammer and Singer gave a lengthy proof to show the correctness of this method. Our contribution here is to derive the algorithm and prove its correctness by easily analyzing the KKT optimality condition.

We now derive an algorithm for solving (28). Let us define

  A_y ≡ A + 1/(2C).

The KKT conditions of (28) indicate that there are scalars β, ρ_m, m = 1,...,k, such that

  Σ_{m=1}^k α_i^m = 0,   (33)
  α_i^m ≤ 0,   m ≠ y_i,   (34)
  ρ_m α_i^m = 0,  ρ_m ≥ 0,   m ≠ y_i,   (35)
  A α_i^m + B_m − β = −ρ_m,   m ≠ y_i,   (36)
  A_y α_i^{y_i} + B_{y_i} − β = 0.   (37)

Using (34), Equations (35) and (36) are equivalent to

  A α_i^m + B_m − β = 0,   if α_i^m < 0, m ≠ y_i,   (38)
  A α_i^m + B_m − β = B_m − β ≤ 0,   if α_i^m = 0, m ≠ y_i.   (39)

Now the KKT conditions become (33)-(34), (37), and (38)-(39). If β is known, we prove that

  α_i^m = min( 0, (β − B_m)/A )  if m ≠ y_i,   and   α_i^{y_i} = (β − B_{y_i}) / A_y,   (40)

satisfies all KKT conditions except (33). Clearly, the way to get α_i^m in (40) ensures α_i^m ≤ 0, m ≠ y_i, so (34) holds. From (40), when β < B_m, we have α_i^m < 0 and β − B_m = A α_i^m. Thus, (38) is satisfied. Otherwise, β ≥ B_m and α_i^m = 0 satisfy (39). Also notice that α_i^{y_i} is directly obtained from (37). The remaining task is how to find β such that (33) holds. From (33), (37), and (38) we obtain

  A_y Σ_{m: α_i^m < 0} (β − B_m) + A (β − B_{y_i}) = 0.

Hence,

  β = ( A B_{y_i} + A_y Σ_{m: α_i^m < 0} B_m ) / ( A + A_y |{m : α_i^m < 0}| )
    = ( (A/A_y) B_{y_i} + Σ_{m: α_i^m < 0} B_m ) / ( A/A_y + |{m : α_i^m < 0}| ).   (41)

Combining (41) and (39), we begin with a set Φ = ∅, and then sequentially add one index m to Φ by the decreasing order of B_m, m = 1,...,k, m ≠ y_i, until

  h = ( (A/A_y) B_{y_i} + Σ_{m∈Φ} B_m ) / ( A/A_y + |Φ| )  ≥  max_{m∉Φ} B_m.   (42)

Let β = h when (42) is satisfied. Algorithm 1 lists the details for solving the sub-problem (28). To prove (33), it is sufficient to show that β and α_i^m, ∀m, obtained by Algorithm 1 satisfy (41). This is equivalent to showing that the set Φ of indices included in step 5 of Algorithm 1 satisfies

  Φ = {m : α_i^m < 0}.

From (40), we prove the following equivalent result.

  β < B_m, ∀m ∈ Φ,   and   β ≥ B_m, ∀m ∉ Φ.   (43)

The second inequality immediately follows from (42). For the first, assume t is the last element added to Φ. When t is considered, (42) is not satisfied yet, so

  ( (A/A_y) B_{y_i} + Σ_{m∈Φ\{t}} B_m ) / ( A/A_y + |Φ| − 1 )  <  B_t.   (44)

ALGORITHM 1: Solving the sub-problem
1. Given A, A_y, and B = {B_1, ..., B_k}.
2. D ← B.
3. Swap D_1 and D_{y_i}, then sort D \ {D_1} in decreasing order.
4. r ← 2, β ← D_1 · A/A_y.
5. While r ≤ k and β / (r − 2 + A/A_y) < D_r
   5.1. β ← β + D_r.
   5.2. r ← r + 1.
6. β ← β / (r − 2 + A/A_y).
7. α_i^m ← min( 0, (β − B_m)/A ), ∀m ≠ y_i.
8. α_i^{y_i} ← (β − B_{y_i}) / A_y.

Using (44) and the fact that elements in Φ are added in the decreasing order of B_m,

  (A/A_y) B_{y_i} + Σ_{m∈Φ} B_m = (A/A_y) B_{y_i} + Σ_{m∈Φ\{t}} B_m + B_t
    < ( |Φ| − 1 + A/A_y ) B_t + B_t
    = ( |Φ| + A/A_y ) B_t
    ≤ ( |Φ| + A/A_y ) B_s,   ∀s ∈ Φ.

Thus, we have the first inequality in (43). With all KKT conditions satisfied, Algorithm 1 obtains an optimal solution of (28).

By comparing (31), (32) and (42), (40), respectively, we can see that the procedures for L1 loss and L2 loss are similar but differ in several aspects. In particular, because α_i^{y_i} is unconstrained, B_{y_i} is considered differently from the other B_m's in (42).
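The following is a direct transcription of Algorithm 1 as a small Python function, assuming A > 0 and 0-based class indices; the function name and array layout are illustrative and not taken from BSVM or LIBLINEAR.

    import numpy as np

    def solve_subproblem_L2(A, B, yi, C):
        """Solve sub-problem (28) by Algorithm 1 (requires A = K_{i,i} > 0).

        A  : the scalar K_{i,i}.
        B  : length-k array of the B_m values in (27).
        yi : 0-based index of the class y_i.
        C  : regularization parameter.
        Returns the length-k array (alpha_i^1, ..., alpha_i^k).
        """
        k = B.shape[0]
        Ay = A + 1.0 / (2.0 * C)
        ratio = A / Ay

        # Candidates B_m, m != y_i, in decreasing order (step 3 of Algorithm 1).
        D = np.sort(np.delete(B, yi))[::-1]

        # Grow Phi while the current estimate of beta is below the next largest B_m.
        beta_num = ratio * B[yi]          # (A / A_y) * B_{y_i} + sum_{m in Phi} B_m
        size = 0                          # |Phi|
        while size < k - 1 and beta_num / (ratio + size) < D[size]:
            beta_num += D[size]
            size += 1
        beta = beta_num / (ratio + size)  # Equation (42) holds now

        alpha = np.minimum(0.0, (beta - B) / A)   # (40), m != y_i
        alpha[yi] = (beta - B[yi]) / Ay           # (40), m = y_i
        return alpha

Checking that the returned vector satisfies the KKT conditions (33)-(37), for example that its entries sum to zero, is a convenient way to validate such an implementation on random inputs.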

5 Other Issues and Extensions

In this section, we discuss other details of the decomposition method. Some of them are similar to those for the L1 case. We also extend problems (8)-(9) to more general settings. In the end we discuss advantages and disadvantages of solving the two dual forms (9) and (24).

5.1 Extensions to Use Kernels

It is straightforward to extend our algorithm to use kernels. The only change is to replace K_{i,j} = x_i^T x_j in (6) with

  K_{i,j} = φ(x_i)^T φ(x_j),   (45)

where φ(x) is a function mapping data to a higher dimensional space.

5.2 Working Set Selection

We mentioned in Section 3 that at each iteration of the decomposition method, an index i is selected so that α_i^1, ..., α_i^k are updated. This procedure is called working set selection. If kernels are not used, we follow Keerthi et al. (2008) to sequentially select i ∈ {1,...,l}.⁶ For linear SVM, it is known that more sophisticated selections such as using gradient information may not be cost-effective; see the detailed discussion in Section 4.1 of Hsieh et al. (2008). For kernel SVM, we can use gradient information for working set selection because the cost is relatively low compared to that of kernel evaluations. In Crammer and Singer (2001), to solve problems with L1 loss, they select an index by

  i = arg max_{i ∈ {1,...,l}} ϕ_i,   (46)

where

  ϕ_i = max_{1≤m≤k} ĝ_i^m − min_{m: α_i^m < C_{y_i}^m} ĝ_i^m,   (47)

and ĝ_i^m, i = 1,...,l, m = 1,...,k, are the gradient of (3)'s objective function. The reason behind (46) is that ϕ_i shows the violation of the optimality condition. Note that for problem (3), α is optimal if and only if α is feasible and

  ϕ_i = 0,   i = 1,...,l.   (48)

See the derivation in Crammer and Singer (2001, Section 5). For L2 loss, we can apply a similar setting by

  ϕ_i = max_{1≤m≤k} g_i^m − min_{m: α_i^m < 0 or m = y_i} g_i^m,   i = 1,...,l,

where

  g_i^m = Σ_{j=1}^l K_{i,j} α_j^m + e_i^m + (1 − e_i^m) α_i^m / (2C),   i = 1,...,l, m = 1,...,k,

are the gradient of the objective function in (9). Note that C_{y_i}^m in (47) becomes 0 here.

5.3 Stopping Condition

From (48), a stopping condition of the decomposition method can be

  max_i ϕ_i ≤ ε,

where ε is the stopping tolerance. The same stopping condition can be used for the L2 case.

⁶ In practice, for faster convergence, at each cycle of l steps, they sequentially select indices from a permutation of {1,...,l}.
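For the kernel case, the violation ϕ_i above can be computed directly from the gradient of (9). The sketch below, with illustrative array names and 0-based labels, could serve both for the greedy selection (46) and for the stopping condition of Section 5.3.

    import numpy as np

    def violations_L2(alpha, K, y, C):
        """Return the vector (phi_1, ..., phi_l) for the L2-loss dual problem (9).

        alpha : (l, k) array of dual variables.
        K     : (l, l) kernel matrix.
        y     : (l,) array of 0-based labels.
        """
        l, k = alpha.shape
        e = np.ones((l, k))
        e[np.arange(l), y] = 0.0
        grad = K @ alpha + e                                          # sum_j K_{i,j} alpha_j^m + e_i^m
        grad[np.arange(l), y] += alpha[np.arange(l), y] / (2.0 * C)   # extra term for m = y_i
        # phi_i = max_m grad_i^m - min over {m : alpha_i^m < 0 or m = y_i} of grad_i^m
        allowed = (alpha < 0)
        allowed[np.arange(l), y] = True
        masked = np.where(allowed, grad, np.inf)
        return grad.max(axis=1) - masked.min(axis=1)

    # Greedy selection (46): i = int(np.argmax(violations_L2(alpha, K, y, C)));
    # stopping condition:    violations_L2(alpha, K, y, C).max() <= epsilon.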

5.4 Extension to Assign Different Regularization Parameters to Each Class

In some applications, we may want to assign a different regularization parameter C_i to each class. This can be easily achieved by replacing C in the earlier discussion with the corresponding C_i.

5.5 Solving Problem (9) Versus Problem (24)

In Section 2.2, we mentioned an issue of solving problem (9) or problem (24). Based on the investigation of decomposition methods so far, we give some brief discussion. Some works for structured SVM have solved the dual problem, where (24) is a special case. For example, in Chang et al. (2010), a dual coordinate descent method is applied for solving the dual problem of L2-loss structured SVM. Because (24) does not contain any linear constraint, they are able to update a single α̂_i^m at a time.⁷ This setting is related to the decomposition method discussed in Section 3, although ours updates k variables at a time. If α̂_i^m is selected for update, the computational bottleneck is on calculating

  w^T δ(i,m) = w_{y_i}^T x_i − w_m^T x_i   (49)

for constructing a one-variable sub-problem.⁸ From (11), Equation (49) involves the calculation of

  Σ_{j=1}^l α̂_j^m K_{j,i}   and   Σ_{j=1}^l α̂_j^{y_i} K_{j,i}.   (50)

The cost of 2l kernel evaluations is O(ln) if each kernel evaluation takes O(n). For our decomposition method to solve (9), to update the k variables α_i^m, m = 1,...,k, together, the number of kernel evaluations is only l; see Equations (27) and (29). More precisely, the complexity of Algorithm 1 to solve the sub-problem (28) is

  O(k log k + ln + kl),   (51)

where O(k log k) is for sorting B_m, m ≠ y_i, and O(kl) is for obtaining B_m, m = 1,...,k, in Equation (27). If k is not large, O(ln) is the dominant term in (51). This analysis indicates that regardless of how many elements in α_i^m, m = 1,...,k, are updated, we always need to calculate the i-th kernel column K_{j,i}, j = 1,...,l. In this regard, the decomposition method for problem (9) by solving a sequence of sub-problems (28) nicely allows us to update as many variables as possible under a similar number of kernel evaluations.

If a kernel is not applied, interestingly the situation becomes different. The O(ln) cost of computing (50) is reduced to O(n) because w_{y_i} and w_m are available.

⁷ This is not possible for the dual problem of L1-loss structured SVM. We have mentioned in Section 2.2 that it contains a linear constraint.
⁸ We omit details because the derivation is similar to that for deriving the sub-problem (28).

If Algorithm 1 is used, from (30), the complexity in (51) for updating k variables becomes

  O(k log k + kn).

For updating an α̂_i^m by (49), the cost is O(n). Therefore, if log k < n, the cost of updating α_i^m, m = 1,...,k, together is about k times that of updating a single variable. Then, the decomposition method for solving problem (9) and sub-problem (28) may not be better than a coordinate descent method for solving problem (24). Note that we have focused on the cost per sub-problem, but there are many other issues such as the convergence speed (i.e., the number of iterations). Memory access also affects the computational time. For the coordinate descent method to update a variable α̂_i^m, the corresponding w_m, x_i, and α̂_i^m must be accessed. In contrast, the approach of solving sub-problem (28) accesses data and variables more systematically. An important future work is to conduct a serious comparison and identify the better approach.

6 Experiments

In this section, we compare the proposed method for L2 loss with an existing implementation for L1 loss. We check linear as well as kernel multi-class SVMs. Moreover, a comparison of sensitivity to parameters is also conducted. Our implementation is extended from those in LIBLINEAR (Fan et al., 2008) and BSVM (Hsu and Lin, 2002), which respectively include solvers for linear and kernel L1-loss Crammer and Singer multi-class SVM. Programs for experiments in this paper are available at codes.zip. All data sets used are available at

6.1 Linear Multi-class SVM

We check both training time and test accuracy of using L1 and L2 losses. We consider the four data sets used in Keerthi et al. (2008): news20, MNIST, sector and rcv1. We select the regularization parameter C by checking five-fold cross-validation (CV) accuracy of using values in {2^-5, 2^-4, ..., 2^5}. The stopping tolerance is ε = 0.1. The details of the data sets are listed in Table 1, and the experiment results can be found in Table 2. The accuracy values are comparable. One may observe that the training time of using L1 loss is less. This result is opposite to that of binary classification; see experiments in Hsieh et al. (2008). In binary classification, when C approaches zero, the Hessian matrix of L2-loss SVM is close to the matrix I/(2C), where I is the identity matrix. Thus, the optimization problem is easier to solve. However, for Crammer and Singer's multi-class SVM, when C approaches zero, only l of the Hessian's kl diagonal elements become close to 1/(2C). This may be the reason why for multi-class SVM, using L2 loss does not lead to faster training time.

Table 1: Data sets for experiments of linear multi-class SVMs. n is the number of features and k is the number of classes.

  data set   #training   #testing   n   k   C for L1 loss   C for L2 loss
  news20     15,395      3,
  MNIST      60,
  sector     6,412       3,
  rcv1       15,

Table 2: Linear multi-class SVMs: we compare training time (in seconds) and test accuracy between L1 loss and L2 loss.

                      L1 loss                          L2 loss
  data set   training time   test accuracy    training time   test accuracy
  news20                     %                                 %
  MNIST                      %                                 %
  sector                     %                                 %
  rcv1                       %                                 %

6.2 Kernel Multi-class SVM

We use the same data sets and the same procedure in Hsu and Lin (2002) to compare test accuracy, training time and sparsity (i.e., percentage of training data as support vectors) of using L1 and L2 losses. We use the RBF kernel

  K(x_i, x_j) = e^{−γ ||x_i − x_j||²}.

We fix the cache size for the kernel matrix as 2048 MB. The stopping tolerance is ε = 0.1 for letter and shuttle (to avoid lengthy training time); a smaller tolerance is used for all other data sets. The data set description is in Table 3 and the results are listed in Table 4. For dna, satimage, letter and shuttle, both training and test sets are available. We follow Hsu and Lin (2002) to split the training data to 70% training and 30% validation for finding parameters among C = {2^-2, 2^-1, ..., 2^12} and γ = {2^-10, 2^-9, ..., 2^4}. We then train on the whole training set with the best parameters and report the test accuracy and the model sparsity. For the remaining data sets, whose test sets are not available, we report the best ten-fold CV accuracy and the model sparsity.⁹

From Table 4, we can see that L2-loss multi-class SVM gives comparable accuracy to L1-loss SVM. Note that the accuracy and the parameters of L1-loss multi-class SVM on some data sets are slightly different from those in Hsu and Lin (2002) because of the random data segmentation in the validation procedure and the different versions of the BSVM code. Training time and sparsity are very different between using L1 and L2 losses because they highly depend on the parameters used. To remove the effect of different parameters, in Section 6.3, we present the average result over a set of parameters.

⁹ The sparsity is the average of the 10 models in the CV procedure.

Table 3: Data sets for experiments of kernel multi-class SVMs. n is the number of features and k is the number of classes.

  data set   #training   #testing   n   k   (C, γ) for L1 loss   (C, γ) for L2 loss
  iris                                      (2 3, 2 5)           (2 10, 2 7)
  wine                                      (2 0, 2 2)           (2 1, 2 3)
  glass                                     (2 1, 2 3)           (2 1, 2 3)
  vowel                                     (2 2, 2 1)           (2 4, 2 1)
  vehicle                                   (2 7, 2 3)           (2 5, 2 4)
  segment    2,                             (2 3, 2 3)           (2 7, 2 0)
  dna        2,000       1,                 (2 1, 2 6)           (2 1, 2 6)
  satimage   4,435       2,                 (2 2, 2 2)           (2 4, 2 2)
  letter     15,000      5,                 (2 4, 2 2)           (2 11, 2 4)
  shuttle    43,                            (2 10, 2 4)          (2 9, 2 4)
  *: ε = 0.1 is used.

6.3 Sensitivity to Parameters

Parameter selection is a time consuming process. To avoid checking many parameters, we hope a method is not sensitive to parameters. In this section, we compare the sensitivity of L1 and L2 losses by presenting the average performance over a set of parameters. For linear multi-class SVM, 11 values of C are selected: {2^-5, 2^-4, ..., 2^5}, and we present the average and the standard deviation of training time and test accuracy. The results are in Table 5. For the kernel case, we pick C and γ from the two sets {2^-1, 2^2, 2^5, 2^8} and {2^-6, 2^-3, 2^0, 2^3}, respectively, so 16 different results are generated.¹⁰ We then report average and standard deviation in Table 6. From Tables 5 and 6, L2 loss is worse than L1 loss on the average training time and sparsity. The higher percentage of support vectors is the same as the situation in binary classification because the squared hinge loss leads to many small but nonzero α_i^m. Interestingly, the average performance (test or CV accuracy) of L2 loss is better. Therefore, if using L2 loss, it may be easier to locate a good parameter setting. We find that the same situation occurs in binary classification, although this result was not clearly mentioned in previous studies. An investigation shows that L2 loss gives better accuracy when C is small. In this situation, both L1- and L2-loss SVM suffer from the underfitting of training data. However, because L2 loss gives a higher penalty than L1 loss, underfitting is less severe.

¹⁰ We use a subset of the (C, γ) values in Section 6.2 to save the running time. To report the average training time, we must run all jobs on the same machine. In contrast, several machines were used in Section 6.2 to obtain CV accuracy of all parameters.

6.4 Summary of Experiments

Based on the experiments, we have the following findings.

1. If using the best parameter, L2 loss gives comparable accuracy to L1 loss. For the training time and the number of support vectors, L2 loss is better for some problems, but worse for some others. The situation highly depends on the chosen parameter.

2. If we take the whole procedure of parameter selection into consideration, L2 loss is worse than L1 loss on training time and sparsity. However, the region of suitable parameters is larger. Therefore, we can check fewer parameters if using L2 loss.

Table 4: Kernel multi-class SVMs: we compare training time (in seconds), test or CV accuracy, and sparsity between L1 loss and L2 loss. nSV represents the percentage of training data that are support vectors.

                      L1 loss                                         L2 loss
  data set   training time   test or CV accuracy   nSV       training time   test or CV accuracy   nSV
  iris                        %                    37.33%                     %                    16.67%
  wine                        %                    28.65%                     %                    33.96%
  glass                       %                    80.01%                     %                    98.91%
  vowel                       %                    67.93%                     %                    72.31%
  vehicle                     %                    53.73%                     %                    65.29%
  segment                     %                    46.65%                     %                    19.08%
  dna                         %                    46.90%                     %                    56.10%
  satimage                    %                    60.41%                     %                    60.92%
  letter                      %                    42.56%                     %                    78.56%
  shuttle                     %                    0.66%                      %                    1.41%
  *: ε = 0.1 is used.

Table 5: Sensitivity to parameters: linear multi-class SVMs. We present average ± standard deviation.

                      L1 loss                                 L2 loss
  data set   training time   test accuracy         training time   test accuracy
  news20          ±              ± 1.33%           3.59 ±              ± 0.41%
  MNIST           ±              ± 0.20%                ±              ± 0.21%
  sector      5.96 ±             ± 0.77%           7.66 ±              ± 0.46%
  rcv1            ±              ± 2.42%           3.83 ±              ± 0.89%

7 Conclusions

This paper extends Crammer and Singer's multi-class SVM to apply L2 loss. We give detailed derivations and discuss some interesting differences from the L1 case. Our results serve as a useful reference for those who intend to use Crammer and Singer's method with L2 loss.

Finally, we have extended the software BSVM (after version 2.07) to include the proposed implementation.

Table 6: Sensitivity to parameters: kernel multi-class SVMs. We present average ± standard deviation. nSV represents the percentage of training data that are support vectors.

  L1 loss
  data set   training time   test or CV accuracy   nSV
  iris       0.10 ±              ± 6.93%               ± 17.74%
  wine       0.09 ±              ± 0.95%               ± 31.15%
  glass      5.57 ±              ± 5.24%               ± 8.74%
  vowel           ±              ± 15.12%              ± 13.17%
  vehicle         ±              ± 5.99%               ± 16.68%
  segment         ±              ± 2.91%               ±
  dna        6.29 ±              ± 7.85%               ± 20.68%
  satimage        ±              ± 2.83%               ± 19.66%
  letter          ±              ± 9.18%               ± 18.30%
  shuttle         ±              ± 1.89%           8.07 ± 6.63%

  L2 loss
  data set   training time   test or CV accuracy   nSV
  iris       0.18 ±              ± 2.12%               ± 25.68%
  wine       0.11 ±              ± 1.11%               ± 30.97%
  glass           ±              ± 3.40%               ± 9.61%
  vowel           ±              ± 12.15%              ± 12.72%
  vehicle         ±              ± 4.91%               ± 16.52%
  segment         ±              ± 2.07%               ± 20.86%
  dna             ±              ± 7.73%               ± 19.72%
  satimage        ±              ± 2.21%               ± 18.89%
  letter          ±              ± 7.24%               ± 21.32%
  shuttle         ±              ± 1.62%               ± 11.00%
  *: ε = 0.1 is used.

Acknowledgment

This work was supported in part by the National Science Council of Taiwan via the grant E MY3. The authors thank the anonymous reviewers and Ming-Wei Chang for valuable comments. We also thank Yong Zhuang and Wei-Sheng Chin for their help in finding errors of this paper.

References

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144-152. ACM Press, 1992.

Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured output learning with indirect supervision. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML), 2010.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265-292, 2001.

Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2-3):201-233, 2002.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008. URL ~cjlin/papers/liblinear.pdf.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), 2008. URL cddual.pdf.

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2):415-425, 2002.

S. Sathiya Keerthi, Sellamanickam Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A sequential dual method for large scale multi-class linear SVMs. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453-1484, 2005.

A Solving the Sub-problems when A ≤ 0

Our decomposition method only solves the sub-problem when A > 0. To cover the case when K is not a valid kernel and K_{i,i} is any possible value, we still need to solve the sub-problems when A ≤ 0.

A.1 A = 0

When A = 0, for L1 loss, the sub-problem (26) reduces to a linear programming problem. Define

  m* ≡ arg max_{m: m≠y_i} B_m.

Then the optimal solution is

  α_i^m = 0, m = 1,...,k,   if B_{y_i} − B_{m*} ≥ 0,
  α_i^{y_i} = C,  α_i^{m*} = −α_i^{y_i},  α_i^m = 0 for m ≠ y_i and m ≠ m*,   if B_{y_i} − B_{m*} < 0.

It is more complicated in the L2-loss case, because there is a quadratic term of α_i^{y_i}. To solve the sub-problem (28), we reformulate it by the following procedure. From footnote 2, we know α_i^{y_i} ≥ 0. For any fixed α_i^{y_i}, the sub-problem becomes

  min_{α_i^m, m≠y_i}   Σ_{m: m≠y_i} B_m α_i^m
  subject to   Σ_{m: m≠y_i} α_i^m = −α_i^{y_i},
               α_i^m ≤ 0,  m ≠ y_i.

Clearly, the solution is

  α_i^m = −α_i^{y_i}  if m = m*,  and  α_i^m = 0  otherwise.   (52)

Therefore, the sub-problem (28) is reduced to the following one-variable problem.

  min_{α_i^{y_i} ≥ 0}   (α_i^{y_i})² / (4C) + (B_{y_i} − B_{m*}) α_i^{y_i}.   (53)

The solution of (53) is

  α_i^{y_i} = max( 0, −2(B_{y_i} − B_{m*})C ).   (54)

Using (52) and (54), the optimal solution can be written as

  α_i^m = 0, m = 1,...,k,   if B_{y_i} − B_{m*} ≥ 0,
  α_i^{y_i} = −2(B_{y_i} − B_{m*})C,  α_i^{m*} = −α_i^{y_i},  α_i^m = 0 for m ≠ y_i and m ≠ m*,   if B_{y_i} − B_{m*} < 0.
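The closed form above for the L2-loss sub-problem with A = 0 is easy to state in code; the following short sketch uses illustrative names and 0-based labels.

    import numpy as np

    def solve_subproblem_L2_A0(B, yi, C):
        """Closed-form solution of (28) when A = K_{i,i} = 0, following (52) and (54)."""
        k = B.shape[0]
        others = [m for m in range(k) if m != yi]
        m_star = max(others, key=lambda m: B[m])     # m* = arg max_{m != y_i} B_m
        alpha = np.zeros(k)
        diff = B[yi] - B[m_star]
        if diff < 0:
            alpha[yi] = -2.0 * C * diff              # alpha_i^{y_i} = -2(B_{y_i} - B_{m*})C > 0
            alpha[m_star] = -alpha[yi]
        return alpha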

A.2 A < 0

For any given α_i^{y_i} that satisfies its corresponding constraints, both (26) and (28) are equivalent to

  min_{α_i^m, m≠y_i}   Σ_{m: m≠y_i} ( (1/2) A (α_i^m)² + B_m α_i^m )
  subject to   Σ_{m: m≠y_i} α_i^m = −α_i^{y_i},
               α_i^m ≤ 0,  m ≠ y_i.

When A < 0, it is equivalent to

  max_{α_i^m, m≠y_i}   Σ_{m: m≠y_i} (α_i^m)² + Σ_{m: m≠y_i} 2 (B_m / A) α_i^m
  subject to   Σ_{m: m≠y_i} α_i^m = −α_i^{y_i},   (55)
               α_i^m ≤ 0,  m ≠ y_i.   (56)

By constraints (55) and (56), we have

  Σ_{m: m≠y_i} (α_i^m)² ≤ ( Σ_{m: m≠y_i} α_i^m )² = (α_i^{y_i})²,

and

  Σ_{m: m≠y_i} (B_m / A) α_i^m ≤ (B_{m*} / A) Σ_{m: m≠y_i} α_i^m = −(B_{m*} / A) α_i^{y_i}.

Note that when A < 0,

  m* = arg max_{m: m≠y_i} B_m = arg min_{m: m≠y_i} B_m / A.

Thus clearly the optimal solution is

  α_i^m = −α_i^{y_i}  if m = m*,  and  α_i^m = 0  otherwise.

Sub-problem (26) is then reduced to the following one-variable problem.

  min_{0 ≤ α_i^{y_i} ≤ C}   A (α_i^{y_i})² + (B_{y_i} − B_{m*}) α_i^{y_i} = A ( α_i^{y_i} + (B_{y_i} − B_{m*}) / (2A) )² + constants.

Its solution is

  α_i^{y_i} = 0  if −(B_{y_i} − B_{m*}) / (2A) ≥ C/2,  and  α_i^{y_i} = C  otherwise.

Combining them together, the optimal solution of (26) when A < 0 is

  α_i^m = 0, m = 1,...,k,   if B_{y_i} − B_{m*} ≥ −AC,
  α_i^{y_i} = C,  α_i^{m*} = −α_i^{y_i},  α_i^m = 0 for m ≠ y_i and m ≠ m*,   if B_{y_i} − B_{m*} < −AC.   (57)

When A = 0, −AC = 0. Therefore, (57) can be used in the L1 case for A ≤ 0. For problem (28), when A < 0, it is reduced to another one-variable problem.

  min_{α_i^{y_i} ≥ 0}   Ā (α_i^{y_i})² + (B_{y_i} − B_{m*}) α_i^{y_i},   (58)

where Ā ≡ A + 1/(4C). If Ā = 0, then (58) reduces to a trivial problem with optimal solution

  α_i^{y_i} = 0  if B_{y_i} − B_{m*} ≥ 0,  and  α_i^{y_i} = ∞  if B_{y_i} − B_{m*} < 0.

Thus the optimal solution of (28) when Ā = 0 is

  α_i^m = 0, m = 1,...,k,   if B_{y_i} − B_{m*} ≥ 0,
  α_i^{y_i} = ∞,  α_i^{m*} = −∞,  α_i^m = 0 for m ≠ y_i and m ≠ m*,   if B_{y_i} − B_{m*} < 0.

If Ā ≠ 0, (58) is equivalent to

  min_{α_i^{y_i} ≥ 0}   Ā ( α_i^{y_i} + (B_{y_i} − B_{m*}) / (2Ā) )².

When Ā < 0, the optimal solution of (28) is α_i^{y_i} = ∞, α_i^{m*} = −∞, and α_i^m = 0 for m ≠ y_i and m ≠ m*. While if Ā > 0 and A < 0, the optimum occurs at

  α_i^m = 0, m = 1,...,k,   if B_{y_i} − B_{m*} ≥ 0,
  α_i^{y_i} = −(B_{y_i} − B_{m*}) / (2Ā),  α_i^{m*} = −α_i^{y_i},  α_i^m = 0 for m ≠ y_i and m ≠ m*,   if B_{y_i} − B_{m*} < 0.   (59)

Note that when A = 0, 1/(2Ā) = 2C. Thus (59) can be used in the L2 case for Ā > 0, A ≤ 0.


More information

Lecture 12: Discrete Laplacian

Lecture 12: Discrete Laplacian Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

Solving the Quadratic Eigenvalue Complementarity Problem by DC Programming

Solving the Quadratic Eigenvalue Complementarity Problem by DC Programming Solvng the Quadratc Egenvalue Complementarty Problem by DC Programmng Y-Shua Nu 1, Joaqum Júdce, Le Th Hoa An 3 and Pham Dnh Tao 4 1 Shangha JaoTong Unversty, Maths Departement and SJTU-Parstech, Chna

More information

A fast iterative algorithm for support vector data description

A fast iterative algorithm for support vector data description https://do.org/10.1007/s13042-018-0796-7 ORIGINAL ARTICLE A fast teratve algorthm for support vector data descrpton Songfeng Zheng 1 Receved: 9 February 2017 / Accepted: 26 February 2018 Sprnger-Verlag

More information

Online Classification: Perceptron and Winnow

Online Classification: Perceptron and Winnow E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2018 LECTURE 16 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 218 LECTURE 16 1 why teratve methods f we have a lnear system Ax = b where A s very, very large but s ether sparse or structured (eg, banded, Toepltz, banded plus

More information

Hongyi Miao, College of Science, Nanjing Forestry University, Nanjing ,China. (Received 20 June 2013, accepted 11 March 2014) I)ϕ (k)

Hongyi Miao, College of Science, Nanjing Forestry University, Nanjing ,China. (Received 20 June 2013, accepted 11 March 2014) I)ϕ (k) ISSN 1749-3889 (prnt), 1749-3897 (onlne) Internatonal Journal of Nonlnear Scence Vol.17(2014) No.2,pp.188-192 Modfed Block Jacob-Davdson Method for Solvng Large Sparse Egenproblems Hongy Mao, College of

More information

Regularized Discriminant Analysis for Face Recognition

Regularized Discriminant Analysis for Face Recognition 1 Regularzed Dscrmnant Analyss for Face Recognton Itz Pma, Mayer Aladem Department of Electrcal and Computer Engneerng, Ben-Guron Unversty of the Negev P.O.Box 653, Beer-Sheva, 845, Israel. Abstract Ths

More information

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

Feature Selection in Multi-instance Learning

Feature Selection in Multi-instance Learning The Nnth Internatonal Symposum on Operatons Research and Its Applcatons (ISORA 10) Chengdu-Juzhagou, Chna, August 19 23, 2010 Copyrght 2010 ORSC & APORC, pp. 462 469 Feature Selecton n Mult-nstance Learnng

More information

Fisher Linear Discriminant Analysis

Fisher Linear Discriminant Analysis Fsher Lnear Dscrmnant Analyss Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan Fsher lnear

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal

Inner Product. Euclidean Space. Orthonormal Basis. Orthogonal Inner Product Defnton 1 () A Eucldean space s a fnte-dmensonal vector space over the reals R, wth an nner product,. Defnton 2 (Inner Product) An nner product, on a real vector space X s a symmetrc, blnear,

More information

Linear Approximation with Regularization and Moving Least Squares

Linear Approximation with Regularization and Moving Least Squares Lnear Approxmaton wth Regularzaton and Movng Least Squares Igor Grešovn May 007 Revson 4.6 (Revson : March 004). 5 4 3 0.5 3 3.5 4 Contents: Lnear Fttng...4. Weghted Least Squares n Functon Approxmaton...

More information

Lecture 6: Support Vector Machines

Lecture 6: Support Vector Machines Lecture 6: Support Vector Machnes Marna Melă mmp@stat.washngton.edu Department of Statstcs Unversty of Washngton November, 2018 Lnear SVM s The margn and the expected classfcaton error Maxmum Margn Lnear

More information

Interactive Bi-Level Multi-Objective Integer. Non-linear Programming Problem

Interactive Bi-Level Multi-Objective Integer. Non-linear Programming Problem Appled Mathematcal Scences Vol 5 0 no 65 3 33 Interactve B-Level Mult-Objectve Integer Non-lnear Programmng Problem O E Emam Department of Informaton Systems aculty of Computer Scence and nformaton Helwan

More information

Computing Correlated Equilibria in Multi-Player Games

Computing Correlated Equilibria in Multi-Player Games Computng Correlated Equlbra n Mult-Player Games Chrstos H. Papadmtrou Presented by Zhanxang Huang December 7th, 2005 1 The Author Dr. Chrstos H. Papadmtrou CS professor at UC Berkley (taught at Harvard,

More information

BOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS

BOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS BOUNDEDNESS OF THE IESZ TANSFOM WITH MATIX A WEIGHTS Introducton Let L = L ( n, be the functon space wth norm (ˆ f L = f(x C dx d < For a d d matrx valued functon W : wth W (x postve sem-defnte for all

More information

Learning Theory: Lecture Notes

Learning Theory: Lecture Notes Learnng Theory: Lecture Notes Lecturer: Kamalka Chaudhur Scrbe: Qush Wang October 27, 2012 1 The Agnostc PAC Model Recall that one of the constrants of the PAC model s that the data dstrbuton has to be

More information

Lecture 4. Instructor: Haipeng Luo

Lecture 4. Instructor: Haipeng Luo Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would

More information

6.854J / J Advanced Algorithms Fall 2008

6.854J / J Advanced Algorithms Fall 2008 MIT OpenCourseWare http://ocw.mt.edu 6.854J / 18.415J Advanced Algorthms Fall 2008 For nformaton about ctng these materals or our Terms of Use, vst: http://ocw.mt.edu/terms. 18.415/6.854 Advanced Algorthms

More information