A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM
A Study on L2-Loss (Squared Hinge-Loss) Multi-Class SVM

Ching-Pei Lee and Chih-Jen Lin
Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan

Keywords: Support vector machines, Multi-class classification, Squared hinge loss, L2 loss.

Abstract

Crammer and Singer's method is one of the most popular multi-class SVMs. It considers L1 loss (hinge loss) in a complicated optimization problem. In SVM, squared hinge loss (L2 loss) is a common alternative to L1 loss, but surprisingly we have not seen any paper studying details of Crammer and Singer's method using L2 loss. In this note, we conduct a thorough investigation. We show that the derivation is not trivial and has some subtle differences from the L1 case. Details provided in this work can be a useful reference for those who intend to use Crammer and Singer's method with L2 loss. They do not need a tedious process to derive everything by themselves. Further, we present some new results/discussion for both L1- and L2-loss formulations.

1 Introduction

Support Vector Machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995) were originally designed for binary classification. In recent years, many approaches have been proposed to extend SVM to handle multi-class classification problems; see, for example, a detailed comparison in Hsu and Lin (2002). Among these works, the method proposed by Crammer and Singer (2001, 2002) has been widely used. They extend the optimization problem of L1-loss (hinge-loss) SVM to a multi-class formulation. In binary classification, squared hinge loss (L2 loss) is a common alternative to L1 loss, but surprisingly we have not found any paper studying details of Crammer and Singer's method with L2 loss. This is inconvenient because, for example, we do not know what the dual problem is.[1] Although the dual problem of two-class SVM using L2 loss is well known, it cannot be directly extended to the multi-class case. In fact, the derivation is non-trivial and has some subtle differences from the L1 case.
Also, the algorithm to solve the dual problem (for both kernel and linear situations) must be modified. We think there is a need to give all the details for future reference. Then those who intend to use Crammer and
[1] Indeed, the dual problem has been provided in some works on structured SVM, where Crammer and Singer's multi-class SVM is a special case. However, their form can be simplified for multi-class SVM; see a detailed discussion in Section 2.2.
Singer's method with L2 loss do not need a tedious procedure to derive everything by themselves. In addition to the main contribution of investigating L2-loss multi-class SVM, we present some new results for both the L1- and L2-loss cases. First, we discuss the differences between the dual problem of Crammer and Singer's multi-class SVM and that of structured SVM, although we focus more on the comparison of the L2-loss formulation. Second, we give a simpler derivation for solving the sub-problem in the decomposition method used to minimize the dual problem.

This paper is organized as follows. In Section 2, we introduce L2-loss multi-class SVM and derive its dual problem. We discuss the connection to structured SVM, which is a generalization of Crammer and Singer's multi-class SVM. Then in Section 3, we extend a decomposition method to solve the optimization problem. In particular, we obtain the sub-problem to be solved at each iteration. A procedure to find the solution of the sub-problem is given in Section 4. Our derivation and proof are simpler than Crammer and Singer's. In Section 5, we discuss some implementation issues and extensions. Experiments in Section 6 compare the performance of L1-loss and L2-loss multi-class SVM using both linear and nonlinear kernels. Section 7 then concludes this paper.

2 Formulation

Given a set of instance-label pairs $(x_i, y_i)$, $x_i \in R^n$, $y_i \in \{1,\ldots,k\}$, $i = 1,\ldots,l$, Crammer and Singer (2002) proposed a multi-class SVM approach by solving the following optimization problem:

    $\min_{w_1,\ldots,w_k,\xi} \; \frac{1}{2}\sum_{m=1}^{k} w_m^T w_m + C\sum_{i=1}^{l}\xi_i$
    subject to $\; w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i, \quad i = 1,\ldots,l,\; m = 1,\ldots,k,$    (1)

where

    $e_i^m = \begin{cases} 0 & \text{if } y_i = m,\\ 1 & \text{if } y_i \ne m. \end{cases}$

Note that if $y_i = m$, the constraint is the same as $\xi_i \ge 0$. The decision function for predicting the label of an instance $x$ is

    $\arg\max_{m=1,\ldots,k} w_m^T x.$    (2)
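The decision function (2) is simply an arg-max over the $k$ linear scores. A minimal sketch (the matrix `W` and test point `x` below are made-up illustrative values, not from the paper):

```python
import numpy as np

def predict(W, x):
    """Decision function (2): arg max_m w_m^T x over the k classes.

    W is a (k, n) matrix whose rows are w_1, ..., w_k (0-based here).
    """
    return int(np.argmax(W @ x))

# Illustrative values: k = 3 classes, n = 2 features.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
x = np.array([0.2, 0.9])
label = predict(W, x)  # scores are [0.2, 0.9, -1.1], so class index 1 wins
```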
The dual problem of (1) is

    $\min_{\alpha} \; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} K_{i,j}\sum_{m=1}^{k}\alpha_i^m\alpha_j^m + \sum_{i=1}^{l}\sum_{m=1}^{k}\alpha_i^m e_i^m$
    subject to $\; \sum_{m=1}^{k}\alpha_i^m = 0, \quad i = 1,\ldots,l,$    (3)
    $\alpha_i^m \le 0, \quad i = 1,\ldots,l,\; m = 1,\ldots,k,\; m \ne y_i,$    (4)
    $\alpha_i^{y_i} \le C, \quad i = 1,\ldots,l,$    (5)

where

    $\alpha = [\alpha_1^1,\ldots,\alpha_1^k,\ldots,\alpha_l^1,\ldots,\alpha_l^k]^T$ and $K_{i,j} = x_i^T x_j.$    (6)

Constraints (4) and (5) are often combined as

    $\alpha_i^m \le C_{y_i}^m$, where $C_{y_i}^m = \begin{cases} 0 & \text{if } y_i \ne m,\\ C & \text{if } y_i = m. \end{cases}$

We separate them in order to compare with the dual problem of using L2 loss. After solving (3), one can compute the optimal $w_m$ by

    $w_m = \sum_{i=1}^{l}\alpha_i^m x_i, \quad m = 1,\ldots,k.$    (7)

In this paper, we extend problem (1) to use L2 loss. By changing the loss term from $\sum_i \xi_i$ to $\sum_i \xi_i^2$, the primal problem becomes

    $\min_{w_1,\ldots,w_k,\xi} \; \frac{1}{2}\sum_{m=1}^{k} w_m^T w_m + C\sum_{i=1}^{l}\xi_i^2$
    subject to $\; w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i, \quad i = 1,\ldots,l,\; m = 1,\ldots,k.$    (8)

The constraint $w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i$ when $m = y_i$ can be removed because for L2 loss, $\xi_i \ge 0$ holds at an optimum without this constraint. We keep it here in order to compare with the formulation of using L1 loss. We will derive the following dual problem:

    $\min_{\alpha} \; f(\alpha)$
    subject to $\; \sum_{m=1}^{k}\alpha_i^m = 0, \quad i = 1,\ldots,l,$    (9)
    $\alpha_i^m \le 0, \quad i = 1,\ldots,l,\; m = 1,\ldots,k,\; m \ne y_i,$

where

    $f(\alpha) = \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} K_{i,j}\sum_{m=1}^{k}\alpha_i^m\alpha_j^m + \sum_{i=1}^{l}\sum_{m=1}^{k}\alpha_i^m e_i^m + \sum_{i=1}^{l}\frac{(\alpha_i^{y_i})^2}{4C}.$
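Equation (7) recovers the primal solution from the dual variables. A small sketch (the data and $\alpha$ values below are made up, chosen only so that each row satisfies the sum-to-zero constraint in (3)):

```python
import numpy as np

def primal_from_dual(alpha, X):
    """Recover w_m = sum_i alpha_i^m x_i (Equation (7)).

    alpha: (l, k) array of dual variables; X: (l, n) data matrix.
    Returns W of shape (k, n) whose rows are w_1, ..., w_k.
    """
    return alpha.T @ X

# Illustrative values: l = 2 instances, n = 2 features, k = 3 classes.
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])
alpha = np.array([[0.5, -0.25, -0.25],   # each row sums to zero, as (3) requires
                  [-0.3, 0.6, -0.3]])
W = primal_from_dual(alpha, X)
```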
Problem (9) is similar to (3), but it possesses an additional quadratic term in the objective function. Further, the constraint on $\alpha_i^{y_i}$ is different: in (5), $\alpha_i^{y_i} \le C$, but in (9), $\alpha_i^{y_i}$ is unconstrained.[2] We discuss two methods to derive the dual problem. The first is a direct calculation, while the second follows from the derivation of structured SVM.

2.1 A Direct Calculation to Obtain the Dual Problem

The Lagrange function of (8) is

    $L(w_1,\ldots,w_k,\xi,\hat\alpha) = \frac{1}{2}\sum_{m=1}^{k} w_m^T w_m + C\sum_{i=1}^{l}\xi_i^2 - \sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m\big(w_{y_i}^T x_i - w_m^T x_i - e_i^m + \xi_i\big),$

where $\hat\alpha_i^m \ge 0$, $m = 1,\ldots,k$, $i = 1,\ldots,l$, are Lagrange multipliers. The Lagrange dual problem is

    $\max_{\hat\alpha:\,\hat\alpha_i^m \ge 0\;\forall i,m} \Big( \inf_{w_1,\ldots,w_k,\xi} L(w_1,\ldots,w_k,\xi,\hat\alpha) \Big).$

To minimize $L$ under fixed $\hat\alpha$, we rewrite the following term in the Lagrange function:

    $\sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m w_{y_i}^T x_i = \sum_{m=1}^{k} w_m^T \sum_{i:\,y_i = m}\Big(\sum_{s=1}^{k}\hat\alpha_i^s\Big) x_i,$    (10)

and have

    $\nabla_{w_m} L = 0 \;\Rightarrow\; w_m - \sum_{i=1}^{l}\Big((1-e_i^m)\sum_{s=1}^{k}\hat\alpha_i^s - \hat\alpha_i^m\Big) x_i = 0, \quad m = 1,\ldots,k,$    (11)
    $\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; 2C\xi_i - \sum_{m=1}^{k}\hat\alpha_i^m = 0 \;\Rightarrow\; \xi_i = \frac{\sum_{m=1}^{k}\hat\alpha_i^m}{2C}, \quad i = 1,\ldots,l.$    (12)

We simplify (11) by defining

    $\alpha_i^m \equiv (1-e_i^m)\sum_{s=1}^{k}\hat\alpha_i^s - \hat\alpha_i^m, \quad i = 1,\ldots,l,\; m = 1,\ldots,k.$    (13)

This definition is equivalent to

    $\alpha_i^m = -\hat\alpha_i^m, \quad m \ne y_i,$    (14)
    $\alpha_i^{y_i} = \sum_{m:\,m \ne y_i}\hat\alpha_i^m = -\sum_{m:\,m \ne y_i}\alpha_i^m.$    (15)

[2] Indeed, using $\sum_{m=1}^{k}\alpha_i^m = 0$ and $\alpha_i^m \le 0$, $m \ne y_i$, we have $\alpha_i^{y_i} \ge 0$ for both the dual problems of the L1 and L2 cases.
Therefore, we can rewrite the solution of minimizing $L$ under fixed $\hat\alpha$ as

    $w_m^* = \sum_{i=1}^{l}\alpha_i^m x_i, \quad m = 1,\ldots,k,$    (16)
    $\xi_i^* = \frac{\hat\alpha_i^{y_i} + \alpha_i^{y_i}}{2C}, \quad i = 1,\ldots,l.$    (17)

By (2), (10), (11), and (14)-(17), the Lagrange dual function is

    $L(w_1^*,\ldots,w_k^*,\xi^*,\hat\alpha) = \frac{1}{2}\sum_{m=1}^{k}(w_m^*)^T w_m^* + C\sum_{i=1}^{l}(\xi_i^*)^2 - \sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m\big((w_{y_i}^*)^T x_i - (w_m^*)^T x_i - e_i^m + \xi_i^*\big)$    (18)
    $= \frac{1}{2}\sum_{m=1}^{k}(w_m^*)^T w_m^* - \sum_{m=1}^{k}(w_m^*)^T w_m^* + C\sum_{i=1}^{l}(\xi_i^*)^2 - \sum_{i=1}^{l} 2C(\xi_i^*)^2 + \sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m e_i^m$
    $= -\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} K_{i,j}\sum_{m=1}^{k}\alpha_i^m\alpha_j^m - \sum_{i=1}^{l}\sum_{m=1}^{k}\alpha_i^m e_i^m - \sum_{i=1}^{l}\frac{(\hat\alpha_i^{y_i} + \alpha_i^{y_i})^2}{4C},$    (19)

where (10)-(11) combine the $w$ terms, (12) combines the $\xi$ terms, and (14) together with $e_i^{y_i} = 0$ gives $\sum_{i,m}\hat\alpha_i^m e_i^m = -\sum_{i,m}\alpha_i^m e_i^m$. Because $\hat\alpha_i^m$, $m \ne y_i$, appear in (19) only through (14) and (15), the dual problem is

    $\min_{\alpha,\hat\alpha} \; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} K_{i,j}\sum_{m=1}^{k}\alpha_i^m\alpha_j^m + \sum_{i=1}^{l}\sum_{m=1}^{k}\alpha_i^m e_i^m + \sum_{i=1}^{l}\frac{(\hat\alpha_i^{y_i} + \alpha_i^{y_i})^2}{4C}$
    subject to $\; \hat\alpha_i^{y_i} \ge 0, \quad i = 1,\ldots,l,$
    $\sum_{m=1}^{k}\alpha_i^m = 0, \quad i = 1,\ldots,l,$    (20)
    $\alpha_i^m \le 0, \quad i = 1,\ldots,l,\; m = 1,\ldots,k,\; m \ne y_i.$    (21)

Because (20) and (21) imply $\alpha_i^{y_i} \ge 0$, and $\hat\alpha_i^{y_i}$ appears only in the term $(\hat\alpha_i^{y_i} + \alpha_i^{y_i})^2$, the optimal $\hat\alpha_i^{y_i}$ must be zero. After removing $\hat\alpha_i^{y_i}$, the derivation of the dual problem is complete.
We discuss the difference from L1 loss. If L1 loss is used, (12) becomes

    $\sum_{m=1}^{k}\hat\alpha_i^m = C,$    (22)

and $C\sum_{i=1}^{l}(\xi_i^*)^2$ in (18) disappears. Equation (22) and the fact $\hat\alpha_i^{y_i} \ge 0$ lead to the constraint (5) as follows:

    $\alpha_i^{y_i} = \sum_{m:\,m \ne y_i}\hat\alpha_i^m = C - \hat\alpha_i^{y_i} \le C.$

For L2 loss, without the condition in (22), $\alpha_i^{y_i}$ is unconstrained.

2.2 Using the Structured SVM Formulation to Obtain the Dual Problem

It is well known that Crammer and Singer's multi-class SVM is a special case of structured SVM (Tsochantaridis et al., 2005). By defining

    $w \equiv \begin{bmatrix} w_1\\ \vdots\\ w_k \end{bmatrix} \in R^{kn \times 1}$

and $\delta(i,m) \in R^{kn \times 1}$, where $\delta(i,m)$ has $x_i$ in the $y_i$-th block, $-x_i$ in the $m$-th block, and zeros elsewhere if $y_i \ne m$, and $\delta(i,m) = 0$ if $y_i = m$, problem (8) can be written as

    $\min_{w,\xi} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\xi_i^2$
    subject to $\; w^T\delta(i,m) \ge e_i^m - \xi_i, \quad i = 1,\ldots,l,\; m = 1,\ldots,k.$    (23)

This problem is in a similar form to L2-loss binary SVM, so the derivation of the dual problem is straightforward. Following Tsochantaridis et al. (2005), the dual problem is[3]
[3] Problem (24) is slightly different from that in Tsochantaridis et al. (2005) because they remove the constraints $\xi_i \ge 0$ by setting $m \ne y_i$ in (23).
    $\min_{\hat\alpha} \; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\sum_{m=1}^{k}\sum_{s=1}^{k}\delta(i,m)^T\delta(j,s)\,\hat\alpha_i^m\hat\alpha_j^s - \sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m e_i^m + \sum_{i=1}^{l}\frac{\big(\sum_{m=1}^{k}\hat\alpha_i^m\big)^2}{4C}$
    subject to $\; \hat\alpha_i^m \ge 0, \quad i = 1,\ldots,l,\; m = 1,\ldots,k.$    (24)

Also, at an optimal solution, we have

    $w^* = \sum_{i=1}^{l}\sum_{m=1}^{k}\hat\alpha_i^m\delta(i,m)$ and $\xi_i^* = \frac{\sum_{m=1}^{k}\hat\alpha_i^m}{2C}.$    (25)

Problem (24) seems to be very different from problem (9) obtained in Section 2.1. In fact, problem (24) is an intermediate result in our derivation. A careful check shows:
1. $\hat\alpha$ is the same as the Lagrange multiplier used in Section 2.1.
2. $w^*$ in (25) is the same as that in (7); see Equation (11).
In Section 2.1, we introduce a new variable $\alpha$ and simplify the two terms

    $\sum_{i,j}\sum_{m,s}\delta(i,m)^T\delta(j,s)\,\hat\alpha_i^m\hat\alpha_j^s$ and $\sum_{i=1}^{l}\frac{\big(\sum_{m=1}^{k}\hat\alpha_i^m\big)^2}{4C}$

to

    $\sum_{i,j} K_{i,j}\sum_{m}\alpha_i^m\alpha_j^m$ and $\sum_{i=1}^{l}\frac{(\alpha_i^{y_i})^2}{4C},$

respectively. An advantage of problem (9) is that $K_{i,j} = x_i^T x_j$ explicitly appears in the objective function. In contrast, $\delta(i,m)^T\delta(j,s)$ does not reveal details of the inner product between instances. However, a caveat of (9) is that it contains some linear constraints. An interesting question is whether the simplification from (24) to (9) allows us to apply a simpler or more efficient optimization algorithm. This issue already occurs for L1 loss because we can either solve problem (3) or a form similar to (24). However, the dual problem of L1-loss structured SVM contains a linear constraint, but problem (24) does not.[4] Therefore, for the L1 case, it is easy to see that the simplified form (3) should be used. However, for L2 loss, problem (24) possesses the advantage of being a bound-constrained problem. We will give some discussion about solving (9) or (24) in Section 5.5. In all remaining places we focus on problem (9) because existing implementations for the L1-loss formulation all solve the corresponding problem (3).
[4] See Proposition 5 in Tsochantaridis et al. (2005).
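To make the construction of $\delta(i,m)$ in Section 2.2 concrete, the following sketch builds $\delta(i,m)$ and checks the identity $w^T\delta(i,m) = w_{y_i}^T x_i - w_m^T x_i$ used in (23); all numeric values are made up:

```python
import numpy as np

def delta(x, y, m, k):
    """delta(i, m) from Section 2.2: x in the y-th block, -x in the
    m-th block, zeros elsewhere (and the zero vector when y == m).
    Indices are 0-based here."""
    n = len(x)
    d = np.zeros(k * n)
    if y != m:
        d[y*n:(y+1)*n] = x
        d[m*n:(m+1)*n] = -x
    return d

# Check w^T delta(i,m) = w_y^T x - w_m^T x on made-up values.
k, n = 3, 2
W = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]])  # rows w_1..w_k
w = W.reshape(-1)                                    # stacked w = [w_1; ...; w_k]
x, y, m = np.array([0.4, -0.2]), 0, 2
lhs = w @ delta(x, y, m, k)
rhs = W[y] @ x - W[m] @ x
```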
3 Decomposition Method and Sub-problem

Decomposition methods are currently the major approach for solving the dual problem (3) of the L1 case (Crammer and Singer, 2002; Keerthi et al., 2008). At each iteration, the $k$ variables $\alpha_i^1,\ldots,\alpha_i^k$ associated with an instance $x_i$ are selected for updating, while the other variables are fixed. For (3), the following sub-problem is solved:

    $\min_{\alpha_i^1,\ldots,\alpha_i^k} \; \sum_{m=1}^{k}\Big(\frac{1}{2}A(\alpha_i^m)^2 + B_i^m\alpha_i^m\Big)$
    subject to $\; \sum_{m=1}^{k}\alpha_i^m = 0,$    (26)
    $\alpha_i^m \le C_{y_i}^m, \quad m = 1,\ldots,k,$

where

    $A = K_{i,i}$ and $B_i^m = \sum_{j=1}^{l} K_{j,i}\,\bar\alpha_j^m + e_i^m - A\bar\alpha_i^m.$    (27)

In (27), $\bar\alpha$ is the solution obtained in the previous iteration. We defer the discussion on the selection of the index $i$ to Section 5. For problem (9), we show that the sub-problem is

    $\min_{\alpha_i^1,\ldots,\alpha_i^k} \; \sum_{m=1}^{k}\Big(\frac{1}{2}A(\alpha_i^m)^2 + B_i^m\alpha_i^m\Big) + \frac{(\alpha_i^{y_i})^2}{4C}$
    subject to $\; \sum_{m=1}^{k}\alpha_i^m = 0,$    (28)
    $\alpha_i^m \le 0, \quad m = 1,\ldots,k,\; m \ne y_i,$

where $A$ and $B_i^m$ are the same as in (27). The derivation of (28) is as follows. Because all elements except $\alpha_i^1,\ldots,\alpha_i^k$ are fixed, the objective function of (9) becomes

    $\frac{1}{2}\sum_{m=1}^{k}\Big(K_{i,i}(\alpha_i^m)^2 + 2\sum_{j:\,j \ne i} K_{i,j}\bar\alpha_j^m\alpha_i^m\Big) + \sum_{m=1}^{k}\alpha_i^m e_i^m + \frac{(\alpha_i^{y_i})^2}{4C} + \text{constants}$
    $= \sum_{m=1}^{k}\Big(\frac{1}{2}K_{i,i}(\alpha_i^m)^2 + \big(\sum_{j=1}^{l} K_{j,i}\bar\alpha_j^m + e_i^m - K_{i,i}\bar\alpha_i^m\big)\alpha_i^m\Big) + \frac{(\alpha_i^{y_i})^2}{4C} + \text{constants}.$    (29)

Equation (29) then leads to the objective function of (28), while the constraints are directly obtained from (9). Note that

    $B_i^m = w_m^T x_i + e_i^m - K_{i,i}\bar\alpha_i^m$    (30)
if

    $w_m = \sum_{i=1}^{l}\bar\alpha_i^m x_i, \quad m = 1,\ldots,k,$

are maintained.[5]

4 Solving the Sub-problem

We discuss how to solve the sub-problem when $A > 0$. If $A = 0$, then $x_i = 0$. Thus this instance gives a constant value $\xi_i = 1$ to the primal objective function (8), and the values of $\alpha_i^m$, $m = 1,\ldots,k$, have no effect on $w_m$ defined in (16), so we can skip solving the sub-problem. We follow the approach by Crammer and Singer to solve the sub-problem, although there are some interesting differences. Their method first computes

    $D_i^m = B_i^m + A C_{y_i}^m, \quad m = 1,\ldots,k.$

Then it starts with a set $\Phi = \emptyset$ and sequentially adds one index $m$ to $\Phi$ in decreasing order of $D_i^m$ until the following inequality is satisfied:

    $\beta = \frac{-AC + \sum_{m \in \Phi} D_i^m}{|\Phi|} \ge \max_{m \notin \Phi} D_i^m.$    (31)

The optimal solution of (26) is then computed by

    $\alpha_i^m = \min\Big(C_{y_i}^m, \frac{\beta - B_i^m}{A}\Big), \quad m = 1,\ldots,k.$    (32)

Crammer and Singer gave a lengthy proof to show the correctness of this method. Our contribution here is to derive the algorithm and prove its correctness by easily analyzing the KKT optimality conditions. We now derive an algorithm for solving (28). Let us define

    $A_y \equiv A + \frac{1}{2C}.$

The KKT conditions of (28) indicate that there are scalars $\beta$, $\rho^m$, $m = 1,\ldots,k$, such that

    $\sum_{m=1}^{k}\alpha_i^m = 0,$    (33)
    $\alpha_i^m \le 0, \quad m \ne y_i,$    (34)
    $\rho^m\alpha_i^m = 0,\; \rho^m \ge 0, \quad m \ne y_i,$    (35)
    $A\alpha_i^m + B_i^m - \beta = -\rho^m, \quad m \ne y_i,$    (36)
    $A_y\alpha_i^{y_i} + B_i^{y_i} - \beta = 0.$    (37)

[5] See details of solving linear Crammer and Singer's multi-class SVM in Keerthi et al. (2008).
Using (34), Equations (35) and (36) are equivalent to

    $A\alpha_i^m + B_i^m - \beta = 0, \quad \text{if } \alpha_i^m < 0,\; m \ne y_i,$    (38)
    $A\alpha_i^m + B_i^m - \beta = B_i^m - \beta \le 0, \quad \text{if } \alpha_i^m = 0,\; m \ne y_i.$    (39)

Now the KKT conditions become (33)-(34), (37), and (38)-(39). If $\beta$ is known, we prove that

    $\alpha_i^m = \begin{cases} \min\big(0, \frac{\beta - B_i^m}{A}\big) & \text{if } m \ne y_i,\\ \frac{\beta - B_i^{y_i}}{A_y} & \text{if } m = y_i, \end{cases}$    (40)

satisfies all KKT conditions except (33). Clearly, the way to get $\alpha_i^m$ in (40) ensures $\alpha_i^m \le 0$, $m \ne y_i$, so (34) holds. From (40), when $\beta < B_i^m$, we have $\alpha_i^m < 0$ and $\beta - B_i^m = A\alpha_i^m$; thus (38) is satisfied. Otherwise, $\beta \ge B_i^m$ and $\alpha_i^m = 0$ satisfy (39). Also notice that $\alpha_i^{y_i}$ is directly obtained from (37). The remaining task is to find $\beta$ such that (33) holds. From (33), (37), and (38) we obtain

    $\sum_{m:\,\alpha_i^m < 0} A_y(\beta - B_i^m) + A(\beta - B_i^{y_i}) = 0.$

Hence,

    $\beta = \frac{A B_i^{y_i} + A_y\sum_{m:\,\alpha_i^m < 0} B_i^m}{A + A_y\,|\{m \mid \alpha_i^m < 0\}|} = \frac{\frac{A}{A_y}B_i^{y_i} + \sum_{m:\,\alpha_i^m < 0} B_i^m}{\frac{A}{A_y} + |\{m \mid \alpha_i^m < 0\}|}.$    (41)

Combining (41) and (39), we begin with a set $\Phi = \emptyset$, and then sequentially add one index $m$ to $\Phi$ in decreasing order of $B_i^m$, $m = 1,\ldots,k$, $m \ne y_i$, until

    $h \equiv \frac{\frac{A}{A_y}B_i^{y_i} + \sum_{m \in \Phi} B_i^m}{\frac{A}{A_y} + |\Phi|} \ge \max_{m \notin \Phi} B_i^m.$    (42)

Let $\beta = h$ when (42) is satisfied. Algorithm 1 lists the details for solving the sub-problem (28). To prove (33), it suffices to show that $\beta$ and $\alpha_i^m$, $\forall m$, obtained by Algorithm 1 satisfy (41). This is equivalent to showing that the set $\Phi$ of indices included in step 5 of Algorithm 1 satisfies $\Phi = \{m \mid \alpha_i^m < 0\}$. From (40), we prove the following equivalent result:

    $\beta < B_i^m,\; \forall m \in \Phi \quad \text{and} \quad \beta \ge B_i^m,\; \forall m \notin \Phi.$    (43)

The second inequality immediately follows from (42). For the first, assume $t$ is the last element added to $\Phi$. When $t$ is considered, (42) is not satisfied yet, so

    $\frac{\frac{A}{A_y}B_i^{y_i} + \sum_{m \in \Phi\setminus\{t\}} B_i^m}{\frac{A}{A_y} + |\Phi| - 1} < B_i^t.$    (44)
ALGORITHM 1: Solving the sub-problem
1. Given $A$, $A_y$, and $B = \{B_i^1,\ldots,B_i^k\}$.
2. $D \leftarrow B$
3. Swap $D_1$ and $D_{y_i}$, then sort $D \setminus \{D_1\}$ in decreasing order.
4. $r \leftarrow 2$, $\beta \leftarrow D_1 \cdot A/A_y$
5. While $r \le k$ and $\beta/(r - 2 + A/A_y) < D_r$:
   5.1. $\beta \leftarrow \beta + D_r$
   5.2. $r \leftarrow r + 1$
6. $\beta \leftarrow \beta/(r - 2 + A/A_y)$
7. $\alpha_i^m \leftarrow \min\big(0, (\beta - B_i^m)/A\big)$, $\forall m \ne y_i$
8. $\alpha_i^{y_i} \leftarrow (\beta - B_i^{y_i})/A_y$

Using (44) and the fact that elements in $\Phi$ are added in decreasing order of $B_i^m$,

    $\frac{A}{A_y}B_i^{y_i} + \sum_{m \in \Phi} B_i^m = \frac{A}{A_y}B_i^{y_i} + \sum_{m \in \Phi\setminus\{t\}} B_i^m + B_i^t < \Big(\frac{A}{A_y} + |\Phi| - 1\Big)B_i^t + B_i^t = \Big(\frac{A}{A_y} + |\Phi|\Big)B_i^t \le \Big(\frac{A}{A_y} + |\Phi|\Big)B_i^s, \quad \forall s \in \Phi.$

Thus, we have the first inequality in (43). With all KKT conditions satisfied, Algorithm 1 obtains an optimal solution of (28). By comparing (31), (32) with (42), (40), respectively, we can see that the procedures for L1 loss and L2 loss are similar but differ in several aspects. In particular, because $\alpha_i^{y_i}$ is unconstrained, $B_i^{y_i}$ is treated differently from the other $B_i^m$'s in (42).

5 Other Issues and Extensions

In this section, we discuss other details of the decomposition method. Some of them are similar to those for the L1 case. We also extend problems (8)-(9) to more general settings. In the end, we discuss advantages and disadvantages of solving the two dual forms (9) and (24).

5.1 Extension to Use Kernels

It is straightforward to extend our algorithm to use kernels. The only change is to replace

    $K_{i,j} = x_i^T x_j$
in (6) with

    $K_{i,j} = \phi(x_i)^T\phi(x_j),$    (45)

where $\phi(x)$ is a function mapping data to a higher dimensional space.

5.2 Working Set Selection

We mentioned in Section 3 that at each iteration of the decomposition method, an index $i$ is selected so that $\alpha_i^1,\ldots,\alpha_i^k$ are updated. This procedure is called working set selection. If kernels are not used, we follow Keerthi et al. (2008) to sequentially select $i \in \{1,\ldots,l\}$.[6] For linear SVM, it is known that more sophisticated selections, such as using gradient information, may not be cost-effective; see the detailed discussion in Section 4.1 of Hsieh et al. (2008). For kernel SVM, we can use gradient information for working set selection because the cost is relatively low compared to that of kernel evaluations. In Crammer and Singer (2001), to solve problems with L1 loss, they select an index by

    $i = \arg\max_{i \in \{1,\ldots,l\}} \varphi_i,$    (46)

where

    $\varphi_i = \max_{1 \le m \le k} \hat g_i^m - \min_{m:\,\alpha_i^m < C_{y_i}^m} \hat g_i^m,$    (47)

and $\hat g_i^m$, $i = 1,\ldots,l$, $m = 1,\ldots,k$, are the gradient of (3)'s objective function. The reason behind (46) is that $\varphi_i$ shows the violation of the optimality condition. Note that for problem (3), $\alpha$ is optimal if and only if $\alpha$ is feasible and

    $\varphi_i = 0, \quad i = 1,\ldots,l.$    (48)

See the derivation in Crammer and Singer (2001, Section 5). For L2 loss, we can apply a similar setting by

    $\varphi_i = \max_{1 \le m \le k} g_i^m - \min_{m:\,\alpha_i^m < 0 \text{ or } m = y_i} g_i^m, \quad i = 1,\ldots,l,$

where

    $g_i^m = \sum_{j=1}^{l} K_{i,j}\alpha_j^m + e_i^m + (1-e_i^m)\frac{\alpha_i^m}{2C}, \quad i = 1,\ldots,l,\; m = 1,\ldots,k,$

are the gradient of the objective function in (9). Note that $C_{y_i}^m$ in (47) becomes 0 here.

5.3 Stopping Condition

From (48), a stopping condition of the decomposition method can be

    $\max_i \varphi_i \le \epsilon,$

where $\epsilon$ is the stopping tolerance. The same stopping condition can be used for the L2 case.
[6] In practice, for faster convergence, at each cycle of $l$ steps, they sequentially select indices from a permutation of $\{1,\ldots,l\}$.
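Putting Sections 3 and 4 together, the $\beta$-search of Algorithm 1 can be sketched in code. This is only an illustrative implementation under the assumption $A > 0$, with 0-based class indices; the function name and the tiny test instance are ours, not from the paper:

```python
import numpy as np

def solve_subproblem_l2(A, B, y, C):
    """Sketch of Algorithm 1 (Section 4): solve sub-problem (28) for one
    instance, assuming A = K_ii > 0.  B is the length-k list of B_i^m
    values (0-based class indices), y the 0-based true label of x_i.

    Returns alpha with sum(alpha) = 0 and alpha[m] <= 0 for m != y.
    """
    k = len(B)
    Ay = A + 1.0 / (2.0 * C)          # A_y = A + 1/(2C)
    ratio = A / Ay
    # Candidates B^m, m != y, in decreasing order (step 3 of Algorithm 1).
    others = sorted((B[m] for m in range(k) if m != y), reverse=True)
    beta = B[y] * ratio               # running numerator of (42)
    size = 0                          # |Phi|
    # Grow Phi until h >= max_{m not in Phi} B^m (condition (42)).
    while size < k - 1 and beta / (size + ratio) < others[size]:
        beta += others[size]
        size += 1
    beta /= (size + ratio)
    # Recover alpha from beta via (40) and (37).
    alpha = np.minimum(0.0, (beta - np.asarray(B, dtype=float)) / A)
    alpha[y] = (beta - B[y]) / Ay
    return alpha

# Tiny illustrative instance (values made up): k = 3, true label y = 0.
alpha = solve_subproblem_l2(A=1.0, B=[0.5, -1.0, 2.0], y=0, C=1.0)
```

On this instance the solver returns alpha = [0.6, 0.0, -0.6]: only the class with the largest $B_i^m$ enters $\Phi$, and the sum-to-zero constraint (33) holds exactly.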
5.4 Extension to Assign Different Regularization Parameters to Each Class

In some applications, we may want to assign a different regularization parameter to each class. This can be easily achieved by replacing $C$ in the earlier discussion with the per-class parameter $C_{y_i}$.

5.5 Solving Problem (9) Versus Problem (24)

In Section 2.2, we mentioned the issue of solving problem (9) or problem (24). Based on the investigation of decomposition methods so far, we give some brief discussion. Some works for structured SVM have solved the dual problem, of which (24) is a special case. For example, in Chang et al. (2010), a dual coordinate descent method is applied to solve the dual problem of L2-loss structured SVM. Because (24) does not contain any linear constraints, they are able to update a single $\hat\alpha_i^m$ at a time.[7] This setting is related to the decomposition method discussed in Section 3, although ours updates $k$ variables at a time. If $\hat\alpha_i^m$ is selected for update, the computational bottleneck is calculating

    $w^T\delta(i,m) = w_{y_i}^T x_i - w_m^T x_i$    (49)

for constructing a one-variable sub-problem.[8] From (11), Equation (49) involves

    $\sum_{j=1}^{l}\hat\alpha_j^m K_{j,i}$ and $\sum_{j=1}^{l}\hat\alpha_j^{y_j} K_{j,i}.$    (50)

The cost of $2l$ kernel evaluations is $O(ln)$ if each kernel evaluation takes $O(n)$. For our decomposition method to solve (9), updating the $k$ variables $\alpha_i^m$, $m = 1,\ldots,k$, together requires only $l$ kernel evaluations; see Equations (27) and (29). More precisely, the complexity of Algorithm 1 to solve the sub-problem (28) is

    $O(k\log k + ln + kl),$    (51)

where $O(k\log k)$ is for sorting $B_i^m$, $m \ne y_i$, and $O(kl)$ is for obtaining $B_i^m$, $m = 1,\ldots,k$, in Equation (27). If $k$ is not large, $O(ln)$ is the dominant term in (51). This analysis indicates that regardless of how many elements in $\alpha_i^m$, $m = 1,\ldots,k$, are updated, we always need to calculate the $i$-th kernel column $K_{j,i}$, $j = 1,\ldots,l$. In this regard, the decomposition method for problem (9), by solving a sequence of sub-problems (28), nicely allows us to update as many variables as possible under a similar number of kernel evaluations.
If kernels are not applied, interestingly the situation becomes different. The $O(ln)$ cost of computing (50) is reduced to $O(n)$ because $w_{y_i}$ and $w_m$ are available. If
[7] This is not possible for the dual problem of L1-loss structured SVM. We have mentioned in Section 2.2 that it contains a linear constraint.
[8] We omit details because the derivation is similar to that for deriving the sub-problem (28).
Algorithm 1 is used, from (30), the complexity in (51) for updating $k$ variables becomes $O(k\log k + kn)$. For updating a single $\hat\alpha_i^m$ by (49), the cost is $O(n)$. Therefore, if $\log k < n$, the cost of updating $\alpha_i^m$, $m = 1,\ldots,k$, together is about $k$ times that of updating a single variable. Then, the decomposition method for solving problem (9) via sub-problem (28) may not be better than a coordinate descent method for solving problem (24). Note that we have focused on the cost per sub-problem, but there are many other issues, such as the convergence speed (i.e., the number of iterations). Memory access also affects the computational time. For the coordinate descent method to update a variable $\hat\alpha_i^m$, the corresponding $w_m$, $x_i$, and $\hat\alpha_i^m$ must be accessed. In contrast, the approach of solving sub-problem (28) accesses data and variables more systematically. An important future work is to conduct a serious comparison and identify the better approach.

6 Experiments

In this section, we compare the proposed method for L2 loss with an existing implementation for L1 loss. We check linear as well as kernel multi-class SVMs. Moreover, a comparison of sensitivity to parameters is also conducted. Our implementation is extended from those in LIBLINEAR (Fan et al., 2008) and BSVM (Hsu and Lin, 2002), which respectively include solvers for linear and kernel L1-loss Crammer and Singer multi-class SVM. Programs for the experiments in this paper (codes.zip) and all data sets used are available online.

6.1 Linear Multi-class SVM

We check both training time and test accuracy of using L1 and L2 losses. We consider the four data sets used in Keerthi et al. (2008): news20, MNIST, sector and rcv1. We select the regularization parameter $C$ by checking five-fold cross-validation (CV) accuracy over the values $\{2^{-5}, 2^{-4}, \ldots, 2^{5}\}$. The stopping tolerance is $\epsilon = 0.1$. The details of the data sets are listed in Table 1, and the experiment results can be found in Table 2. The accuracy values are comparable. One may observe that the training time of using L1 loss is less.
This result is opposite to that for binary classification; see the experiments in Hsieh et al. (2008). In binary classification, when $C$ approaches zero, the Hessian matrix of L2-loss SVM is close to the matrix $I/(2C)$, where $I$ is the identity matrix. Thus, the optimization problem is easier to solve. However, for Crammer and Singer's multi-class SVM, when $C$ approaches zero, only $l$ of the Hessian's $kl$ diagonal elements become close to $1/(2C)$. This may be the reason why, for multi-class SVM, using L2 loss does not lead to faster training time.
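The Hessian structure invoked above can be checked numerically. In the ordering $\alpha = [\alpha_1^1,\ldots,\alpha_1^k,\ldots,\alpha_l^1,\ldots,\alpha_l^k]$, the Hessian of $f(\alpha)$ in (9) is $K \otimes I_k$ plus $1/(2C)$ on the $l$ diagonal positions with $m = y_i$. The sketch below (with made-up $K$, $y$, and $C$) counts how many diagonal entries blow up as $C \to 0$:

```python
import numpy as np

def dual_hessian(K, y, k, C):
    """Hessian of the L2-loss dual objective f(alpha) in (9), with alpha
    ordered as [alpha_1^1..alpha_1^k, ..., alpha_l^1..alpha_l^k]:
    H = kron(K, I_k) plus 1/(2C) on the l diagonal entries with m = y_i."""
    H = np.kron(K, np.eye(k))
    for i, yi in enumerate(y):
        H[i * k + yi, i * k + yi] += 1.0 / (2.0 * C)
    return H

# Made-up example: l = 2 instances, k = 3 classes, small C.
K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
y = [0, 2]                      # 0-based labels
C = 1e-3                        # 1/(2C) = 500
H = dual_hessian(K, y, k=3, C=C)
# Only l = 2 of the k*l = 6 diagonal entries pick up the large 1/(2C) term.
large = int(np.sum(np.diag(H) > 100.0))
```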
Table 1: Data sets for experiments of linear multi-class SVMs. n is the number of features and k is the number of classes. The data sets are news20, MNIST, sector, and rcv1. [The numeric entries of this table were lost in extraction and are not reproduced here.]

Table 2: Linear multi-class SVMs: we compare training time (in seconds) and test accuracy between L1 loss and L2 loss on news20, MNIST, sector, and rcv1. [The numeric entries of this table were lost in extraction and are not reproduced here.]

6.2 Kernel Multi-class SVM

We use the same data sets and the same procedure as in Hsu and Lin (2002) to compare test accuracy, training time, and sparsity (i.e., the percentage of training data that are support vectors) of using L1 and L2 losses. We use the RBF kernel

    $K(x_i, x_j) = e^{-\gamma\|x_i - x_j\|^2}.$

We fix the cache size for the kernel matrix at 2048 MB. The stopping tolerance is $\epsilon = 0.1$ for letter and shuttle (to avoid lengthy training time) and a smaller value for all other data sets. The data set description is in Table 3 and the results are listed in Table 4. For dna, satimage, letter, and shuttle, both training and test sets are available. We follow Hsu and Lin (2002) to split the training data into 70% training and 30% validation for finding parameters among $C \in \{2^{-2}, 2^{-1}, \ldots, 2^{12}\}$ and $\gamma \in \{2^{-10}, 2^{-9}, \ldots, 2^{4}\}$. We then train on the whole training set with the best parameters and report the test accuracy and the model sparsity. For the remaining data sets, whose test sets are not available, we report the best ten-fold CV accuracy and the model sparsity.[9] From Table 4, we can see that L2-loss multi-class SVM gives comparable accuracy to L1-loss SVM. Note that the accuracy and the parameters of L1-loss multi-class SVM on some data sets are slightly different from those in Hsu and Lin (2002) because of the random data segmentation in the validation procedure and the different versions of the BSVM code. Training time and sparsity are very different between using L1 and L2 losses because they highly depend on the parameters used.
To remove the effect of different parameters, in Section 6.3 we present the average result over a set of parameters.
[9] The sparsity is the average over the 10 models in the CV procedure.
Table 3: Data sets for experiments of kernel multi-class SVMs. n is the number of features and k is the number of classes. The data sets are iris, wine, glass, vowel, vehicle, segment, dna, satimage, letter, and shuttle. [The numeric entries and selected (C, γ) values were lost in extraction and are not reproduced here.] *: $\epsilon = 0.1$ is used.

6.3 Sensitivity to Parameters

Parameter selection is a time-consuming process. To avoid checking many parameters, we hope a method is not sensitive to its parameters. In this section, we compare the sensitivity of L1 and L2 losses by presenting the average performance over a set of parameters. For linear multi-class SVM, 11 values of $C$ are selected, $\{2^{-5}, 2^{-4}, \ldots, 2^{5}\}$, and we present the average and the standard deviation of training time and test accuracy. The results are in Table 5. For the kernel case, we pick $C$ and $\gamma$ from the two sets $\{2^{-1}, 2^{2}, 2^{5}, 2^{8}\}$ and $\{2^{-6}, 2^{-3}, 2^{0}, 2^{3}\}$, respectively, so 16 different results are generated.[10] We then report averages and standard deviations in Table 6. From Tables 5 and 6, L2 loss is worse than L1 loss in average training time and sparsity. The higher percentage of support vectors is the same as the situation in binary classification, because the squared hinge loss leads to many small but nonzero $\alpha_i^m$. Interestingly, the average performance (test or CV accuracy) of L2 loss is better. Therefore, when using L2 loss, it may be easier to locate a good parameter setting. We find that the same situation occurs in binary classification, although this result was not clearly mentioned in previous studies. An investigation shows that L2 loss gives better accuracy when $C$ is small. In this situation, both L1- and L2-loss SVM suffer from underfitting of the training data. However, because L2 loss gives a higher penalty than L1 loss, the underfitting is less severe.
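The claim that L2 loss penalizes violations more heavily than L1 loss is easy to see numerically: the two losses agree at $\xi = 1$, the squared hinge grows faster for $\xi > 1$, and it is smaller for $0 < \xi < 1$. A trivial sketch:

```python
import numpy as np

def l1_loss(xi):
    """Hinge loss max(0, xi), as in problem (1)."""
    return np.maximum(0.0, xi)

def l2_loss(xi):
    """Squared hinge loss max(0, xi)^2, as in problem (8)."""
    return np.maximum(0.0, xi) ** 2

# Illustrative violation values.
xi = np.array([-0.5, 0.5, 1.0, 2.0])
```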
6.4 Summary of Experiments

Based on the experiments, we have the following findings.
[10] We use a subset of the (C, γ) values in Section 6.2 to save running time. To report the average training time, we must run all jobs on the same machine. In contrast, several machines were used in Section 6.2 to obtain the CV accuracy of all parameters.
Table 4: Kernel multi-class SVMs: we compare training time (in seconds), test or CV accuracy, and sparsity between L1 loss and L2 loss. nSV represents the percentage of training data that are support vectors. [The numeric entries of this table were lost in extraction and are not reproduced here.] *: $\epsilon = 0.1$ is used.

Table 5: Sensitivity to parameters: linear multi-class SVMs. We present average ± standard deviation. [The numeric entries of this table were lost in extraction and are not reproduced here.]

1. If using the best parameters, L2 loss gives comparable accuracy to L1 loss. For the training time and the number of support vectors, L2 loss is better for some problems, but worse for others. The situation highly depends on the chosen parameters.
2. If we take the whole procedure of parameter selection into consideration, L2 loss is worse than L1 loss in training time and sparsity. However, the region of suitable parameters is larger. Therefore, we can check fewer parameters when using L2 loss.

7 Conclusions

This paper extends Crammer and Singer's multi-class SVM to apply L2 loss. We give detailed derivations and discuss some interesting differences from the L1 case. Our results serve as a useful reference for those who intend to use Crammer and Singer's method with L2 loss. Finally, we have extended the software BSVM (after
Table 6: Sensitivity to parameters: kernel multi-class SVMs. We present average ± standard deviation. nSV represents the percentage of training data that are support vectors. [The numeric entries of this table were lost in extraction and are not reproduced here.] *: $\epsilon = 0.1$ is used.

version 2.07) to include the proposed implementation.

Acknowledgment

This work was supported in part by the National Science Council of Taiwan via the grant E MY3. The authors thank the anonymous reviewers and Ming-Wei Chang for valuable comments. We also thank Yong Zhuang and Wei-Sheng Chin for their help in finding errors in this paper.

References

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM Press, 1992.

Ming-Wei Chang, Vivek Srikumar, Dan Goldwasser, and Dan Roth. Structured
output learning with indirect supervision. In Proceedings of the Twenty-Seventh International Conference on Machine Learning (ICML), 2010.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20, 1995.

Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 2001.

Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, (2-3), 2002.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008.

Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), 2008.

Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13(2), 2002.

S. Sathiya Keerthi, Sellamanickam Sundararajan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A sequential dual method for large scale multi-class linear SVMs. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 2005.

A Solving the Sub-problems when A ≤ 0

Our decomposition method only solves the sub-problem when $A > 0$. To cover the case when $K$ is not a valid kernel and $K_{i,i}$ can be any possible value, we still need to solve the sub-problems when $A \le 0$.
A.1 A = 0

When $A = 0$, for L1 loss, the sub-problem (26) reduces to a linear programming problem. Define

    $\bar m \equiv \arg\max_{m:\,m \ne y_i} B_i^m.$

Then the optimal solution is

    $\alpha_i^m = 0,\; m = 1,\ldots,k,$ if $B_i^{y_i} - B_i^{\bar m} \ge 0$;
    $\alpha_i^{y_i} = C,\; \alpha_i^{\bar m} = -\alpha_i^{y_i},\; \alpha_i^m = 0,\; \forall m \ne y_i,\, m \ne \bar m,$ if $B_i^{y_i} - B_i^{\bar m} < 0.$

The L2-loss case is more complicated because there is a quadratic term of $\alpha_i^{y_i}$. To solve the sub-problem (28), we reformulate it by the following procedure. From footnote 2, we know $\alpha_i^{y_i} \ge 0$. For any fixed $\alpha_i^{y_i}$, the sub-problem becomes

    $\min_{\alpha_i^m,\,m \ne y_i} \; \sum_{m:\,m \ne y_i} B_i^m\alpha_i^m$
    subject to $\; \sum_{m:\,m \ne y_i}\alpha_i^m = -\alpha_i^{y_i}, \quad \alpha_i^m \le 0,\; m \ne y_i.$

Clearly, the solution is

    $\alpha_i^m = \begin{cases} -\alpha_i^{y_i} & \text{if } m = \bar m,\\ 0 & \text{otherwise.} \end{cases}$    (52)

Therefore, the sub-problem (28) is reduced to the following one-variable problem:

    $\min_{\alpha_i^{y_i} \ge 0} \; \frac{(\alpha_i^{y_i})^2}{4C} + (B_i^{y_i} - B_i^{\bar m})\alpha_i^{y_i}.$    (53)

The solution of (53) is

    $\alpha_i^{y_i} = \max\big(0,\; 2(B_i^{\bar m} - B_i^{y_i})C\big).$    (54)

Using (52) and (54), the optimal solution can be written as

    $\alpha_i^m = 0,\; m = 1,\ldots,k,$ if $B_i^{y_i} - B_i^{\bar m} \ge 0$;
    $\alpha_i^{y_i} = 2(B_i^{\bar m} - B_i^{y_i})C,\; \alpha_i^{\bar m} = -\alpha_i^{y_i},\; \alpha_i^m = 0,\; \forall m \ne y_i,\, m \ne \bar m,$ if $B_i^{y_i} - B_i^{\bar m} < 0.$
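The closed-form solution (52)-(54) for the L2-loss sub-problem with $A = 0$ is short enough to state in code; a sketch with 0-based indices (the function name is ours):

```python
import numpy as np

def solve_A0_l2(B, y, C):
    """Closed-form solution of sub-problem (28) when A = 0 (Appendix A.1).
    B: length-k list of B_i^m values; y: 0-based true label."""
    k = len(B)
    # mbar = arg max over m != y of B^m.
    mbar = max((m for m in range(k) if m != y), key=lambda m: B[m])
    alpha = np.zeros(k)
    gap = B[y] - B[mbar]
    if gap < 0:                               # otherwise alpha stays all zero
        alpha[y] = -2.0 * gap * C             # (54): 2 (B^mbar - B^y) C
        alpha[mbar] = -alpha[y]               # (52): mass concentrates on mbar
    return alpha
```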
A.2 A < 0

For any given α_{y_i} that satisfies its corresponding constraints, both (26) and (28) are, in the remaining variables, equivalent to
\[
\min_{\alpha_m, \, m \ne y_i} \; \frac{A}{2} \sum_{m \ne y_i} \Bigl(\alpha_m - \frac{B_m}{A}\Bigr)^2
\quad \text{subject to} \quad \sum_{m \ne y_i} \alpha_m = \alpha_{y_i}, \quad \alpha_m \ge 0, \; m \ne y_i .
\]
When A < 0, it is equivalent to
\[
\max_{\alpha_m, \, m \ne y_i} \; \sum_{m \ne y_i} (\alpha_m)^2 - 2 \sum_{m \ne y_i} \frac{B_m}{A} \alpha_m
\]
subject to
\[
\sum_{m \ne y_i} \alpha_m = \alpha_{y_i}, \tag{55}
\]
\[
\alpha_m \ge 0, \; m \ne y_i . \tag{56}
\]
By constraints (55) and (56), we have
\[
\sum_{m \ne y_i} (\alpha_m)^2 \le \Bigl(\sum_{m \ne y_i} \alpha_m\Bigr)^2 = (\alpha_{y_i})^2
\]
and
\[
\sum_{m \ne y_i} \frac{B_m}{A} \alpha_m \ge \Bigl(\min_{m \ne y_i} \frac{B_m}{A}\Bigr) \sum_{m \ne y_i} \alpha_m = \frac{B_{\hat{m}}}{A} \alpha_{y_i} .
\]
Note that when A < 0,
\[
\hat{m} = \arg\max_{m \ne y_i} B_m = \arg\min_{m \ne y_i} \frac{B_m}{A} .
\]
Both bounds are attained simultaneously by putting all the mass on m̂, so clearly the optimal solution is
\[
\alpha_m = \begin{cases} \alpha_{y_i} & \text{if } m = \hat{m}, \\ 0 & \text{otherwise}. \end{cases}
\]
Sub-problem (26) is then reduced to the following one-variable problem:
\[
\min_{0 \le \alpha_{y_i} \le C} \; A(\alpha_{y_i})^2 + (B_{y_i} - B_{\hat{m}}) \alpha_{y_i}
= A \Bigl(\alpha_{y_i} + \frac{B_{y_i} - B_{\hat{m}}}{2A}\Bigr)^2 + \text{constants}.
\]
Because A < 0, the objective is concave, so the minimum is attained at an endpoint of [0, C]. Its solution is
\[
\alpha_{y_i} = \begin{cases} 0 & \text{if } B_{y_i} - B_{\hat{m}} \ge -AC, \\ C & \text{otherwise}. \end{cases}
\]
Combining them together, the optimal solution of (26) when A < 0 is
\[
\alpha_m = 0, \; m = 1, \dots, k, \quad \text{if } B_{y_i} - B_{\hat{m}} \ge -AC,
\]
\[
\alpha_{y_i} = \alpha_{\hat{m}} = C, \quad \alpha_m = 0, \; m \ne y_i \text{ and } m \ne \hat{m}, \quad \text{if } B_{y_i} - B_{\hat{m}} < -AC. \tag{57}
\]
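The endpoint argument behind (57) can be checked numerically. This small experiment (an illustration, not part of the paper) minimizes the concave one-variable objective on a dense grid over [0, C] and confirms that the endpoint selected by the rule in (57) matches the grid minimum:

```python
import numpy as np

# Check: for A < 0, g(t) = A t^2 + d t with d = B_y - B_mhat, minimized
# over [0, C], attains its minimum at the endpoint chosen by (57):
# t = C iff d < -A*C, and t = 0 otherwise.
rng = np.random.default_rng(0)
for _ in range(1000):
    A = -rng.uniform(0.1, 5.0)         # A < 0
    d = rng.uniform(-5.0, 5.0)         # d = B_y - B_mhat
    C = rng.uniform(0.1, 5.0)
    g = lambda t: A * t * t + d * t
    grid = np.linspace(0.0, C, 1001)   # endpoints 0 and C lie on the grid
    best_on_grid = g(grid).min()
    t_rule = C if d < -A * C else 0.0  # endpoint predicted by (57)
    # the rule's endpoint must be at least as good as every grid point
    assert g(t_rule) <= best_on_grid + 1e-9
print("endpoint rule (57) verified on 1000 random instances")
```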
When A = 0, −AC = 0. Therefore, (57) can be used for L1 loss whenever A ≤ 0.

For problem (28), when A < 0, it is reduced to another one-variable problem:
\[
\min_{\alpha_{y_i} \ge 0} \; \bar{A}(\alpha_{y_i})^2 + (B_{y_i} - B_{\hat{m}}) \alpha_{y_i}, \tag{58}
\]
where
\[
\bar{A} \equiv A + \frac{1}{4C} .
\]
If \bar{A} = 0, then (58) reduces to a trivial linear problem with optimal solution
\[
\alpha_{y_i} = \begin{cases} 0 & \text{if } B_{y_i} - B_{\hat{m}} \ge 0, \\ \infty & \text{if } B_{y_i} - B_{\hat{m}} < 0, \end{cases}
\]
where α_{y_i} = ∞ indicates that the problem is unbounded below. Thus the optimal solution of (28) when \bar{A} = 0 is
\[
\alpha_m = 0, \; m = 1, \dots, k, \quad \text{if } B_{y_i} - B_{\hat{m}} \ge 0,
\]
\[
\alpha_{y_i} = \alpha_{\hat{m}} = \infty, \quad \alpha_m = 0, \; m \ne y_i \text{ and } m \ne \hat{m}, \quad \text{if } B_{y_i} - B_{\hat{m}} < 0.
\]
If \bar{A} ≠ 0, (58) is equivalent to
\[
\min_{\alpha_{y_i} \ge 0} \; \bar{A} \Bigl(\alpha_{y_i} + \frac{B_{y_i} - B_{\hat{m}}}{2\bar{A}}\Bigr)^2 .
\]
When \bar{A} < 0, the objective decreases without bound as α_{y_i} grows, so the optimal solution of (28) is α_{y_i} = α_{m̂} = ∞ and α_m = 0 for m ≠ y_i, m ≠ m̂. While if A < 0 but \bar{A} > 0, the optimum occurs at
\[
\alpha_m = 0, \; m = 1, \dots, k, \quad \text{if } B_{y_i} - B_{\hat{m}} \ge 0,
\]
\[
\alpha_{y_i} = \alpha_{\hat{m}} = -\frac{B_{y_i} - B_{\hat{m}}}{2\bar{A}}, \quad \alpha_m = 0, \; m \ne y_i \text{ and } m \ne \hat{m}, \quad \text{if } B_{y_i} - B_{\hat{m}} < 0. \tag{59}
\]
Note that when A = 0, 1/(2\bar{A}) = 2C. Thus (59) can be used for L2 loss whenever \bar{A} > 0 and A ≤ 0.
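Collecting the cases of this appendix, a compact solver for A ≤ 0 can be sketched as follows (a hypothetical helper, not code from the paper; the function name and the returned `bounded` flag, which signals the unbounded cases where α_{y_i} = ∞, are our own conventions):

```python
import numpy as np

def solve_subproblem_A_nonpos(B, y, C, A, loss="l2"):
    """Solve the sub-problem when A <= 0, following (57)-(59).

    Returns (alpha, bounded); bounded is False when the sub-problem is
    unbounded below, in which case the nonzero entries of alpha are np.inf.
    """
    assert A <= 0
    alpha = np.zeros(len(B))
    masked = B.astype(float).copy()
    masked[y] = -np.inf
    m_hat = int(np.argmax(masked))    # m_hat = arg max over m != y of B_m
    d = float(B[y] - B[m_hat])        # d = B_y - B_mhat
    if loss == "l1":
        # (57); when A = 0 the threshold -A*C is 0, recovering the A.1 rule
        if d < -A * C:
            alpha[y] = alpha[m_hat] = C
        return alpha, True
    # L2 loss: Abar = A + 1/(4C), the quadratic coefficient in (58)
    Abar = A + 1.0 / (4.0 * C)
    if Abar < 0 or (Abar == 0 and d < 0):
        alpha[y] = alpha[m_hat] = np.inf   # (58) is unbounded below
        return alpha, False
    if Abar > 0 and d < 0:
        alpha[y] = alpha[m_hat] = -d / (2.0 * Abar)  # (59)
    return alpha, True
```

With A = 0 this reproduces the A.1 formulas: \bar{A} = 1/(4C), so −d/(2\bar{A}) = 2C(B_m̂ − B_{y_i}).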