A Sequential Dual Method for Large Scale Multi-Class Linear SVMs


Kai-Wei Chang (Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan; b92084@csie.ntu.edu.tw)
S. Sathiya Keerthi (Yahoo! Research, Santa Clara, CA; selvarak@yahoo-inc.com)
Cho-Jui Hsieh (Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan; b92085@csie.ntu.edu.tw)
S. Sundararajan (Yahoo! Labs, Bangalore, India; ssrajan@yahoo-inc.com)
Chih-Jen Lin (Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan; cjlin@csie.ntu.edu.tw)

ABSTRACT

Efficient training of direct multi-class formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number of examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is further enhanced by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state-of-the-art solvers such as bundle, cutting plane and exponentiated gradient methods.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation

General Terms: Algorithms, Performance, Experimentation

1. INTRODUCTION

Support vector machines (SVMs) [2] are popular methods for solving classification problems. An SVM employing a nonlinear kernel maps the input x to a high dimensional feature space via a nonlinear function $\phi(x)$ and forms a linear classifier in that space to solve the problem. Inference as well as design is carried out using the kernel function $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. In several data mining applications the input vector x itself has rich features and so it is sufficient to take $\phi$ as the identity map, i.e. $\phi(x) = x$. SVMs developed in this setting are referred to as linear SVMs.

In applications such as text classification the input vector x resides in a high dimensional space (e.g. a bag-of-words representation), the training set has a large number of labeled examples, and the training data matrix is sparse, i.e. the average number of non-zero elements in one x is small. Most classification problems arising in such domains involve many classes. Efficient learning of the parameters of linear SVMs for such problems is important.

The One-Versus-Rest (OVR) method, which consists of developing, for each class, a binary classifier that separates that class from the rest of the classes, and then combining the classifiers for multi-class inference, is one of the popular schemes for solving a multi-class problem. OVR is easy and efficient to design and also gives good performance [21]. Crammer and Singer [5, 6] and Weston and Watkins [25] gave direct multi-class SVM formulations. With nonlinear kernels in mind, Crammer and Singer [5, 6] and Hsu and Lin [10] gave efficient decomposition methods for these formulations.
Recently, some good methods have been proposed for SVMs with structured outputs that also specialize nicely to Crammer-Singer multi-class linear SVMs. Teo et al. [23] suggested a bundle method and Joachims et al. [13] gave a cutting plane method that is very close to it; these methods can be viewed as extensions of the method given by Joachims [12] for binary linear SVMs. Collins et al. [4] gave the exponentiated gradient method.

In this paper we present fast dual methods for the Crammer-Singer and Weston-Watkins formulations. Like the methods of Crammer and Singer [5, 6] and Hsu and Lin [10], our methods optimize the dual variables associated with one example at a time. However, while those previous methods always choose for updating the example that violates optimality the most, our methods sequentially traverse through the examples. For the linear SVM case this leads to significant differences: our methods are extremely efficient while those methods are not. We borrow and modify the shrinking and cooling heuristics of Crammer and Singer [6] to make our methods even faster. Experiments on several datasets indicate that our method is also much faster than the cutting plane, bundle and exponentiated gradient methods. For the special case of binary classification our methods turn out to be equivalent to doing coordinate descent; we covered this method in detail in [9].

Table 1.1: Properties of datasets

Dataset   l         l_test   n        k     # nonzeros in whole data
NEWS20    15,935    3,993    62,061   20    1,593,692
SECTOR    6,412     3,207    55,197   105   1,569,4
MNIST     60,000    10,000   779      10    10,5,375
RCV1      531,742   15,913   47,236   103   36,734,621
COVER     522,911   58,101   54       7     6,972,144

Notations. The following notations are used in the paper. We use l to denote the number of training examples, l_test the number of test examples and k the number of classes. Throughout the paper, the index i will denote a training example and the index m will denote a class. Unless otherwise mentioned, i will run from 1 to l and m will run from 1 to k. $x_i \in R^n$ is the input vector (n is the number of input features) associated with the i-th example and $y_i \in \{1, \ldots, k\}$ is the corresponding target class. We will use 1 to denote a vector of all 1's whose dimension will be clear from the context. Superscript T will denote transpose for vectors.

The following multi-class datasets are used in the paper for the various experiments: NEWS20 [14], SECTOR [20, 19], MNIST [15], RCV1 [16] and COVER [13]. RCV1 is a taxonomy based multi-label dataset and so we modified it to a multi-class dataset by removing all examples which had mismatched labels in the taxonomy tree; in the process of doing this, two classes end up with no training examples. For NEWS20, MNIST and RCV1 we used the standard train-test split; for SECTOR we used the second train/test split used in [20]; and for COVER we used the train/test split used in [13]. Table 1.1 gives key properties of these datasets.

The paper is organized as follows. In section 2 we give the details of our method (we call it the Sequential Dual Method (SDM)) for the Crammer-Singer formulation and evaluate it against other methods. In section 3 we describe SDM for the Weston-Watkins formulation. In section 4 we compare these methods with the One-Versus-Rest method of solving multi-class problems. We conclude the paper in section 5. An implementation of a variant of the SDM algorithm for the Crammer-Singer formulation is available in the LIBLINEAR library (version 1.3 or higher) with the option -s 4 (http://www.csie.ntu.edu.tw/~cjlin/liblinear).

2. SDM FOR CRAMMER-SINGER FORMULATION

In [5, 6], Crammer and Singer proposed an approach for the multi-class problem by formulating the following primal problem:

$$\min_{\{w_m\},\{\xi_i\}} \ \frac{1}{2}\sum_m \|w_m\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i \ \ \forall m, \ \forall i, \qquad (1)$$

where C > 0 is the regularization parameter, $w_m$ is the weight vector associated with class m, $e_i^m = 1 - \delta_{y_i,m}$, and $\delta_{y_i,m} = 1$ if $y_i = m$, $\delta_{y_i,m} = 0$ if $y_i \ne m$. Note that in (1) the constraint for $m = y_i$ corresponds to the non-negativity constraint $\xi_i \ge 0$. The decision function is

$$\arg\max_m \ w_m^T x. \qquad (2)$$

The dual problem of (1), developed along the lines of [5, 6], involves a vector $\alpha$ having dual variables $\alpha_i^m \ \forall i, m$. The $w_m$ get defined via $\alpha$ as

$$w_m(\alpha) = \sum_i \alpha_i^m x_i. \qquad (3)$$

At most places below we simply write $w_m(\alpha)$ as $w_m$. Let $C_i^m = 0$ if $y_i \ne m$ and $C_i^m = C$ if $y_i = m$. The dual problem is

$$\min_\alpha \ f(\alpha) = \frac{1}{2}\sum_m \|w_m(\alpha)\|^2 + \sum_i \sum_m e_i^m \alpha_i^m \quad \text{s.t.} \quad \alpha_i^m \le C_i^m \ \forall m, \ \ \sum_m \alpha_i^m = 0 \quad \forall i. \qquad (4)$$

The gradient of f plays an important role and is given by

$$g_i^m = \frac{\partial f(\alpha)}{\partial \alpha_i^m} = w_m(\alpha)^T x_i + e_i^m \quad \forall i, m. \qquad (5)$$

If $\bar{n}$ is the average number of nonzero elements per training example, then on average the evaluation of each $g_i^m$ takes $O(\bar{n})$ effort. Optimality of $\alpha$ for (4) can be checked using the quantity

$$v_i = \max_m g_i^m \ - \min_{m:\, \alpha_i^m < C_i^m} g_i^m \quad \forall i. \qquad (6)$$

For a given i, the values of m that attain the max and min in (6) play a useful role and are denoted as follows:

$$M_i = \arg\max_m g_i^m \quad \text{and} \quad m_i = \arg\min_{m:\, \alpha_i^m < C_i^m} g_i^m. \qquad (7)$$

From (6) it is clear that $v_i$ is non-negative.
Dual optimality holds when

$$v_i = 0 \quad \forall i. \qquad (8)$$

For practical termination we can approximately check optimality using a tolerance parameter $\epsilon > 0$:

$$v_i < \epsilon \quad \forall i. \qquad (9)$$

We will also refer to this as $\epsilon$-optimality. The value $\epsilon = 0.1$ is a good choice for a practical implementation of SDM.

SDM consists of sequentially picking one i at a time and solving the restricted problem of optimizing only $\alpha_i^m \ \forall m$, keeping all other variables fixed. To do this, we let $\delta\alpha_i^m$ denote the additive change to be applied to the current $\alpha_i^m$, and optimize $\delta\alpha_i^m \ \forall m$. Let $\alpha_i$, $\delta\alpha_i$, $g_i$ and $C_i$ be vectors that respectively gather the elements $\alpha_i^m$, $\delta\alpha_i^m$, $g_i^m$ and $C_i^m$ over all m. With $A_i = \|x_i\|^2$ the sub-problem of optimizing $\delta\alpha_i$ is given by

$$\min_{\delta\alpha_i} \ \frac{1}{2} A_i \|\delta\alpha_i\|^2 + g_i^T \delta\alpha_i \quad \text{s.t.} \quad \delta\alpha_i \le C_i - \alpha_i, \quad 1^T \delta\alpha_i = 0. \qquad (10)$$

This can be derived by noting the following: (i) changing $\alpha_i$ to $\alpha_i + \delta\alpha_i$ causes $w_m$ to change to $w_m + \delta\alpha_i^m x_i$; (ii) $\|w_m + \delta\alpha_i^m x_i\|^2 = \|w_m\|^2 + A_i (\delta\alpha_i^m)^2 + 2 (w_m^T x_i)\,\delta\alpha_i^m$; (iii) $e_i^m \alpha_i^m$ changes to $e_i^m \alpha_i^m + e_i^m \delta\alpha_i^m$; (iv) using the definition of $g_i^m$ in (5); and (v) using the above in (4) and leaving out all constants that do not depend on $\delta\alpha_i$.

When specialized to binary classification (k = 2), for each i the equality constraint in (4) can be used to eliminate the variable $\alpha_i^m$, $m \ne y_i$. With this done and the dual solution viewed as optimization over the variables $\{\alpha_i^{y_i}\}_i$, SDM is equivalent to doing coordinate descent. This is what we did in [9].
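To make (5)-(7) concrete before stating the full algorithm, here is a minimal NumPy sketch (not from the paper; the function name and the dense representation are illustrative assumptions) that computes, for one example, the gradient vector $g_i$, the violation $v_i$, and the indices $M_i$ and $m_i$:

```python
import numpy as np

def cs_gradient_and_violation(W, x_i, y_i, alpha_i, C):
    """Eq. (5)-(7) for one example under the Crammer-Singer dual.

    W       : (k, n) array whose m-th row is w_m
    x_i     : (n,) input vector
    y_i     : target class of example i
    alpha_i : (k,) current dual variables alpha_i^m
    C       : regularization parameter
    """
    k = W.shape[0]
    e_i = np.ones(k); e_i[y_i] = 0.0        # e_i^m = 1 - delta(y_i, m)
    C_i = np.zeros(k); C_i[y_i] = C         # C_i^m = C only for m = y_i
    g_i = W @ x_i + e_i                     # eq. (5)
    feasible = alpha_i < C_i - 1e-12        # the set {m : alpha_i^m < C_i^m}
    M_i = int(np.argmax(g_i))               # eq. (7)
    m_i = int(np.flatnonzero(feasible)[np.argmin(g_i[feasible])])
    v_i = g_i[M_i] - g_i[m_i]               # eq. (6), >= 0 at any dual-feasible alpha
    return g_i, v_i, M_i, m_i

# toy check: at alpha = 0 (so W = 0) every example has v_i = 1, i.e. (8) is violated
W = np.zeros((4, 6)); x = np.ones(6)
g, v, M, m = cs_gradient_and_violation(W, x, y_i=2, alpha_i=np.zeros(4), C=1.0)
print(v, M, m)   # 1.0, some m != 2, 2
```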

Algorithm 2.1: SDM for the Crammer-Singer Formulation
  Initialize $\alpha$ and the corresponding $w_m$.
  Until (8) holds in an entire loop over the examples, do:
    For i = 1, ..., l:
      (a) Compute $g_i^m \ \forall m$ and obtain $v_i$.
      (b) If $v_i > 0$, solve (10) and set: $\alpha_i \leftarrow \alpha_i + \delta\alpha_i$; $w_m \leftarrow w_m + \delta\alpha_i^m x_i \ \forall m$.

The sub-problem (10) has a nice simple form. Let us discuss methods for solving it. Suppose $A_i = 0$ for some i, which can happen only when $x_i = 0$. In the primal problem (1) such examples contribute a fixed cost of C to the objective function and they have no effect on the $w_m$, so they can be effectively ignored. Hereafter we will therefore assume $A_i > 0 \ \forall i$. Crammer and Singer [5, 6] suggest two methods for solving (10): (i) an exact $O(k \log k)$ algorithm (see section 6.2 of [5]) and (ii) an approximate iterative fixed-point algorithm (see section 6 of [6]). Alternatively, one could also employ an active set method meant for convex quadratic programming problems [8], starting from $\delta\alpha_i = 0$ as the initial point. Since the Hessian of the objective function in (10) is $A_i$ times the identity matrix and the constraints of (10) are in a simple form, the various linear equation solving steps of the active set method can be done analytically and cheaply. Our implementation uses this method.

A description of SDM is given in Algorithm 2.1. If good seed values are unavailable, a simple way to initialize is to set $\alpha = 0$; this corresponds to $w_m = 0$. Let us now discuss the complexity of the algorithm. Each execution of step (a) requires $O(k\bar{n})$ effort; recall that $\bar{n}$ is the average number of nonzero elements per training example. The cost of step (b) is at most $O(k\bar{n})$, and it could be much less depending on the number of $\alpha_i^m$ that are actually changed during the solution of (10). Since the cost of solving (10) itself is usually not higher than these costs when k is not very large, it is reasonable to take the cost of an update for one i as $O(k\bar{n})$. Thus the overall cost of a loop, i.e. a full sequence of updates over all examples, is $O(lk\bar{n})$. The main memory requirement of the algorithm consists of storing the data $x_i \ \forall i$ ($O(l\bar{n})$), the weights $w_m$ ($O(kn)$) and the $\alpha_i^m \ \forall i, m$ (at most $O(kl)$). For problems with a large number of classes but having only a small number of non-zero $\alpha_i^m$, it is prudent to use a linked list data structure and store only those $\alpha_i^m$ which are non-zero.

The following convergence theorem can be proved for SDM, employing ideas similar to those in [17].

Theorem 2.1. Let $\alpha^t$ denote the $\alpha$ at the end of the t-th loop of Algorithm 2.1. Any limit point of a convergent subsequence of $\{\alpha^t\}$ is a global minimum of (4).

Because of space limitations we omit the proof. In [9] we showed that SDM for binary classification has a linear rate of convergence, i.e., there exist $0 \le \mu < 1$ and a $t_0$ such that

$$f(\alpha^t) - f(\alpha^*) \le \mu^{t - t_0}\,\big(f(\alpha^{t_0}) - f(\alpha^*)\big) \quad \forall t \ge t_0, \qquad (11)$$

where f is the dual objective function, t is the loop count in the algorithm, $\alpha^t$ is the $\alpha$ at the end of the t-th loop, and $\alpha^*$ is the dual optimal solution. On the other hand, Theorem 2.1 only implies a weaker form of convergence for Algorithm 2.1. As we remarked earlier, for binary classification SDM becomes coordinate descent, a standard method for which good convergence results exist in the optimization literature [18]. The presence of the equality constraint in (4) makes the multi-class version somewhat non-standard. It is an interesting open problem to establish linear convergence for Algorithm 2.1.
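As a concrete (but simplified) rendering of Algorithm 2.1, the sketch below sequentially sweeps the examples and, instead of the active-set solver used in the paper for (10), takes the closed-form optimal step on just the pair $(M_i, m_i)$; it also permutes the examples before each loop. Both of these simplifications correspond to the efficiency remarks that follow this sketch. All names, defaults and the synthetic usage are assumptions, not the authors' code.

```python
import numpy as np

def sdm_crammer_singer(X, y, C=1.0, eps=0.1, max_epochs=50, seed=0):
    """Simplified Algorithm 2.1: sequential sweeps over the examples.

    X : (l, n) dense data matrix;  y : (l,) labels in {0, ..., k-1}.
    Sub-problem (10) is handled by the closed-form step on the most
    violating pair (M_i, m_i) only, not by the paper's active-set solver.
    """
    l, n = X.shape
    k = int(y.max()) + 1
    alpha = np.zeros((l, k))                 # dual variables alpha_i^m
    W = np.zeros((k, n))                     # w_m = sum_i alpha_i^m x_i  (eq. 3)
    A = np.einsum('ij,ij->i', X, X)          # A_i = ||x_i||^2
    rng = np.random.default_rng(seed)
    for _ in range(max_epochs):
        max_v = 0.0
        for i in rng.permutation(l):         # random order before each loop
            if A[i] == 0.0:                  # x_i = 0 contributes a constant; skip
                continue
            x_i = X[i]
            e_i = np.ones(k); e_i[y[i]] = 0.0
            C_i = np.zeros(k); C_i[y[i]] = C
            g = W @ x_i + e_i                                  # eq. (5)
            feas = np.flatnonzero(alpha[i] < C_i - 1e-12)
            p = int(np.argmax(g))                              # M_i
            q = int(feas[np.argmin(g[feas])])                  # m_i
            v = g[p] - g[q]                                    # eq. (6)
            max_v = max(max_v, v)
            if v <= 1e-12:
                continue
            # optimal step for (10) restricted to delta_p = -t, delta_q = +t
            t = min(v / (2.0 * A[i]), C_i[q] - alpha[i, q])
            alpha[i, p] -= t; alpha[i, q] += t
            W[p] -= t * x_i;  W[q] += t * x_i                  # keep eq. (3) in sync
        if max_v < eps:                                        # eq. (9)
            break
    return W, alpha

# toy usage (random data, just to exercise the code)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10)); y = rng.integers(0, 3, size=200)
W, alpha = sdm_crammer_singer(X, y, C=1.0)
pred = np.argmax(X @ W.T, axis=1)            # decision rule (2)
```

With real, sparse data one would of course store X in a sparse format and add the shrinking and cooling heuristics of section 2.1.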
Let us now discuss various ways of making Algorithm 2.1 more efficient.

In the algorithm the examples are considered in the given order i = 1, 2, ..., l for updating. Any systematic regularities in the given order (for instance, all examples of one class coming together) may lead to slow convergence. So it is a good idea to randomly permute the examples. In fact it is useful to do this random permutation before each loop through the examples. We do this in our implementation of SDM. In [9] we found the same idea to be useful for binary classification. Theorem 2.1 is valid even in the presence of such random permutations.

In the solution of the sub-problem (10), the variables $\alpha_i^m$ corresponding to the most violating indices $\{M_i, m_i\}$ (see (7)) are particularly important. When the number of classes is large, for efficiency one could consider solving the sub-problem only for these two variables. An extended version of this idea is the following. Since the inactive variables, i.e. $m: \alpha_i^m < C_i^m$, are important, we can optimize over the variables $\alpha_i^m$ for $m = M_i$ together with $m: \alpha_i^m < C_i^m$. In most problems the number of inactive variables per example is quite small (even when the number of classes is large), and so we use this method in our implementation. This corresponds to the observation that, for any given example, the confusion in classification is only amongst a few classes.

2.1 Heuristics for improving efficiency

Crammer and Singer [6] employed two heuristics to speed up their algorithm. We employ these heuristics for SDM too. We now discuss them.

Shrinking. Shrinking [11], a technique for removing variables that are not expected to change, is known to be a very good way to speed up decomposition methods for binary SVMs. In the multi-class case, it may turn out that for many examples $\alpha_i^m = 0 \ \forall m$. So, for efficiency, Crammer and Singer [6] suggested the idea of concentrating the algorithm's effort on examples i for which there is at least one non-zero $\alpha_i^m$. (If, for a given i, there is at least one non-zero $\alpha_i^m$, then it is useful to observe the following using the special nature of the constraints in the dual problem: (i) there are at least two non-zero $\alpha_i^m$; and (ii) $\alpha_i^{y_i} > 0$.) We can get further improved efficiency by making two modifications. The first modification is to also remove those i for which there is exactly one $m \ne y_i$ with $\alpha_i^m = -\alpha_i^{y_i} = -C$. (In other words, apart from removing those i with $\alpha_i^m = 0 \ \forall m$, we are also removing those i with exactly two non-zero $\alpha_i^m$, each having magnitude C.) At optimality this generically corresponds to $\arg\max_{m \ne y_i} w_m^T x_i$ being a singleton and $\xi_i = w_m^T x_i - w_{y_i}^T x_i + 1 > 0$. As we approach optimality, for the m that attains the argmax above we expect $\alpha_i^m$ to stay at $-C$, $\alpha_i^{y_i}$ to stay at C, and $\alpha_i^{m'} = 0$ for all other $m'$. If there are many such examples, then this modification will give good benefits.

The second modification is the following: even within an example i that does not get removed, consider for updating only those $\alpha_i^m$ which are non-zero. The efficiency improvement due to this modification can be substantial when the average number of non-zero $\alpha_i^m$ per example is much smaller than the number of classes.

With these modifications included in shrinking, SDM is changed as follows. We alternate between the following two steps.

1. First do a full sequential loop over all examples.
2. Do several loops over only those examples i that do not satisfy the condition for removal discussed above. When doing this, for each i that is included, evaluate $g_i^m$ and optimize $\alpha_i^m$ only for those m that have non-zero $\alpha_i^m$. Repeat these shrunk loops until either $\epsilon$-optimality is satisfied on all the positive $\alpha_i^m$ or the effort associated with these loops exceeds a specified amount. (In our implementation we count the total number of $g_i^m$ evaluations, divide it by kl to get the number of effective loops, and terminate the shrunk loops when the number of effective loops exceeds 5.)

Cooling of the accuracy parameter. Suppose we desire to solve the dual problem quite accurately using a small value of $\epsilon$. When shrinking is employed there is a tendency for the algorithm to spend a lot of effort in the shrunk loops. The algorithm then has to wait for the next full loop (where other optimality-violating $\alpha_i^m$ get included) to make a decent improvement in the dual objective function. To avoid such a staircase behavior of the objective function with computing time, it is a good idea to start with a large value for $\epsilon$ and decrease it in several graded steps until the final desired $\epsilon$ is reached. In our implementation we start from $\epsilon = 1$ and decrease it by a factor of 10 in each graded step.

Figure 1 demonstrates the effectiveness of including the heuristics on two datasets. Clearly, the heuristics are valuable when (4) needs to be solved quite accurately.

Figure 1: Comparison of SDM with (Red, dashdot) and without (Blue, solid) the shrinking and cooling heuristics on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.
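The cooling idea is easy to express as a small driver around any solver that accepts a tolerance; in the sketch below, `train_to_tolerance` is a hypothetical callable (for example, full plus shrunk SDM loops run until (9) holds for the given $\epsilon$), and all names are illustrative assumptions:

```python
def cooling_schedule(train_to_tolerance, eps_final=0.1, eps_start=1.0, factor=10.0):
    """Run the solver with a gradually tightened tolerance.

    train_to_tolerance(eps, state) should warm-start from `state` (e.g. the
    current alpha and w_m), run full and shrunk loops until eps-optimality (9)
    holds, and return the updated state.
    """
    state, eps = None, eps_start
    while True:
        state = train_to_tolerance(eps, state)
        if eps <= eps_final:
            return state
        eps = max(eps / factor, eps_final)

# the tolerances actually visited when the final eps is 0.01: 1.0, 0.1, 0.01
schedule = []
cooling_schedule(lambda eps, state: schedule.append(eps) or state, eps_final=0.01)
print(schedule)   # [1.0, 0.1, 0.01]
```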
This clearly otivates the schee for choosing i in Craer and Singer s ethod. However, the situation for linear SVM is very different. With the cheap O(k n) way (g i = w T x i + e i ) of coputing g i and the cheap O(k n) way (w w + δα i x i ) of updating the w, each update of α i takes only O(k n) effort, and so, if l is large, SDM can do any ore (O(l)) updates of the w for the sae effort as needed by Craer and Singer s ethod for one update of the w. Hence, for linear SVM, always choosing the i that yields the axiu v i is unwise and sequential traversal of the i is a uch better choice. 2.3 Coparison with other ethods We now copare our ethod with other ethods for solving (1). The cutting plane ethod of Joachis et al. [13] and the bundle ethod of Teo et al. [23] are given for the ore general setting of SVMs with structured outputs, but they also specialize to ulti-class linear SVMs using the Craer-Singer forulation. These ethods cobine the loss ter C P i ξi in (1) into a single loss function and build convex, piecewise linear lower bounding approxiations for this function by using sub-differentials of it fro various points encountered during the solution. A negative aspect

2.3 Comparison with other methods

We now compare our method with other methods for solving (1). The cutting plane method of Joachims et al. [13] and the bundle method of Teo et al. [23] are given for the more general setting of SVMs with structured outputs, but they also specialize to multi-class linear SVMs using the Crammer-Singer formulation. These methods combine the loss term $C\sum_i \xi_i$ in (1) into a single loss function and build convex, piecewise linear lower bounding approximations of this function using sub-differentials of it from various points encountered during the solution. A negative aspect of these algorithms is their need to scan the entire training set in order to form one such sub-differential; as we will see below in the experiments, this causes these algorithms to be slow to converge. Barring some differences in implementation, the algorithms in [13] and [23] are the same and we will simply refer to them jointly as the Cutting Plane (CP) method. For comparison we use the implementation at http://www.cs.cornell.edu/people/tj/svm_light/svm_multiclass.html.

The Exponentiated Gradient (EG) method [4] also applies to structured output problems and specializes to the Crammer-Singer multi-class linear SVM formulation. It uses a dual form that is slightly different from (4), but is equivalent to it by a simple translation and scaling of variables. Like our method, the EG method updates the dual variables associated with one example at a time, but the selection of examples is done in an online mode where an i is picked randomly each time. The EG method does an approximate descent via a constraint-obeying exponentiated gradient step. By its very design all dual inequality constraints remain inactive (i.e., $\alpha_i^m < C_i^m \ \forall i, m$) throughout the algorithm, though, as optimality is reached, some of them may asymptotically become active. The EG method is very straightforward to implement and we use our own implementation for the experiments given below.

We now compare our method (SDM) against CP and EG on the five datasets used in this paper, both with respect to the ability to converge well to the optimum of (4) and the ability to reach steady state test set performance fast. We evaluate the former by the Relative Function Value Difference given by $(f - f^*)/f^*$, where $f^*$ is the optimal objective function value of (4), and the latter by the Test Set Accuracy, which is the percentage of test examples that are classified correctly. For simplicity we set C = 1 for this experiment. Figure 3 (see the last page of the paper) gives plots of these two quantities as a function of training time for the five datasets. Overall, SDM is much faster than EG, which in turn is much faster than CP. This is true irrespective of whether the aim is to reach the optimal dual objective very closely or to reach the steady state test accuracy fast. On COVER (see the bottom left plot of Figure 3) EG has difficulty reaching the optimal objective function value, which is possibly due to numerical issues associated with the large exponential operations involved in its updates.

When the regularization parameter C needs to be tuned, say via cross-validation techniques, it is necessary to solve (4) for a range of C values. It is therefore important for an algorithm to be efficient over widely varying C values. For one dataset, MNIST, Figure 2 compares the three methods for large and small C values (C = 10 and C = 0.1); for the intermediate value C = 1, see also the plots for MNIST in Figure 3. When actually finding solutions for an ordered set of several C values, the optimal $\alpha$ and the $w_m$ obtained from the algorithm for one C value can be used to form initializations of the corresponding quantities of the algorithm for the next C value. This is a standard seeding technique that improves efficiency, and it is also used in the EG method [4].

Figure 2: Comparison of SDM (Red, dashdot), EG (Blue, dashed) and Bundle (Green, solid) on the MNIST dataset. The horizontal axis is training time in seconds. Each row corresponds to one C value; Row 1: C = 10; Row 2: C = 0.1.
Stochastic gradient descent methods [22, 3] can be applied to the online version of (1), and they are known to achieve steady state test performance quite fast. We have not evaluated SDM against these methods. However, our evaluation in [9] shows that, for binary classification, SDM is faster than stochastic gradient methods, so we expect the same result to hold for the multi-class case too. The modified MIRA algorithm proposed by Crammer and Singer [7] has some similarities with our sequential dual method, but it is given in an online setting and does not solve the batch problem. Also, ideas such as shrinking do not apply to it. Bordes et al. [1] apply a coordinate descent method to multi-class SVMs, but they focus on nonlinear kernels.

3. SDM FOR WESTON-WATKINS FORMULATION

Let $e_i^m$ and $\delta_{y_i,m}$ be as defined in section 2. In [25], Weston and Watkins proposed an approach for the multi-class problem by formulating the following primal problem:

$$\min_{\{w_m\},\{\xi_i^m\}} \ \frac{1}{2}\sum_m \|w_m\|^2 + C\sum_i \sum_{m \ne y_i} \xi_i^m \quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i^m, \ \ \xi_i^m \ge 0 \ \ \forall m \ne y_i, \ \forall i. \qquad (14)$$

The decision function remains the same as in (2). The dual problem of (14) involves a vector $\alpha$ having dual variables $\alpha_i^m \ \forall i, m \ne y_i$. For convenience let us set $\alpha_i^{y_i} = -\sum_{m \ne y_i} \alpha_i^m$ and define

$$w_m(\alpha) = -\sum_i \alpha_i^m x_i. \qquad (15)$$

The dual of (14) can now be written as

$$\min_\alpha \ f(\alpha) = \frac{1}{2}\sum_m \|w_m(\alpha)\|^2 - \sum_i \sum_{m \ne y_i} \alpha_i^m \quad \text{s.t.} \quad 0 \le \alpha_i^m \le C \ \ \forall m \ne y_i, \ \forall i. \qquad (16)$$

The gradient of f is given by

$$g_i^m = \frac{\partial f(\alpha)}{\partial \alpha_i^m} = w_{y_i}(\alpha)^T x_i - w_m(\alpha)^T x_i - 1 \quad \forall i, \ m \ne y_i. \qquad (17)$$

Optimality can be checked using $v_i^m$, $m \ne y_i$, defined as:

$$v_i^m = \begin{cases} |g_i^m| & \text{if } 0 < \alpha_i^m < C, \\ \max(0, -g_i^m) & \text{if } \alpha_i^m = 0, \\ \max(0, g_i^m) & \text{if } \alpha_i^m = C. \end{cases} \qquad (18)$$

Clearly $v_i^m \ge 0$. Optimality holds when

$$v_i^m = 0 \quad \forall m \ne y_i, \ \forall i. \qquad (19)$$

For practical termination we can approximately check this using a tolerance parameter $\epsilon > 0$:

$$v_i^m < \epsilon \quad \forall m \ne y_i, \ \forall i. \qquad (20)$$

The value $\epsilon = 0.1$ is a good choice for a practical implementation.

As in section 2, the Sequential Dual Method (SDM) for the Weston-Watkins formulation consists of sequentially picking one i at a time and solving the restricted problem of optimizing only $\alpha_i^m$, $m \ne y_i$. To do this, we let $\delta\alpha_i^m$ denote the additive change to be applied to the current $\alpha_i^m$, and optimize $\delta\alpha_i^m$, $m \ne y_i$. Let $\alpha_i$, $\delta\alpha_i$ and $g_i$ be vectors that respectively gather the elements $\alpha_i^m$, $\delta\alpha_i^m$ and $g_i^m$ over all $m \ne y_i$. With $A_i = \|x_i\|^2$ the sub-problem of optimizing $\delta\alpha_i$ is given by

$$\min_{\delta\alpha_i} \ \frac{1}{2} A_i \|\delta\alpha_i\|^2 + \frac{1}{2} A_i (1^T \delta\alpha_i)^2 + g_i^T \delta\alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i + \delta\alpha_i \le C\,1. \qquad (21)$$

Like (10), the sub-problem (21) also has a simple form, and we solve it too using the active set method. We note that the Hessian of the objective function in (21) is $A_i(I + 1 1^T)$ (where I is the identity matrix), and so any of its block diagonal sub-matrices can be analytically and cheaply inverted. Therefore the active-set method is very efficient and we used this method in our implementation. A complete description of SDM for the Weston-Watkins formulation is given in Algorithm 3.1.

Algorithm 3.1: SDM for the Weston-Watkins Formulation
  Initialize $\alpha$ and the corresponding $w_m$.
  Until (19) holds in an entire loop over the examples, do:
    For i = 1, ..., l:
      (a) Compute $g_i^m \ \forall m \ne y_i$ and obtain the $v_i^m$.
      (b) If $\max_m v_i^m > 0$, solve (21) and set:
          $\alpha_i^m \leftarrow \alpha_i^m + \delta\alpha_i^m \ \ \forall m \ne y_i$; set $\delta\alpha_i^{y_i} = -\sum_{m \ne y_i} \delta\alpha_i^m$;
          $w_m \leftarrow w_m - \delta\alpha_i^m x_i \ \ \forall m$.

The initialization, as well as the time and memory complexity analysis given earlier for Crammer-Singer SDM, also hold for this algorithm. Turning to convergence, let us assume, as in section 2, that $A_i > 0 \ \forall i$. Unlike (4), the dual (16) does not have any equality constraints. Algorithm 3.1 is a block coordinate descent method for which convergence results from the optimization literature (see [18] and in particular section 6 there) can be applied to show that Algorithm 3.1 has linear convergence. Finally, similar to section 2.2, we can argue that the method of Hsu and Lin [10] is suited for nonlinear kernels, but SDM is better for the linear SVM case.
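To illustrate the per-example step of Algorithm 3.1, here is a minimal sketch that assumes the sign conventions of (15)-(17) as written above and uses cyclic exact coordinate minimization on (21) as a simple stand-in for the paper's active-set method; all names and the toy usage are illustrative assumptions.

```python
import numpy as np

def ww_example_update(W, x_i, y_i, alpha_i, C, inner_passes=3):
    """One per-example step in the spirit of Algorithm 3.1.

    W       : (k, n) primal weights w_m
    alpha_i : (k,) dual variables (the entry y_i is only implicit here)
    Sub-problem (21) is minimized approximately by cyclic exact coordinate
    minimization over the m != y_i, each coordinate clipped to [0, C].
    """
    k = W.shape[0]
    A_i = float(x_i @ x_i)
    if A_i == 0.0:
        return W, alpha_i
    scores = W @ x_i
    others = [m for m in range(k) if m != y_i]
    g = {m: scores[y_i] - scores[m] - 1.0 for m in others}     # eq. (17)
    delta = {m: 0.0 for m in others}
    for _ in range(inner_passes):                              # coordinate descent on (21)
        for m in others:
            s_rest = sum(delta.values()) - delta[m]
            step = -(g[m] + A_i * s_rest) / (2.0 * A_i)        # exact 1-D minimizer
            delta[m] = float(np.clip(step, -alpha_i[m], C - alpha_i[m]))
    for m in others:                                           # apply the update
        alpha_i[m] += delta[m]
        W[m] -= delta[m] * x_i                                 # w_m <- w_m - dalpha_i^m x_i
    W[y_i] += sum(delta.values()) * x_i                        # w_{y_i} picks up the sum
    return W, alpha_i

# toy usage: one example of class 1 with k = 3 classes, n = 5 features
W, a = ww_example_update(np.zeros((3, 5)), np.ones(5), y_i=1, alpha_i=np.zeros(3), C=1.0)
```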

For efficiency, (21) can be solved only over some restricted variables, say only the $\delta\alpha_i^m$ for which $v_i^m > 0$. The heuristics given for Crammer-Singer SDM can be extended to the Weston-Watkins version too. In the shrinking phase, for each i, among all $m \ne y_i$ we compute $g_i^m$ and update $\alpha_i^m$ only for those m which satisfy $0 < \alpha_i^m < C$. Figure 4 shows the effectiveness of including the heuristics on the MNIST and COVER datasets.

Figure 4: Comparison of SDM (Weston-Watkins) with (Red, dashdot) and without (Blue, solid) the shrinking and cooling heuristics on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.

4. COMPARISON WITH ONE-VERSUS-REST

The One-Versus-Rest (OVR) method, which consists of developing, for each class m, a binary classifier ($w_m$) that separates class m from the rest of the classes, and then using (2) for inference, is one of the popular schemes for solving a multi-class problem. For nonlinear SVMs, Rifkin and Klautau [21] argue that OVR is as good as any other approach if the binary classifiers are well-tuned regularized classifiers like SVMs. The main points in defense of OVR (and against direct multi-class methods) are: (1) the complicated implementation of direct multi-class methods, (2) their slow training, and (3) OVR yields accuracy similar to that given by direct multi-class methods. These points should be viewed differently when considering linear SVMs. Firstly, a method such as SDM is easy to implement. Secondly, with respect to training speed, the following simplistic analysis gives insight. It is reasonable to take the cost of solving a dual problem (whether it is a binary or a direct multi-class one) to be a function $T(l_d)$, where $l_d$ is the number of dual variables in that problem. In OVR we solve k binary problems, each of which has l dual variables, and so its cost is $kT(l)$. For a direct multi-class solution, e.g. Crammer-Singer, the number of dual variables is kl and therefore its cost is $T(kl)$. For nonlinear SVMs $T(\cdot)$ is a nonlinear function (some empirical studies have shown it to be close to quadratic) and so the OVR solution is much faster than direct multi-class solutions. On the other hand, for linear SVMs $T(\cdot)$ tends to be a linear function (this can be seen from the complexity analysis of section 2, coupled with the assumption that a fixed small number of loops gives good enough convergence), and so the speeds of OVR and direct multi-class solutions are close. We observe this in an experiment given below. Thirdly, the performance difference between OVR and direct multi-class methods depends on the dataset. The experiments below offer more insight on this aspect.
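A toy numerical rendering of the second point, the $kT(l)$ versus $T(kl)$ comparison above, under the two stylized cost models mentioned in the text (the constants below are arbitrary assumptions; only the ratios matter):

```python
# Stylized cost comparison for k classes and l training examples.
k, l = 100, 500_000

T_lin  = lambda d: d          # T(d) ~ d:   typical for linear SVM solvers
T_quad = lambda d: d ** 2     # T(d) ~ d^2: closer to what is observed for nonlinear SVMs

for name, T in [("linear T", T_lin), ("quadratic T", T_quad)]:
    ovr    = k * T(l)         # OVR: k binary problems, l dual variables each
    direct = T(k * l)         # direct multi-class: one problem with k*l dual variables
    print(f"{name:12s} direct/OVR cost ratio = {direct / ovr:g}")
# linear T    -> ratio 1  (OVR and direct multi-class comparable)
# quadratic T -> ratio k  (direct multi-class about k times more expensive)
```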
First we implement OVR in which each binary SVM classifier, employing the standard hinge loss, is solved using the binary version of SDM developed in [9]. By the results in [9] this is probably the fastest solver for OVR for the types of datasets considered in this paper. We begin with a simple experiment where we fix C = 1 and solve the multi-class problems using OVR and SDM for the Crammer-Singer and Weston-Watkins formulations. For the MNIST and COVER datasets, Figure 5 gives the behavior of test set performance as the solutions progress in time. For the OVR method test set performance was evaluated after each loop over the training set in each binary classifier solution, whereas for the direct multi-class SDM the evaluation was done on a finer scale, four times within one loop. Clearly, all three methods take nearly the same amount of computing time to reach their respective steady state values. This is in agreement with our earlier observation that, for linear SVMs, OVR and direct multi-class solutions have similar complexity. Also, on these datasets the final test set accuracies of the Crammer-Singer and Weston-Watkins formulations are better than the accuracy of OVR.

Figure 5: Comparison of One-Versus-Rest (Green, solid), Crammer-Singer (Red, dashdot) and Weston-Watkins (Blue, dashed) on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.

Since accuracy depends on the choice of C, we carry out a more detailed study on the accuracy aspects. For each of the 5 datasets, we combined the training and test datasets and made 10 random 60%/20%/20% train/validation/test class-stratified partitions. For each of the three multi-class methods the C parameter is tuned using the train and validation sets by considering the following set of C values: {0.01, 0.03, 0.1, 0.3, 1, 3, 10}. For the best C value, i.e. the one that gives the largest validation set accuracy, a re-training is done on the train+validation set and the resulting classifier is evaluated on the test set. The mean and standard deviation of the test set accuracies are reported in Table 4.2. We conduct a Wilcoxon sign-rank test with a significance level of 0.01 to compare OVR against the direct multi-class methods on each dataset; see Table 4.3. Comparison of OVR and Crammer-Singer indicates that the observed differences are statistically significant in favor of Crammer-Singer for the SECTOR, MNIST, RCV1 and COVER datasets. Comparison of OVR and Weston-Watkins indicates that the results are statistically significant for all the datasets; however, on NEWS20 and SECTOR the results are in favor of OVR, while on the other three datasets they are in favor of Weston-Watkins. Though statistical significance may not mean practically significant improvements, we do note that the improvements shown by the direct multi-class methods over OVR on the MNIST and COVER datasets are quite good. Overall, for linear SVMs it may be better to employ the direct multi-class methods (in particular, Crammer-Singer) whenever fast solvers for them (such as ones based on SDM) are available. It is worth noting that, of the five datasets under consideration, only MNIST and COVER are un-normalized, i.e. $\|x_i\|$ is not the same for all i. It is possible that OVR's performance is more affected by the lack of data normalization, and therefore direct multi-class solutions may be preferable in such situations. A more detailed analysis is needed to check this.

Table 4.2: Test set accuracy (mean and standard deviation) comparison of One-Versus-Rest (OVR) and multi-class SDMs for the Crammer-Singer (CS) and Weston-Watkins (WW) formulations, with C tuned for each method.

Dataset   OVR            CS             WW
NEWS20    85.39 ± 0.37   85.26 ± 0.37   85.10 ± 0.43
SECTOR    94.83 ± 0.56   95.17 ± 0.62   94.37 ± 0.57
MNIST     91.46 ± 0.23   92.   ± 0.20   91.84 ± 0.21
RCV1       .85 ± 0.10    91.19 ± 0.07   91.10 ± 0.11
COVER     71.35 ± 0.33   72.31 ± 0.37   72.40 ± 0.25

Table 4.3: p-values from the Wilcoxon sign-rank test.

Dataset   OVR-CS   OVR-WW
NEWS20    0.43     0.01
SECTOR    0.01     3.9
MNIST     2.0      2.0
RCV1      2.0      2.0
COVER     2.0      2.0

5. CONCLUSION

In this paper we have presented sequential descent methods for the Crammer-Singer and Weston-Watkins multi-class linear SVM formulations that are well suited for solving large scale problems. The basic idea of sequentially looking at one example at a time and optimizing the dual variables associated with it can be extended to more general SVM problems with structured outputs, such as taxonomy, multi-label and sequence labeling problems. We note that, for the structured outputs setting, Tsochantaridis et al. [24] mentioned the sequential dual method as a possibility (see Variant (b) in Algorithm 1 of that paper), but did not pursue it as a potentially fast method. In such extensions the quadratic programming sub-problem associated with each example does not have a simple form such as those in (10) and (21), and some care is needed to solve this sub-problem efficiently. We are currently conducting experiments on such extensions and will report results in a future paper.

6. REFERENCES

[1] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support vector machines with LaRank. In ICML, 2007.
[2] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.
[3] L. Bottou. Stochastic gradient descent examples, 2007. http://leon.bottou.org/projects/sgd.
[4] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, 2008. To appear.
[5] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In COLT, 2000.
[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2001.
[7] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. JMLR, 3:951-991, 2003.
[8] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, 1987.
[9] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[10] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE TNN, 13(2):415-425, 2002.
[11] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[12] T. Joachims. Training linear SVMs in linear time. In ACM KDD, 2006.
[13] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting plane training of structural SVMs. MLJ, 2008.
[14] K. Lang. Newsweeder: Learning to filter netnews. In ICML, pages 331-339, 1995.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. MNIST database available at http://yann.lecun.com/exdb/mnist/.
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
[17] C.-J. Lin. A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE TNN, 13(5):1045-1052, 2002.
[18] Z.-Q. Luo and P. Tseng. On the convergence of coordinate descent method for convex differentiable minimization. JOTA, 72(1):7-35, 1992.
[19] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI'98 Workshop on Learning for Text Categorization, 1998.
[20] J. D. M. Rennie and R. Rifkin. Improving multiclass text classification with the Support Vector Machine. Technical Report AIM-2001-026, MIT, 2001.
[21] R. Rifkin and A. Klautau. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
[22] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, 2007.
[23] C. H. Teo, A. Smola, S. V. Vishwanathan, and Q. V. Le. A scalable modular convex solver for regularized risk minimization. In ACM KDD, 2007.
[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2005.
[25] J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN'99, Brussels, 1999. D. Facto Press.

Figure 3: Comparison of SDM (Red, dashdot), EG (Blue, dashed) and Bundle (Green, solid) on various datasets with C = 1. The horizontal axis is training time in seconds. Each row corresponds to one dataset; Row 1: NEWS20; Row 2: SECTOR; Row 3: MNIST; Row 4: RCV1; Row 5: COVER.