A Sequential Dual Method for Large Scale Multi-Class Linear SVMs

Kai-Wei Chang, Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan
S. Sathiya Keerthi, Yahoo! Research, Santa Clara, CA
Cho-Jui Hsieh, Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan
S. Sundararajan, Yahoo! Labs, Bangalore, India
Chih-Jen Lin, Dept. of Computer Science, National Taiwan University, Taipei 106, Taiwan

ABSTRACT

Efficient training of direct multi-class formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number of examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is enhanced further by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state of the art solvers such as bundle, cutting plane and exponentiated gradient methods.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation

General Terms: Algorithms, Performance, Experimentation

1. INTRODUCTION

Support vector machines (SVM) [2] are popular methods for solving classification problems. An SVM employing a nonlinear kernel maps the input x to a high dimensional feature space via a nonlinear function φ(x) and forms a linear classifier in that space to solve the problem. Inference as well as design are carried out using the kernel function, K(x_i, x_j) = φ(x_i)^T φ(x_j). In several data mining applications the input vector x itself has rich features and so it is sufficient to take φ as the identity map, i.e. φ(x) = x. SVMs developed in this setting are referred to as Linear SVMs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'08, August 24-27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM /08/08...$5.00.

In applications such as text classification the input vector x resides in a high dimensional space (e.g. a bag-of-words representation), the training set has a large number of labeled examples, and the training data matrix is sparse, i.e. the average number of non-zero elements in one x is small. Most classification problems arising in such domains involve many classes. Efficient learning of the parameters of linear SVMs for such problems is important. The One-Versus-Rest (OVR) method, which consists of developing, for each class, a binary classifier that separates that class from the rest of the classes, and then combining the classifiers for multi-class inference, is one of the popular schemes to solve a multi-class problem. OVR is easy and efficient to design and also gives good performance [21]. Crammer and Singer [5, 6] and Weston and Watkins [25] gave direct multi-class SVM formulations. With nonlinear kernels in mind, Crammer and Singer [5, 6] and Hsu and Lin [10] gave efficient decomposition methods for these formulations. Recently, some good methods have been proposed for SVMs with structured outputs, which also specialize nicely for Crammer-Singer multi-class linear SVMs. Teo et al. [23] suggested a bundle method and Joachims et al.
[13] gave a cutting plane method that is very close to it; these methods can be viewed as extensions of the method given by Joachims [12] for binary linear SVMs. Collins et al. [4] gave the exponentiated gradient method. In this paper we present fast dual methods for the Crammer-Singer and Weston-Watkins formulations. Like the methods of Crammer and Singer [5, 6] and Hsu and Lin [10], our methods optimize the dual variables associated with one example at a time. However, while those previous methods always choose the example that violates optimality the most for updating, our methods sequentially traverse through the examples. For the linear SVM case this leads to significant differences. In particular, for this case our methods are extremely efficient while those methods are not so efficient. We borrow and modify the shrinking and cooling heuristics of Crammer and Singer [6] to make our methods even faster. Experiments on several datasets indicate that our method is also much faster than cutting plane, bundle and exponentiated gradient methods. For the special case of binary classification our methods turn out to be equivalent to doing coordinate descent. We covered this method in detail in [9].

Table 1.1: Properties of datasets

    Dataset   l         l_test    n        k     # nonzeros in whole data
    NEWS20    15,935    3,993     62,061   20    ...,593,692
    SECTOR    6,412     3,207     55,197   105   ...,569,4...
    MNIST     60,000    10,000    780      10    ...,5...,375
    RCV1      531,742   15,913    47,236   ...   ...,734,621
    COVER     522,911   58,101    54       7     ...,972,144

Notations. The following notations are used in the paper. We use l to denote the number of training examples, l_test to denote the number of test examples and k to denote the number of classes. Throughout the paper, the index i will denote a training example and the index m will denote a class. Unless otherwise mentioned, i will run from 1 to l and m will run from 1 to k. x_i ∈ R^n is the input vector (n is the number of input features) associated with the i-th example and y_i ∈ {1,...,k} is the corresponding target class. We will use 1 to denote a vector of all 1's whose dimension will be clear from the context. Superscript T will denote transpose for vectors.

The following multi-class datasets are used in the paper for doing various experiments: NEWS20 [14], SECTOR [20, 19], MNIST [15], RCV1 [16] and COVER [13]. RCV1 is a taxonomy based multi-label dataset and so we modified it to a multi-class dataset by removing all examples which had mismatched labels in the taxonomy tree; in the process of doing this, two classes do not have any training examples. For NEWS20, MNIST and RCV1 we used the standard train-test split; for SECTOR we used the second train/test split used in [20]; and, for COVER we used the train/test split used in [13]. Table 1.1 gives key properties of these datasets.

The paper is organized as follows. In section 2 we give the details of our method (we call it the Sequential Dual Method (SDM)) for the Crammer-Singer formulation and evaluate it against other methods. In section 3 we describe SDM for the Weston-Watkins formulation. In section 4 we compare these methods with the One-Versus-Rest method of solving multi-class problems. We conclude the paper in section 5. An implementation of a variant of the SDM algorithm for the Crammer-Singer formulation is available in the LIBLINEAR library (version 1.3 or higher) with the option -s 4.

2. SDM FOR CRAMMER-SINGER FORMULATION

In [5, 6], Crammer and Singer proposed an approach for the multi-class problem by formulating the following primal problem:

\[
\min_{\{w_m\},\{\xi_i\}} \ \frac{1}{2}\sum_m \|w_m\|^2 + C\sum_i \xi_i
\quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i \ \ \forall i, m, \tag{1}
\]

where C > 0 is the regularization parameter, w_m is the weight vector associated with class m, e_i^m = 1 - δ_{y_i,m}, and δ_{y_i,m} = 1 if y_i = m, δ_{y_i,m} = 0 if y_i ≠ m. Note that, in (1) the constraint for m = y_i corresponds to the non-negativity constraint, ξ_i ≥ 0. The decision function is

\[
\arg\max_m \ w_m^T x. \tag{2}
\]

The dual problem of (1), developed along the lines of [5, 6], involves a vector α having dual variables α_i^m ∀i, m. The w_m get defined via α as

\[
w_m(\alpha) = \sum_i \alpha_i^m x_i. \tag{3}
\]

At most places below, we simply write w_m(α) as w_m. Let C_i^m = 0 if y_i ≠ m, C_i^m = C if y_i = m. The dual problem is

\[
\min_\alpha \ f(\alpha) = \frac{1}{2}\sum_m \|w_m(\alpha)\|^2 + \sum_i \sum_m e_i^m \alpha_i^m
\quad \text{s.t.} \quad \alpha_i^m \le C_i^m \ \forall m, \ \ \sum_m \alpha_i^m = 0 \ \ \forall i. \tag{4}
\]

The gradient of f plays an important role and is given by

\[
g_i^m = \frac{\partial f(\alpha)}{\partial \alpha_i^m} = w_m(\alpha)^T x_i + e_i^m \ \ \forall i, m. \tag{5}
\]

If n̄ is the average number of nonzero elements per training example, then on average the evaluation of each g_i^m takes O(n̄) effort. Optimality of α for (4) can be checked using the quantity

\[
v_i = \max_m g_i^m - \min_{m:\, \alpha_i^m < C_i^m} g_i^m \ \ \forall i. \tag{6}
\]

For a given i, the values of m that attain the max and min in (6) play a useful role and are denoted as follows:

\[
M_i = \arg\max_m g_i^m \quad \text{and} \quad m_i = \arg\min_{m:\, \alpha_i^m < C_i^m} g_i^m. \tag{7}
\]

From (6) it is clear that v_i is non-negative. Dual optimality holds when:

\[
v_i = 0 \ \ \forall i. \tag{8}
\]
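To make (5)-(7) concrete, here is a small dense-numpy sketch that computes g_i^m, v_i, M_i and m_i for a single example; the function and variable names are ours, and an actual implementation would work on sparse data rather than dense arrays.

    import numpy as np

    def violation(W, x_i, y_i, alpha_i, C):
        """Compute g_i^m (eq. 5), v_i (eq. 6) and the indices M_i, m_i (eq. 7)
        for one example, given the k x n matrix W of class weight vectors."""
        k = W.shape[0]
        e_i = np.ones(k)
        e_i[y_i] = 0.0                       # e_i^m = 1 - delta_{y_i, m}
        g_i = W @ x_i + e_i                  # g_i^m = w_m^T x_i + e_i^m
        C_i = np.zeros(k)
        C_i[y_i] = C                         # C_i^m = C iff m = y_i, else 0
        M_i = int(np.argmax(g_i))
        idx = np.flatnonzero(alpha_i < C_i)  # the m with alpha_i^m < C_i^m
        m_i = int(idx[np.argmin(g_i[idx])])
        v_i = g_i[M_i] - g_i[m_i]            # optimality violation, eq. (6)
        return g_i, v_i, M_i, m_i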
For practical termination we can approximately check optimality using a tolerance parameter ε > 0:

\[
v_i < \epsilon \ \ \forall i. \tag{9}
\]

We will also refer to this as ε-optimality. The value ε = 0.1 is a good choice for a practical implementation of SDM.

SDM consists of sequentially picking one i at a time and solving the restricted problem of optimizing only α_i^m ∀m, keeping all other variables fixed. To do this, we let δα_i^m denote the additive change to be applied to the current α_i^m, and optimize δα_i^m ∀m. Let α_i, δα_i, g_i and C_i be vectors that respectively gather the elements α_i^m, δα_i^m, g_i^m and C_i^m over all m. With A_i = ‖x_i‖^2 the sub-problem of optimizing δα_i is given by

\[
\min_{\delta\alpha_i} \ \frac{1}{2} A_i \|\delta\alpha_i\|^2 + g_i^T \delta\alpha_i
\quad \text{s.t.} \quad \delta\alpha_i \le C_i - \alpha_i, \ \ \mathbf{1}^T \delta\alpha_i = 0. \tag{10}
\]

This can be derived by noting the following: (i) changing α_i^m to α_i^m + δα_i^m causes w_m to change to w_m + δα_i^m x_i; (ii) ‖w_m + δα_i^m x_i‖^2 = ‖w_m‖^2 + A_i (δα_i^m)^2 + 2(w_m^T x_i) δα_i^m; (iii) e_i^m α_i^m changes to e_i^m α_i^m + e_i^m δα_i^m; (iv) using the definition of g_i^m in (5); and (v) using the above in (4) and leaving out all constants that do not depend on δα_i.

When specialized to binary classification (k = 2), for each i the equality constraint in (4) can be used to eliminate the variable α_i^m, m ≠ y_i. With this done, and the dual solution viewed as optimization over the variables {α_i^{y_i}}_i, SDM is equivalent to doing coordinate descent. This is what we did in [9].
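The sub-problem (10) can be solved in several ways (see the next section); purely as an illustration of its structure, the sketch below finds the multiplier lam of the equality constraint 1^T δα_i = 0 by bisection, using the KKT condition δα_i^m = min(C_i^m - α_i^m, -(g_i^m + lam)/A_i). This routine and its names are ours, not the paper's, and it is not the active-set method used in the actual implementation.

    import numpy as np

    def solve_subproblem(g_i, alpha_i, C_i, A_i, iters=100):
        """Approximately solve the sub-problem (10),
             min_d 0.5*A_i*||d||^2 + g_i.d   s.t.  d <= C_i - alpha_i, sum(d) = 0,
        by bisection on the multiplier lam of the equality constraint.
        KKT gives d_m = min(B_m, -(g_i^m + lam)/A_i) with B = C_i - alpha_i.
        Assumes A_i > 0 and a feasible alpha_i (so B >= 0 and sum(alpha_i) = 0)."""
        B = C_i - alpha_i
        lo = -np.max(g_i + A_i * B)   # here every d_m is at its bound: sum(d) = sum(B) = C >= 0
        hi = -np.min(g_i)             # here every d_m <= 0: sum(d) <= 0
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            d = np.minimum(B, -(g_i + lam) / A_i)
            if d.sum() > 0.0:
                lo = lam              # sum(d) decreases with lam, so move right
            else:
                hi = lam
        lam = 0.5 * (lo + hi)
        return np.minimum(B, -(g_i + lam) / A_i)   # sum(d) = 0 up to bisection accuracy

The returned step is applied as α_i ← α_i + δα_i and w_m ← w_m + δα_i^m x_i, as in Algorithm 2.1 below.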

Algorithm 2.1 SDM for Crammer-Singer Formulation
    Initialize α and the corresponding w_m ∀m.
    Until (8) holds in an entire loop over the examples, do:
        For i = 1, ..., l
            (a) Compute g_i^m ∀m and obtain v_i
            (b) If v_i ≠ 0, solve (10) and set:
                    α_i ← α_i + δα_i
                    w_m ← w_m + δα_i^m x_i ∀m

The sub-problem in (10) has a nice simple form. Let us discuss methods for solving it. Suppose A_i = 0 for some i, which can happen only when x_i = 0. In the primal problem (1) such examples contribute a fixed cost of C to the objective function, they have no effect on the w_m, and so they can be effectively ignored. So, hereafter we will assume A_i > 0 ∀i. Crammer and Singer [5, 6] suggest two methods for solving (10): (i) an exact O(k log k) algorithm (see section 6.2 of [5]) and (ii) an approximate iterative fixed-point algorithm (see section 6 of [6]). Alternatively, one could also employ an active set method meant for convex quadratic programming problems [8], starting from δα_i = 0 as the initial point. Since the Hessian of the objective function in (10) is A_i times the identity matrix and the constraints of (10) are in a simple form, the various linear equation solving steps of the active set method can be done analytically and cheaply. Our implementation uses this method.

A description of SDM is given in Algorithm 2.1. If good seed values are unavailable, a simple way to initialize is to set α = 0; this corresponds to w_m = 0 ∀m. Let us now discuss the complexity of the algorithm. Each execution of step (a) requires O(k n̄) effort; recall that n̄ is the average number of nonzero elements per training example. The cost of step (b) is at most O(k n̄), and it could be much less depending on the number of α_i^m that are actually changed during the solution of (10). Since the cost of solving (10) itself is usually not as high as these costs if k is not very large, it is reasonable to take the cost of an update for one i as O(k n̄). Thus the overall cost of a loop, i.e. a full sequence of updates over all examples, is O(l k n̄). The main memory requirement of the algorithm consists of storing the data, x_i ∀i (O(l n̄)), the weights, w_m ∀m (O(kn)), and the α_i^m ∀i, m (at most O(kl)). For problems with a large number of classes but having only a small number of non-zero α_i^m, it is prudent to use a linked list data structure and store only those α_i^m which are non-zero.

The following convergence theorem can be proved for SDM, employing ideas similar to those in [17].

Theorem 2.1 Let α^t denote the α at the end of the t-th loop of Algorithm 2.1. Any limit point of a convergent subsequence of {α^t} is a global minimum of (4).

Because of space limitation we omit the proof. In [9] we showed that SDM for binary classification has a linear rate of convergence, i.e., there exist 0 ≤ μ < 1 and a t_0 such that

\[
f(\alpha^t) - f(\alpha^*) \le \mu^{\,t-t_0}\,\bigl(f(\alpha^{t_0}) - f(\alpha^*)\bigr) \ \ \forall t \ge t_0, \tag{11}
\]

where f is the dual objective function, t is the loop count in the algorithm, α^t is the α at the end of the t-th loop, and α^* is the dual optimal solution. On the other hand, Theorem 2.1 only implies a weaker form of convergence for Algorithm 2.1. As we remarked earlier, for binary classification SDM becomes coordinate descent, a standard method for which good convergence results exist in the optimization literature [18]. The presence of the equality constraint in (4) makes the multi-class version somewhat non-standard. It is an interesting open problem to establish linear convergence for Algorithm 2.1.
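Putting the pieces together, a minimal dense-numpy rendering of Algorithm 2.1 might look as follows; it reuses solve_subproblem from the earlier sketch and omits the shrinking and cooling heuristics (as well as the sparse arithmetic) of the actual implementation.

    import numpy as np

    def sdm_crammer_singer(X, y, k, C, eps=0.1, max_loops=50):
        """Algorithm 2.1 for the Crammer-Singer dual (4), without shrinking
        and cooling.  X: (l, n) dense data, y: labels in {0,...,k-1}.
        Uses solve_subproblem() from the earlier sketch to solve (10)."""
        l, n = X.shape
        W = np.zeros((k, n))                 # w_m = sum_i alpha_i^m x_i, eq. (3)
        alpha = np.zeros((l, k))
        A = (X * X).sum(axis=1)              # A_i = ||x_i||^2
        for _ in range(max_loops):
            max_v = 0.0
            for i in range(l):
                if A[i] == 0.0:              # x_i = 0: such examples can be ignored
                    continue
                C_i = np.zeros(k); C_i[y[i]] = C
                e_i = np.ones(k);  e_i[y[i]] = 0.0
                g_i = W @ X[i] + e_i                          # eq. (5)
                v_i = g_i.max() - g_i[alpha[i] < C_i].min()   # eq. (6)
                max_v = max(max_v, v_i)
                if v_i > 1e-12:
                    d = solve_subproblem(g_i, alpha[i], C_i, A[i])
                    alpha[i] += d
                    W += np.outer(d, X[i])   # w_m <- w_m + d_m * x_i
            if max_v < eps:                  # eps-optimality, eq. (9)
                break
        return W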
Let us now discuss various ways of making Algorithm 2.1 more efficient. In the algorithm the examples are considered in the given order i = 1, 2, ..., l for updating. Any systematic regularities in the given order (for instance, all examples of one class coming together) may lead to slow convergence. So it is a good idea to randomly permute the examples. In fact it is useful to do this random permutation before each loop through the examples. We do this in our implementation of SDM. In [9] we found the same idea to be useful for binary classification. Theorem 2.1 is valid even in the presence of such random permutations.

In the solution of the sub-problem (10), the variables α_i^m corresponding to the most violating indices {M_i, m_i} (see (7)) are particularly important. When the number of classes is large, for efficiency one could consider solving the sub-problem only for these two variables. An extended version of this idea is the following. Since the inactive variables, i.e., m : α_i^m < C_i^m, are important, we can optimize over the variables α_i^m for m = M_i and m : α_i^m < C_i^m. In most problems the number of inactive variables per example is quite small (even when the number of classes is large) and so we use this method in our implementation. (This corresponds to the observation that for any given example the confusion in classification is only amongst a few classes.)

2.1 Heuristics for improving efficiency

Crammer and Singer [6] employed two heuristics to speed up their algorithm. We employ these heuristics for SDM too. We now discuss these heuristics.

Shrinking. Shrinking [11], a technique for removing variables that are not expected to change, is known to be a very good way to speed up decomposition methods for binary SVMs. In the multi-class case, it may turn out that for many examples α_i^m = 0 ∀m. So, for efficiency Crammer and Singer [6] suggested the idea of concentrating the algorithm's effort on examples i for which there is at least one non-zero α_i^m. (If, for a given i, there is at least one non-zero α_i^m, then it is useful to observe the following using the special nature of the constraints in the dual problem: (i) there are at least two non-zero α_i^m; and (ii) α_i^{y_i} > 0.) We can get further improved efficiency by doing two modifications. The first modification is to also remove those i for which there is exactly one m ≠ y_i with α_i^m = -α_i^{y_i} = -C. (In other words, apart from removing those i with α_i^m = 0 ∀m, we are also removing those i with exactly two non-zero α_i^m, each having magnitude C.) At optimality this generically corresponds to arg max_{m ≠ y_i} w_m^T x_i being a singleton and ξ_i = w_m^T x_i - w_{y_i}^T x_i + 1 > 0. As we approach optimality, for the m that attains the arg max above we expect α_i^m to stay at -C, α_i^{y_i} to stay at C, and α_i^m = 0 for all other m. If there are any such examples, then this modification will give good benefits.
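In code, the two removal criteria just described (all α_i^m zero, or exactly the pair α_i^{y_i} = C and one other α_i^m = -C nonzero) might be checked as follows; the helper name and the tolerance are ours.

    import numpy as np

    def can_shrink(alpha_i, y_i, C, tol=1e-12):
        """True if example i can be skipped in the shrunk loops: either all
        alpha_i^m are zero, or exactly two are nonzero with alpha_i^{y_i} = C
        and one other entry equal to -C."""
        nz = np.flatnonzero(np.abs(alpha_i) > tol)
        if nz.size == 0:
            return True
        if nz.size == 2 and abs(alpha_i[y_i] - C) <= tol:
            other = nz[nz != y_i]
            return other.size == 1 and abs(alpha_i[other[0]] + C) <= tol
        return False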

The second modification is the following: even within an example i that does not get removed, consider for updating only those α_i^m which are non-zero. The efficiency improvement due to this modification can be substantial when the average number of non-zero α_i^m for each example is much smaller than the number of classes.

With these modifications included in shrinking, SDM is changed as follows. We alternate between the following two steps.
1. First do a full sequential loop over all examples.
2. Do several loops over only those examples i that do not satisfy the condition for removal discussed above. When doing this, for each i that is included, evaluate g_i^m and optimize α_i^m only for those m that have non-zero α_i^m. Repeat these shrunk loops until either ε-optimality is satisfied on all the positive α_i^m or the effort associated with these loops exceeds a specified amount. (In our implementation we count the total number of g_i^m evaluations, divide it by kl to get the number of effective loops, and terminate the shrunk loops when the number of effective loops exceeds a specified limit.)

Cooling of the accuracy parameter. Suppose we desire to solve the dual problem quite accurately using a small value of ε. When shrinking is employed there is a tendency for the algorithm to spend a lot of effort in the shrunk loops. The algorithm has to wait for the next full loop (where other optimality-violating α_i^m get included) to make a decent improvement in the dual objective function. To avoid a staircase behavior of the objective function with computing time, it is a good idea to start with a large value for ε and decrease it in several graded steps until the final desired ε is reached. In our implementation we start from ε = 1 and decrease it by a factor of 10 in each graded step, as sketched below. Figure 1 demonstrates the effectiveness of including the heuristics on two datasets. Clearly, the heuristics are valuable when (4) needs to be solved quite accurately.

Figure 1: Comparison of SDM with (red, dash-dot) and without (blue, solid) the shrinking and cooling heuristics on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.
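The graded ε schedule could be driven by an outer loop of the following form; sdm_loop stands for an assumed routine that runs full and shrunk SDM loops to ε-optimality, warm-started from the given (W, α).

    import numpy as np

    def sdm_with_cooling(X, y, k, C, eps_final=0.1):
        """Cool the accuracy parameter from eps = 1 down to eps_final by factors
        of 10, warm-starting each stage from the previous (W, alpha)."""
        l, n = X.shape
        W, alpha = np.zeros((k, n)), np.zeros((l, k))
        eps = 1.0
        while True:
            W, alpha = sdm_loop(X, y, k, C, W, alpha, eps)  # assumed stage solver
            if eps <= eps_final:
                break
            eps = max(eps / 10.0, eps_final)
        return W, alpha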
2.2 Comparison with Crammer and Singer's method

In [5, 6], Crammer and Singer gave a method for solving (4) with nonlinear kernels in mind. Their method also uses the basic idea of choosing an i and solving (10) to update α_i. Unlike our method, which does a sequential traversal of the i, Crammer and Singer's method always chooses the i for which the optimality violation v_i in (6) is largest at the current α. To appreciate Crammer and Singer's motivation for doing this, first note that the g_i^m are key quantities for calculating the v_i as well as for setting up (10). So let us look at the computational complexity associated with g_i^m in some detail. For problems with a nonlinear kernel K(x_j, x_i), the g_i^m need to be computed using

\[
g_i^m = \sum_j \alpha_j^m K(x_j, x_i) + e_i^m \ \ \forall i, m. \tag{12}
\]

Let us recall the notations g_i, α_i and δα_i given above (10). Direct computation of g_i for one given i via (12) takes O(lk n̄) effort. On the other hand, suppose we maintain g_j ∀j. In a typical step of Crammer and Singer's method, an i is chosen and α_i is changed as α_i ← α_i + δα_i; in such a case the g_j can be updated using

\[
g_j^m \leftarrow g_j^m + K(x_j, x_i)\,\delta\alpha_i^m \ \ \forall j, m, \tag{13}
\]

which also requires (in total, over all j) only O(lk n̄) effort. Therefore maintaining g_j ∀j does not cost more than the direct computation of a single g_i using (12). As maintaining g_j ∀j implies fewer iterations (i.e., faster convergence due to the ability to choose for updating the i that violates optimality most), of course one should take this advantage. This clearly motivates the scheme for choosing i in Crammer and Singer's method.

However, the situation for linear SVMs is very different. With the cheap O(k n̄) way (g_i^m = w_m^T x_i + e_i^m) of computing g_i and the cheap O(k n̄) way (w_m ← w_m + δα_i^m x_i) of updating the w_m, each update of α_i takes only O(k n̄) effort, and so, if l is large, SDM can do many more (O(l)) updates of the w_m for the same effort as needed by Crammer and Singer's method for one update of the w_m. Hence, for linear SVMs, always choosing the i that yields the maximum v_i is unwise and sequential traversal of the i is a much better choice.

2.3 Comparison with other methods

We now compare our method with other methods for solving (1). The cutting plane method of Joachims et al. [13] and the bundle method of Teo et al. [23] are given for the more general setting of SVMs with structured outputs, but they also specialize to multi-class linear SVMs using the Crammer-Singer formulation. These methods combine the loss term C Σ_i ξ_i in (1) into a single loss function and build convex, piecewise linear lower bounding approximations of this function by using sub-differentials of it from various points encountered during the solution.

Figure 2: Comparison of SDM (red, dash-dot), EG (blue, dashed) and Bundle (green, solid) on the MNIST dataset. The horizontal axis is training time in seconds. Each row corresponds to one C value; Row 1: C = 10; Row 2: C = 0.1.

A negative aspect of these algorithms is their need to scan the entire training set in order to form one sub-differential mentioned above; as we will see below in the experiments, this causes these algorithms to be slow to converge. Barring some differences in implementation, the algorithms in [13] and [23] are the same and we will simply refer to them jointly as the Cutting Plane (CP) method. For comparison we use the implementation from svm_light/svm_multiclass.html.

The Exponentiated Gradient (EG) method [4] also applies to structured output problems and specializes to the Crammer-Singer multi-class linear SVM formulation. It uses a dual form that is slightly different from (4), but is equivalent to it by a simple translation and scaling of variables. Like our method, the EG method updates the dual variables associated with one example at a time, but the selection of examples is done in an online mode where an i is picked randomly each time. The EG method does an approximate descent via a constraint-obeying exponentiated gradient step. By its very design all dual inequality constraints remain inactive (i.e., α_i^m < C_i^m ∀i, m) throughout the algorithm, though, as optimality is reached, some of them may asymptotically become active. The EG method is very straightforward to implement and we use our own implementation for the experiments given below.

We now compare our method (SDM) against CP and EG on the five datasets used in this paper, both with respect to the ability to converge well to the optimum of (4) as well as the ability to reach steady state test set performance fast. We evaluate the former by the Relative Function Value Difference given by (f - f*)/f*, where f* is the optimal objective function value of (4), and the latter by Test Set Accuracy, which is the percentage of test examples that are classified correctly. For simplicity we set C = 1 for this experiment. Figure 3 (see the last page of the paper) gives plots of these two quantities as a function of training time for the five datasets. Overall, SDM is much faster than EG, which in turn is much faster than CP. This is true irrespective of whether the aim is to efficiently reach the optimum dual objective very closely or to reach the steady state test accuracy fast. On COVER (see the bottom left plot of Figure 3) EG has difficulty reaching the optimal objective function value, which is possibly due to numerical issues associated with the large exponential operations involved in its updates.

When the regularization parameter C needs to be tuned, say via cross-validation techniques, it is necessary to solve (4) for a range of C values. It is important for an algorithm to be efficient for widely varying C values. For one dataset, MNIST, Figure 2 compares the three methods for large and small C values (C = 10 and C = 0.1); for the intermediate value of C = 1, see also the plots for MNIST in Figure 3. When actually finding solutions for an ordered set of several C values, the optimal α and the w_m obtained from the algorithm for one C value can be used to form initializations of the corresponding quantities of the algorithm for the next C value. This is a standard seeding technique that improves efficiency, and is also used in the EG method [4].

Stochastic gradient descent methods [22, 3] can be applied to the online version of (1) and they are known to achieve steady state test performance quite fast.
We have not evaluated SDM against these methods. However, our evaluation in [9] shows that, for binary classification, SDM is faster than stochastic gradient methods. So we expect the same result to hold for the multi-class case too. The modified MIRA algorithm proposed by Crammer and Singer [7] has some similarities with our sequential dual method, but it is given in an online setting and does not solve the batch problem. Also, ideas such as shrinking do not apply to it. Bordes et al. [1] apply a coordinate descent method for multi-class SVM, but they focus on nonlinear kernels.

3. SDM FOR WESTON-WATKINS FORMULATION

Let e_i^m and δ_{y_i,m} be as defined in section 2. In [25], Weston and Watkins proposed an approach for the multi-class problem by formulating the following primal problem:

\[
\min_{\{w_m\},\{\xi_i^m\}} \ \frac{1}{2}\sum_m \|w_m\|^2 + C\sum_i \sum_{m \ne y_i} \xi_i^m
\quad \text{s.t.} \quad w_{y_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i^m, \ \ \xi_i^m \ge 0 \ \ \forall m \ne y_i, \ \forall i. \tag{14}
\]

The decision function remains the same as in (2). The dual problem of (14) involves a vector α having dual variables α_i^m ∀i, m ≠ y_i. For convenience let us set and define

\[
\alpha_i^{y_i} = -\sum_{m \ne y_i} \alpha_i^m, \qquad w_m(\alpha) = -\sum_i \alpha_i^m x_i. \tag{15}
\]

The dual of (14) can now be written as

\[
\min_\alpha \ f(\alpha) = \frac{1}{2}\sum_m \|w_m(\alpha)\|^2 - \sum_i \sum_{m \ne y_i} \alpha_i^m
\quad \text{s.t.} \quad 0 \le \alpha_i^m \le C \ \ \forall m \ne y_i, \ \forall i. \tag{16}
\]

The gradient of f is given by

\[
g_i^m = \frac{\partial f(\alpha)}{\partial \alpha_i^m} = w_{y_i}(\alpha)^T x_i - w_m(\alpha)^T x_i - 1 \ \ \forall i, m \ne y_i. \tag{17}
\]

Optimality can be checked using v_i^m, m ≠ y_i, defined as:

\[
v_i^m =
\begin{cases}
|g_i^m| & \text{if } 0 < \alpha_i^m < C, \\
\max(0, -g_i^m) & \text{if } \alpha_i^m = 0, \\
\max(0, g_i^m) & \text{if } \alpha_i^m = C.
\end{cases} \tag{18}
\]

Clearly v_i^m ≥ 0. Optimality holds when:

\[
v_i^m = 0 \ \ \forall m \ne y_i, \ \forall i. \tag{19}
\]

For practical termination we can approximately check this using a tolerance parameter ε > 0:

\[
v_i^m < \epsilon \ \ \forall m \ne y_i, \ \forall i. \tag{20}
\]

The value ε = 0.1 is a good choice for a practical implementation.

Like in section 2, the Sequential Dual Method (SDM) for the Weston-Watkins formulation consists of sequentially picking one i at a time and solving the restricted problem of optimizing only α_i^m ∀m ≠ y_i. To do this, we let δα_i^m denote the additive change to be applied to the current α_i^m, and optimize δα_i^m ∀m ≠ y_i. Let α_i, δα_i and g_i be vectors that respectively gather the elements α_i^m, δα_i^m and g_i^m over all m ≠ y_i. With A_i = ‖x_i‖^2 the sub-problem of optimizing δα_i is given by

\[
\min_{\delta\alpha_i} \ \frac{1}{2} A_i \|\delta\alpha_i\|^2 + \frac{1}{2} A_i (\mathbf{1}^T \delta\alpha_i)^2 + g_i^T \delta\alpha_i
\quad \text{s.t.} \quad -\alpha_i \le \delta\alpha_i \le C\mathbf{1} - \alpha_i. \tag{21}
\]

Like (10), the sub-problem (21) too has a simple form. We solve it also using the active set method. We note that the Hessian of the objective function in (21) is A_i(I + 11^T) (where I is the identity matrix), and so, any of its block diagonal sub-matrices can be analytically and cheaply inverted. Therefore the active-set method is very efficient and we used this method in our implementation. A complete description of SDM for the Weston-Watkins formulation is given in Algorithm 3.1.

Algorithm 3.1 SDM for Weston-Watkins Formulation
    Initialize α and the corresponding w_m ∀m.
    Until (19) holds in an entire loop over the examples, do:
        For i = 1, ..., l
            (a) Compute g_i^m ∀m ≠ y_i and obtain v_i^m ∀m ≠ y_i
            (b) If max_m v_i^m ≠ 0, solve (21) and set:
                    α_i^m ← α_i^m + δα_i^m ∀m ≠ y_i
                    Set δα_i^{y_i} = -Σ_{m ≠ y_i} δα_i^m
                    w_m ← w_m - δα_i^m x_i ∀m

Figure 4: Comparison of SDM (Weston-Watkins) with (red, dash-dot) and without (blue, solid) the shrinking and cooling heuristics on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.

The initialization, as well as the time and memory complexity analysis, given earlier for Crammer-Singer SDM also hold for this algorithm. Turning to convergence, let us assume, as in section 2, that A_i > 0 ∀i. Unlike (4), the dual (16) does not have any equality constraints. Algorithm 3.1 is a block coordinate descent method for which convergence results from the optimization literature (see [18] and in particular section 6 there) can be applied to show that Algorithm 3.1 has linear convergence. Finally, similar to section 2.2, we can argue that the method of Hsu and Lin [10] is suited for nonlinear kernels, but SDM is better for the linear SVM case.

For efficiency, (21) can be solved only for some restricted variables, say only the δα_i^m for which v_i^m > 0. The heuristics given for Crammer-Singer SDM can be extended to the Weston-Watkins version too.
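As a simplified stand-in for the exact active-set solution of (21), the sketch below performs one pass of clipped Newton steps over the variables α_i^m, m ≠ y_i, of a single example (the diagonal entry of the Hessian A_i(I + 11^T) is 2A_i); the function name and the dense-numpy arithmetic are ours.

    import numpy as np

    def ww_example_update(W, X, y, alpha, i, C):
        """One pass of clipped Newton steps on the variables alpha_i^m, m != y_i,
        of the Weston-Watkins dual (16), maintaining the w_m.  The diagonal
        entry of the Hessian A_i (I + 11^T) of (21) is 2*A_i."""
        x_i, y_i = X[i], y[i]
        A_i = float(x_i @ x_i)
        if A_i == 0.0:
            return W, alpha
        for m in range(W.shape[0]):
            if m == y_i:
                continue
            g_im = W[y_i] @ x_i - W[m] @ x_i - 1.0            # eq. (17)
            d = np.clip(-g_im / (2.0 * A_i),                  # clipped Newton step
                        -alpha[i, m], C - alpha[i, m])        # keeps 0 <= alpha_i^m <= C
            if d != 0.0:
                alpha[i, m] += d
                W[m]   -= d * x_i      # w_m(alpha) = -sum_i alpha_i^m x_i, eq. (15)
                W[y_i] += d * x_i      # via alpha_i^{y_i} = -sum_{m != y_i} alpha_i^m
        return W, alpha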

In the shrinking phase, for each i, among all m ≠ y_i we only compute g_i^m and update α_i^m for those m which satisfy 0 < α_i^m < C. Figure 4 shows the effectiveness of including the heuristics on the MNIST and COVER datasets.

4. COMPARISON WITH ONE-VERSUS-REST

The One-Versus-Rest (OVR) method, which consists of developing, for each class m, a binary classifier (w_m) that separates class m from the rest of the classes, and then using (2) for inference, is one of the popular schemes to solve a multi-class problem. For nonlinear SVMs, Rifkin and Klautau [21] argue that OVR is as good as any other approach if the binary classifiers are well-tuned regularized classifiers like SVMs. The main points in defense of OVR (and against direct multi-class methods) are: (1) complicated implementation of direct multi-class methods, (2) their slow training, and (3) OVR yields accuracy similar to that given by direct multi-class methods.

These points should be viewed differently when considering linear SVMs. Firstly, a method such as SDM is easy to implement. Secondly, with respect to training speed, the following simplistic analysis gives insight. It is reasonable to take the cost of solving a dual problem (whether it is a binary or a direct multi-class one) to be a function T(l_d), where l_d is the number of dual variables in that problem. (This can be seen from the complexity analysis of section 2 coupled with the assumption that a fixed small number of loops gives good enough convergence.) In OVR we solve k binary problems, each of which has l dual variables, and so its cost is kT(l). For a direct multi-class solution, e.g. Crammer-Singer, the number of dual variables is kl and therefore its cost is T(kl). For nonlinear SVMs T(·) is a nonlinear function (some empirical studies have shown it to be close to a quadratic) and so the OVR solution is much faster than direct multi-class solutions. On the other hand, for linear SVMs, T(·) tends to be a linear function and so the speeds of OVR and direct multi-class solutions are close. We observe this in an experiment given below. Thirdly, performance differences between OVR and direct multi-class methods are dependent on the dataset. The experiments below offer more insight on this aspect.

First we do an implementation of OVR in which each binary SVM classifier employing the standard hinge loss is solved using the binary version of SDM, as developed in [9]. By the results in [9] this is probably the fastest solver for OVR for the types of datasets considered in this paper. We begin with a simple experiment where we fix C = 1 and solve the multi-class problems using OVR and SDM for the Crammer-Singer and Weston-Watkins formulations. For the MNIST and COVER datasets Figure 5 gives the behavior of test set performance as the solutions progress in time. For the OVR method test set performance was evaluated after each loop of the training set in each binary classifier solution, whereas, for the direct multi-class SDM, the evaluation was done on a finer scale, four times within one loop. Clearly, all three methods take nearly the same amount of computing time to reach their respective steady state values. This is in agreement with our earlier mention that, for linear SVMs, OVR and direct multi-class solutions have similar complexity. Also, in these datasets the final test set accuracies of the Crammer-Singer and Weston-Watkins formulations are better than the accuracy of OVR. Since accuracy depends on the choice of C, we do a more detailed study on the accuracy aspects below.
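Schematically, the OVR baseline used here looks as follows; binary_svm stands for an assumed binary linear-SVM trainer that returns a weight vector (for instance, the dual coordinate descent solver of [9]), and the wrapper itself is only a sketch.

    import numpy as np

    def train_ovr(X, y, k, C):
        """One-Versus-Rest: train k binary hinge-loss classifiers, class m vs. rest."""
        W = np.zeros((k, X.shape[1]))
        for m in range(k):
            y_pm = np.where(y == m, 1.0, -1.0)   # +1 for class m, -1 for the rest
            W[m] = binary_svm(X, y_pm, C)        # assumed binary linear-SVM solver
        return W

    def predict(W, x):
        return int(np.argmax(W @ x))             # decision rule (2)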
Figure 5: Comparison of One-Versus-Rest (green, solid), Crammer-Singer (red, dash-dot) and Weston-Watkins (blue, dashed) on the MNIST (top) and COVER (bottom) datasets. The horizontal axis is training time in seconds.

For each of the 5 datasets, we combined the training and test datasets and made 10 random 60%/20%/20% train/validation/test class-stratified partitions. For each of the three multi-class methods the C parameter is tuned using the train and validation sets by considering the following set of C values: {0.01, 0.03, 0.1, 0.3, 1, 3, 10}. For the best C value, i.e. the one that gives the largest validation set accuracy, a re-training is done on the train+validation set and the resulting classifier is evaluated on the test set. The mean and standard deviation values of the test set accuracies are reported in Table 4.2.
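The tuning protocol just described can be written schematically as follows; stratified_split, train and accuracy are assumed helpers (train could be, for example, an SDM-based solver from section 2), and only the grid of C values comes from the text.

    import numpy as np

    C_GRID = [0.01, 0.03, 0.1, 0.3, 1, 3, 10]

    def tune_and_evaluate(X, y, k, n_splits=10, seed=0):
        """For each random class-stratified 60/20/20 split, pick the C with the
        best validation accuracy, retrain on train+validation, score on test."""
        rng = np.random.default_rng(seed)
        test_acc = []
        for _ in range(n_splits):
            tr, va, te = stratified_split(y, rng, (0.6, 0.2, 0.2))  # assumed helper
            best_C = max(C_GRID,
                         key=lambda C: accuracy(train(X[tr], y[tr], k, C), X[va], y[va]))
            W = train(X[np.r_[tr, va]], y[np.r_[tr, va]], k, best_C)
            test_acc.append(accuracy(W, X[te], y[te]))
        return float(np.mean(test_acc)), float(np.std(test_acc))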

Table 4.2: Test set accuracy (mean and standard deviation) comparison of One-Versus-Rest (OVR), and multi-class SDMs for the Crammer-Singer (CS) and Weston-Watkins (WW) formulations, with C tuned for each method.

    Dataset   OVR            CS             WW
    NEWS20    ... ± ...      ... ± ...      ... ± 0.43
    SECTOR    ... ± ...      ... ± ...      ... ± 0.57
    MNIST     ... ± ...      ... ± ...      ... ± 0.21
    RCV1      ...85 ± ...    ... ± ...      ... ± 0.11
    COVER     ... ± ...      ... ± ...      ... ± 0.25

Table 4.3: p-values from the Wilcoxon signed-rank test.

    Dataset   OVR-CS   OVR-WW
    NEWS20    ...      ...
    SECTOR    ...      ...
    MNIST     ...      ...
    RCV1      ...      ...
    COVER     ...      ...

We conduct the Wilcoxon signed-rank test with a significance level of 0.01 to compare OVR against the direct multi-class methods on each dataset; see Table 4.3. Comparison of OVR and Crammer-Singer indicates that the observed differences are statistically significant in favor of Crammer-Singer for the SECTOR, MNIST, RCV1 and COVER datasets. Comparison of OVR and Weston-Watkins indicates that the results are statistically significant for all the datasets. However, on NEWS20 and SECTOR the results are in favor of OVR, while on the other three datasets they are in favor of Weston-Watkins. Though statistical significance analysis may not mean practically significant improvements, we do note that the improvements shown by the direct multi-class methods over OVR on the MNIST and COVER datasets are quite good. Overall, for linear SVMs it may be better to employ the direct multi-class methods (in particular, Crammer-Singer) whenever fast solvers for them (such as ones based on SDM) are available. It is worth noting that, of the five datasets under consideration, only MNIST and COVER are un-normalized, i.e. ‖x_i‖ is not the same for all i. It is possible that OVR's performance is more affected by the lack of data normalization, and therefore direct multi-class solutions may be more preferred in such situations. A more detailed analysis is needed to check this.

5. CONCLUSION

In this paper we have presented sequential descent methods for the Crammer-Singer and the Weston-Watkins multi-class linear SVM formulations that are well suited for solving large scale problems. The basic idea of sequentially looking at one example at a time and optimizing the dual variables associated with it can be extended to more general SVM problems with structured outputs, such as taxonomy, multi-label and sequence labeling problems. We note that, for the structured outputs setting, Tsochantaridis et al. [24] mentioned the sequential dual method as a possibility (see Variant (b) in Algorithm 1 of that paper), but did not pursue it as a potentially fast method. In such extensions the quadratic programming sub-problem associated with each example does not have a simple form such as those in (10) and (21), and some care is needed to solve this sub-problem efficiently. We are currently conducting experiments on such extensions and will report results in a future paper.

6. REFERENCES

[1] A. Bordes, L. Bottou, P. Gallinari, and J. Weston. Solving multiclass support vector machines with LaRank. In ICML, 2007.
[2] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In COLT, 1992.
[3] L. Bottou. Stochastic gradient descent examples.
[4] M. Collins, A. Globerson, T. Koo, X. Carreras, and P. Bartlett. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. JMLR, to appear.
[5] K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. In COLT, 2000.
[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2:265-292, 2001.
[7] K. Crammer and Y. Singer. Ultraconservative online algorithms for multiclass problems. JMLR, 3:951-991, 2003.
[8] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons.
[9] C.-J. Hsieh, K.-W. Chang, C.-J. Lin, S. S. Keerthi, and S. Sundararajan. A dual coordinate descent method for large-scale linear SVM. In ICML, 2008.
[10] C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines. IEEE TNN, 13(2):415-425, 2002.
[11] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA.
[12] T. Joachims. Training linear SVMs in linear time. In ACM KDD, 2006.
[13] T. Joachims, T. Finley, and C.-N. J. Yu. Cutting plane training of structural SVMs. MLJ.
[14] K. Lang. Newsweeder: Learning to filter netnews. In ICML, 1995.
[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, November 1998. (MNIST database.)
[16] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361-397, 2004.
[17] C.-J. Lin. A formal analysis of stopping criteria of decomposition methods for support vector machines. IEEE TNN, 13(5):1045-1052, 2002.
[18] Z.-Q. Luo and P. Tseng.
On the convergence of the coordinate descent method for convex differentiable minimization. JOTA, 72(1):7-35, 1992.
[19] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, 1998.
[20] J. D. M. Rennie and R. Rifkin. Improving multiclass text classification with the Support Vector Machine. Technical Report AIM-..., MIT, 2001.
[21] R. Rifkin and A. Klautau. In defense of one-vs-all classification. JMLR, 5:101-141, 2004.
[22] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In ICML, 2007.
[23] C. H. Teo, A. Smola, S. V. Vishwanathan, and Q. V. Le. A scalable modular convex solver for regularized risk minimization. In ACM KDD, 2007.
[24] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 6:1453-1484, 2005.
[25] J. Weston and C. Watkins. Multi-class support vector machines. In M. Verleysen, editor, Proceedings of ESANN99, D. Facto Press, Brussels, 1999.

Figure 3: Comparison of SDM (red, dash-dot), EG (blue, dashed) and Bundle (green, solid) on various datasets with C = 1. The horizontal axis is training time in seconds. Each row corresponds to one dataset; Row 1: NEWS20; Row 2: SECTOR; Row 3: MNIST; Row 4: RCV1; and Row 5: COVER.


More information

Research Article Robust ε-support Vector Regression

Research Article Robust ε-support Vector Regression Matheatical Probles in Engineering, Article ID 373571, 5 pages http://dx.doi.org/10.1155/2014/373571 Research Article Robust ε-support Vector Regression Yuan Lv and Zhong Gan School of Mechanical Engineering,

More information

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION A eshsize boosting algorith in kernel density estiation A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION C.C. Ishiekwene, S.M. Ogbonwan and J.E. Osewenkhae Departent of Matheatics, University

More information

Fast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns

Fast Structural Similarity Search of Noncoding RNAs Based on Matched Filtering of Stem Patterns Fast Structural Siilarity Search of Noncoding RNs Based on Matched Filtering of Ste Patterns Byung-Jun Yoon Dept. of Electrical Engineering alifornia Institute of Technology Pasadena, 91125, S Eail: bjyoon@caltech.edu

More information

On Constant Power Water-filling

On Constant Power Water-filling On Constant Power Water-filling Wei Yu and John M. Cioffi Electrical Engineering Departent Stanford University, Stanford, CA94305, U.S.A. eails: {weiyu,cioffi}@stanford.edu Abstract This paper derives

More information

Non-Parametric Non-Line-of-Sight Identification 1

Non-Parametric Non-Line-of-Sight Identification 1 Non-Paraetric Non-Line-of-Sight Identification Sinan Gezici, Hisashi Kobayashi and H. Vincent Poor Departent of Electrical Engineering School of Engineering and Applied Science Princeton University, Princeton,

More information

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS Jochen Till, Sebastian Engell, Sebastian Panek, and Olaf Stursberg Process Control Lab (CT-AST), University of Dortund,

More information

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION

RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION RANDOM GRADIENT EXTRAPOLATION FOR DISTRIBUTED AND STOCHASTIC OPTIMIZATION GUANGHUI LAN AND YI ZHOU Abstract. In this paper, we consider a class of finite-su convex optiization probles defined over a distributed

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes

Low-complexity, Low-memory EMS algorithm for non-binary LDPC codes Low-coplexity, Low-eory EMS algorith for non-binary LDPC codes Adrian Voicila,David Declercq, François Verdier ETIS ENSEA/CP/CNRS MR-85 954 Cergy-Pontoise, (France) Marc Fossorier Dept. Electrical Engineering

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

Domain-Adversarial Neural Networks

Domain-Adversarial Neural Networks Doain-Adversarial Neural Networks Hana Ajakan, Pascal Gerain 2, Hugo Larochelle 3, François Laviolette 2, Mario Marchand 2,2 Départeent d inforatique et de génie logiciel, Université Laval, Québec, Canada

More information

1 Proof of learning bounds

1 Proof of learning bounds COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #4 Scribe: Akshay Mittal February 13, 2013 1 Proof of learning bounds For intuition of the following theore, suppose there exists a

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon

Lecture 12: Ensemble Methods. Introduction. Weighted Majority. Mixture of Experts/Committee. Σ k α k =1. Isabelle Guyon Lecture 2: Enseble Methods Isabelle Guyon guyoni@inf.ethz.ch Introduction Book Chapter 7 Weighted Majority Mixture of Experts/Coittee Assue K experts f, f 2, f K (base learners) x f (x) Each expert akes

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Data-Driven Imaging in Anisotropic Media

Data-Driven Imaging in Anisotropic Media 18 th World Conference on Non destructive Testing, 16- April 1, Durban, South Africa Data-Driven Iaging in Anisotropic Media Arno VOLKER 1 and Alan HUNTER 1 TNO Stieltjesweg 1, 6 AD, Delft, The Netherlands

More information

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation

On the Communication Complexity of Lipschitzian Optimization for the Coordinated Model of Computation journal of coplexity 6, 459473 (2000) doi:0.006jco.2000.0544, available online at http:www.idealibrary.co on On the Counication Coplexity of Lipschitzian Optiization for the Coordinated Model of Coputation

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words)

A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine. (1900 words) 1 A Self-Organizing Model for Logical Regression Jerry Farlow 1 University of Maine (1900 words) Contact: Jerry Farlow Dept of Matheatics Univeristy of Maine Orono, ME 04469 Tel (07) 866-3540 Eail: farlow@ath.uaine.edu

More information

Fairness via priority scheduling

Fairness via priority scheduling Fairness via priority scheduling Veeraruna Kavitha, N Heachandra and Debayan Das IEOR, IIT Bobay, Mubai, 400076, India vavitha,nh,debayan}@iitbacin Abstract In the context of ulti-agent resource allocation

More information

Adapting the Pheromone Evaporation Rate in Dynamic Routing Problems

Adapting the Pheromone Evaporation Rate in Dynamic Routing Problems Adapting the Pheroone Evaporation Rate in Dynaic Routing Probles Michalis Mavrovouniotis and Shengxiang Yang School of Coputer Science and Inforatics, De Montfort University The Gateway, Leicester LE1

More information

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning JMLR: Workshop and Conference Proceedings vol (1) 1 15 New Bounds for Learning Intervals with Iplications for Sei-Supervised Learning David P. Helbold dph@soe.ucsc.edu Departent of Coputer Science, University

More information

COS 424: Interacting with Data. Written Exercises

COS 424: Interacting with Data. Written Exercises COS 424: Interacting with Data Hoework #4 Spring 2007 Regression Due: Wednesday, April 18 Written Exercises See the course website for iportant inforation about collaboration and late policies, as well

More information

REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION

REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION ISSN 139 14X INFORMATION TECHNOLOGY AND CONTROL, 008, Vol.37, No.3 REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION Riantas Barauskas, Vidantas Riavičius Departent of Syste Analysis, Kaunas

More information

Synthetic Generation of Local Minima and Saddle Points for Neural Networks

Synthetic Generation of Local Minima and Saddle Points for Neural Networks uigi Malagò 1 Diitri Marinelli 1 Abstract In this work-in-progress paper, we study the landscape of the epirical loss function of a feed-forward neural network, fro the perspective of the existence and

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs

Algorithms for parallel processor scheduling with distinct due windows and unit-time jobs BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES Vol. 57, No. 3, 2009 Algoriths for parallel processor scheduling with distinct due windows and unit-tie obs A. JANIAK 1, W.A. JANIAK 2, and

More information

Computational and Statistical Learning Theory

Computational and Statistical Learning Theory Coputational and Statistical Learning Theory Proble sets 5 and 6 Due: Noveber th Please send your solutions to learning-subissions@ttic.edu Notations/Definitions Recall the definition of saple based Radeacher

More information

arxiv: v1 [stat.ot] 7 Jul 2010

arxiv: v1 [stat.ot] 7 Jul 2010 Hotelling s test for highly correlated data P. Bubeliny e-ail: bubeliny@karlin.ff.cuni.cz Charles University, Faculty of Matheatics and Physics, KPMS, Sokolovska 83, Prague, Czech Republic, 8675. arxiv:007.094v

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Data Dependent Convergence For Consensus Stochastic Optimization

Data Dependent Convergence For Consensus Stochastic Optimization Data Dependent Convergence For Consensus Stochastic Optiization Avleen S. Bijral, Anand D. Sarwate, Nathan Srebro Septeber 8, 08 arxiv:603.04379v ath.oc] Sep 06 Abstract We study a distributed consensus-based

More information

PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification

PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification Eilie Morvant eilieorvant@lifuniv-rsfr okol Koço sokolkoco@lifuniv-rsfr Liva Ralaivola livaralaivola@lifuniv-rsfr Aix-Marseille

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Journal of Machine Learning Research 5 (2004) 529-547 Subitted 1/03; Revised 8/03; Published 5/04 Coputable Shell Decoposition Bounds John Langford David McAllester Toyota Technology Institute at Chicago

More information

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007

Deflation of the I-O Series Some Technical Aspects. Giorgio Rampa University of Genoa April 2007 Deflation of the I-O Series 1959-2. Soe Technical Aspects Giorgio Rapa University of Genoa g.rapa@unige.it April 27 1. Introduction The nuber of sectors is 42 for the period 1965-2 and 38 for the initial

More information

Accuracy at the Top. Abstract

Accuracy at the Top. Abstract Accuracy at the Top Stephen Boyd Stanford University Packard 264 Stanford, CA 94305 boyd@stanford.edu Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 ohri@cis.nyu.edu Corinna

More information

arxiv: v1 [math.na] 10 Oct 2016

arxiv: v1 [math.na] 10 Oct 2016 GREEDY GAUSS-NEWTON ALGORITHM FOR FINDING SPARSE SOLUTIONS TO NONLINEAR UNDERDETERMINED SYSTEMS OF EQUATIONS MÅRTEN GULLIKSSON AND ANNA OLEYNIK arxiv:6.395v [ath.na] Oct 26 Abstract. We consider the proble

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information