Coordinate Descent Method for Large-scale L2-loss Linear SVM. Kai-Wei Chang, Cho-Jui Hsieh, Chih-Jen Lin


Abstract

Linear support vector machines (SVM) are useful for classifying large-scale sparse data. Problems with sparse features are common in applications such as document classification and natural language processing. In this paper, we propose a novel coordinate descent algorithm for training linear SVM with the L2-loss function. At each step, the proposed method minimizes a one-variable sub-problem while fixing the other variables. The sub-problem is solved by Newton steps with a line search technique. The procedure globally converges at a linear rate. Experiments show that our method is more efficient and stable than state-of-the-art methods such as Pegasos and TRON.

1 Introduction

Support vector machines (SVM) (Boser et al., 1992) are a popular data classification tool. Given a set of instance-label pairs $(x_j, y_j)$, $j = 1, \ldots, l$, $x_j \in R^n$, $y_j \in \{-1, +1\}$, SVM solves the following unconstrained optimization problem:
$$\min_w \; f(w) = \frac{1}{2} w^T w + C \sum_{j=1}^{l} \xi(w; x_j, y_j), \qquad (1)$$
where $\xi(w; x_j, y_j)$ is a loss function and $C \in R$ is a penalty parameter. There are two common loss functions. L1-SVM uses the sum of losses and minimizes
$$f(w) = \frac{1}{2} w^T w + C \sum_{j=1}^{l} \max(1 - y_j w^T x_j, 0), \qquad (2)$$
while L2-SVM uses the sum of squared losses and minimizes
$$f(w) = \frac{1}{2} w^T w + C \sum_{j=1}^{l} \max(1 - y_j w^T x_j, 0)^2. \qquad (3)$$

Department of Computer Science, National Taiwan University, Taipei 106, Taiwan. {b92084,b92085,cjlin}@csie.ntu.edu.tw
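As a concrete companion to (3), the following is a small NumPy sketch (our illustration, not part of the paper) of the L2-SVM objective; the names `X`, `y`, and `C` follow the notation above.

```python
import numpy as np

def l2svm_objective(w, X, y, C):
    """L2-SVM objective (3): 0.5 * w^T w + C * sum_j max(1 - y_j w^T x_j, 0)^2."""
    margins = 1.0 - y * (X @ w)            # b_j(w) = 1 - y_j w^T x_j
    return 0.5 * w @ w + C * np.sum(np.maximum(margins, 0.0) ** 2)

# Tiny usage example on random data (illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = np.sign(rng.standard_normal(5))
print(l2svm_objective(np.zeros(3), X, y, C=1.0))   # equals C * l at w = 0
```

At $w = 0$ every term $\max(1 - y_j w^T x_j, 0)$ equals one, so the printed value is $C \cdot l$.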

In some applications, the SVM problem includes a bias term b. For convenience, one may extend each instance with an additional dimension to eliminate this term:
$$x_j^T \leftarrow [x_j^T, 1], \qquad w^T \leftarrow [w^T, b].$$
SVM usually maps training vectors into a high-dimensional (and possibly infinite-dimensional) space and solves the dual problem of (1) with a nonlinear kernel. In some applications, data already appear in a rich feature space, so the performances with and without the nonlinear mapping are similar. If data are not mapped, we can often train much larger data sets. We refer to such cases as linear SVM; they are often encountered in applications such as document classification.

The objective function of L1-SVM (2) is not differentiable, so typical optimization methods cannot be directly applied. In contrast, L2-SVM (3) is a piecewise quadratic and strongly convex function, which is differentiable but not twice differentiable (Mangasarian, 2002). We focus on L2-SVM in this paper because of its differentiability.

In recent years, several optimization methods have been applied to solve linear SVM in large-scale scenarios. For example, Keerthi and DeCoste (2005) and Mangasarian (2002) propose modified Newton methods to train L2-SVM. As (3) is not twice differentiable, to obtain the Newton direction they use the generalized Hessian matrix (i.e., the generalized second derivative). A trust region Newton method (TRON) (Lin et al., 2007) has been proposed to solve logistic regression and L2-SVM. For large-scale L1-SVM, SVM^perf (Joachims, 2006) uses a cutting plane technique to obtain the solution of (2). Smola et al. (2008) apply bundle methods and view SVM^perf as a special case. Zhang (2004) proposes a stochastic gradient method; Pegasos (Shalev-Shwartz et al., 2007) extends Zhang's work and develops an algorithm which alternates between stochastic gradient descent steps and projection steps. Its performance is reported to be better than SVM^perf. Another stochastic gradient implementation similar to Pegasos is by Bottou (2007).

All the above algorithms are iterative procedures, which update w at each iteration and generate a sequence $\{w^k\}_{k=0}^{\infty}$. To distinguish these approaches, we consider the two extremes of optimization methods mentioned in (Lin et al., 2007):

Low cost per iteration; slow convergence.
High cost per iteration; fast convergence.

Among the methods discussed above, Pegasos randomly subsamples a few instances at a time, so the cost per iteration is low, but the number of iterations is high.

In contrast, Newton methods such as TRON spend significant effort at each iteration, but converge at fast rates. In large-scale scenarios, an approximate solution of the optimization problem is usually enough to produce a good model. Thus, methods with a low cost per iteration may be preferred, as they can quickly generate a reasonable model. However, if one specifies an unsuitable stopping condition, such methods may suffer from lengthy iterations. A recent overview of the tradeoff between learning accuracy and optimization cost is by Bottou and Bousquet (2008).

Coordinate descent is a common unconstrained optimization technique, but its use for large linear SVM has not been exploited much.¹ In this paper, we aim at applying it to L2-SVM. A coordinate descent method updates one component of w at a time by solving a one-variable sub-problem. It is competitive if one can exploit efficient ways to solve the sub-problem. For L2-SVM, the sub-problem minimizes a single-variable piecewise quadratic function, which is differentiable but not twice differentiable. An earlier paper using coordinate descent for L2-SVM is (Zhang and Oles, 2001). Their algorithm, called CMLS, applies a modified Newton method to approximately solve the one-variable sub-problem. Here, we propose another modified Newton method, which obtains an approximate solution by line searches. Two key properties differentiate our method from CMLS:

1. Our proposed method attempts to use the full Newton step if possible, while CMLS takes a more conservative step. Our setting thus leads to faster convergence.

2. CMLS maintains the strict decrease of the function value, but does not prove convergence. We prove that our method globally converges to the unique minimum.

We call $\hat{w}$ an $\epsilon$-accurate solution if
$$f(\hat{w}) \le \min_w f(w) + \epsilon.$$
We prove that our procedure obtains an $\epsilon$-accurate solution in $O(\log(1/\epsilon))$ iterations; from an optimization point of view, it converges at a linear rate. Experiments show that our proposed method is more efficient and stable than existing algorithms such as CMLS, Pegasos, SVM^perf and TRON.

¹ For SVM with kernels, decomposition methods are popular, and they are related to coordinate descent methods. Since we focus on linear SVM in this paper, we do not discuss their relationships.

Algorithm 1 Coordinate descent algorithm for L2-SVM
1. Start with any initial $w^0$.
2. For $k = 0, 1, \ldots$ (outer iterations)
   (a) For $i = 1, 2, \ldots, n$ (inner iterations)
       i. Fix $w_1^{k+1}, \ldots, w_{i-1}^{k+1}, w_{i+1}^{k}, \ldots, w_n^{k}$ and approximately solve the sub-problem (4) to obtain $w_i^{k+1}$.

The organization of this paper is as follows. In Section 2, we describe and analyze our algorithm. Several implementation issues are discussed in Section 3. In Sections 4 and 5, we discuss and compare our approach with some existing optimization methods such as CMLS, TRON, SVM^perf and Pegasos. Results show that the proposed method is efficient and stable. Finally, we conclude our work in Section 6.

2 Solving Linear SVM via Coordinate Descent

In this section, we describe our coordinate descent method for solving the L2-SVM problem (3). The algorithm starts from an initial point $w^0$ and produces a sequence $\{w^k\}_{k=0}^{\infty}$. At each iteration, $w^{k+1}$ is constructed by sequentially updating each component of $w^k$. This process generates vectors $w^{k,i} \in R^n$, $i = 1, \ldots, n$, such that $w^{k,1} = w^k$, $w^{k,n+1} = w^{k+1}$, and
$$w^{k,i} = [w_1^{k+1}, \ldots, w_{i-1}^{k+1}, w_i^{k}, \ldots, w_n^{k}]^T \quad \text{for } i = 2, \ldots, n.$$
For updating $w^{k,i}$ to $w^{k,i+1}$, we solve the following one-variable sub-problem:
$$\min_z \; f(w_1^{k+1}, \ldots, w_{i-1}^{k+1}, w_i^{k} + z, w_{i+1}^{k}, \ldots, w_n^{k}) \;=\; \min_z \; f(w^{k,i} + z e_i), \qquad (4)$$
where $e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ is the $i$th standard basis vector. A description of the coordinate descent algorithm is in Algorithm 1. The objective of (4) can be rewritten as
$$D_i(z) = f(w^{k,i} + z e_i) = \frac{1}{2}(w^{k,i} + z e_i)^T (w^{k,i} + z e_i) + C \sum_{j \in I(w^{k,i} + z e_i)} b_j^2(w^{k,i} + z e_i), \qquad (5)$$
where $b_j(w) = 1 - y_j w^T x_j$ and $I(w) = \{j \mid b_j(w) > 0\}$.

In any interval of z where the set $I(w^{k,i} + z e_i)$ does not change, $D_i(z)$ is quadratic. Therefore, $D_i(z)$, $z \in R$, is a piecewise quadratic function. As the Newton method is suitable for quadratic optimization, we apply it here for the minimization. If $D_i(z)$ were twice differentiable, the Newton direction at $z = 0$ would be
$$d = -\frac{D_i'(0)}{D_i''(0)}. \qquad (6)$$
The first derivative of $D_i(z)$ is
$$D_i'(z) = w_i^{k,i} + z - 2C \sum_{j \in I(w^{k,i} + z e_i)} y_j x_{ji}\, b_j(w^{k,i} + z e_i), \qquad (7)$$
where $x_{ji}$ denotes the $i$th component of $x_j$. Unfortunately, $D_i(z)$ is not twice differentiable, as the last term of $D_i'(z)$ is not differentiable at $\{z \mid b_j(w^{k,i} + z e_i) = 0 \text{ for some } j\}$. We follow (Mangasarian, 2002) to define the generalized second derivative:
$$D_i''(z) = 1 + 2C \sum_{j \in I(w^{k,i} + z e_i)} y_j^2 x_{ji}^2 = 1 + 2C \sum_{j \in I(w^{k,i} + z e_i)} x_{ji}^2. \qquad (8)$$
A simple Newton method to solve (4) begins with $z^0 = 0$ and iteratively updates z until $D_i'(z) = 0$:
$$z^{t+1} = z^t - D_i'(z^t)/D_i''(z^t) \quad \text{for } t = 0, 1, \ldots. \qquad (9)$$
Mangasarian (2002) proved that, under an assumption, this procedure terminates in finitely many steps and solves (4). Coordinate descent methods are known to converge if at each inner iteration we uniquely attain the minimum of the sub-problem (Bertsekas, 1999, Proposition 2.7.1). Unfortunately, the assumption in (Mangasarian, 2002) may not hold in real cases, so taking the full Newton step (9) may not decrease the function $D_i(z)$. Furthermore, solving the sub-problem exactly is too expensive.
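For reference, the following sketch (ours, with the dense matrix `X` and the helper name `Di_derivatives` as illustrative choices) evaluates (7) and (8) and runs the plain Newton iteration (9) on a toy problem.

```python
import numpy as np

def Di_derivatives(z, i, w, X, y, C):
    """First derivative (7) and generalized second derivative (8) of D_i at z."""
    u = w.copy()
    u[i] += z                              # w^{k,i} + z e_i
    b = 1.0 - y * (X @ u)                  # b_j(w^{k,i} + z e_i)
    act = b > 0                            # the index set I(w^{k,i} + z e_i)
    d1 = u[i] - 2.0 * C * np.sum(y[act] * X[act, i] * b[act])
    d2 = 1.0 + 2.0 * C * np.sum(X[act, i] ** 2)
    return d1, d2

# The unguarded Newton iteration (9) on the one-variable function (toy data).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4)); y = np.sign(rng.standard_normal(20))
w = np.zeros(4); i, z, C = 2, 0.0, 1.0
for _ in range(50):
    d1, d2 = Di_derivatives(z, i, w, X, y, C)
    if abs(d1) < 1e-12:
        break
    z -= d1 / d2
print("approximate minimizer of D_i:", z)
```

This is the plain iteration discussed above; the safeguard proposed in this paper is the line search with the sufficient decrease condition introduced below.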

An earlier approach that uses coordinate descent for L2-SVM without exactly solving the sub-problem is (Zhang and Oles, 2001). In their algorithm, CMLS, the approximate solution is restricted to a region. By evaluating an upper bound of the generalized second derivative in this region, one replaces the denominator of the Newton step (9) with that upper bound. This setting guarantees the decrease of $D_i(z)$. However, there are two problems. First, decreasing the function value does not imply that $\{w^k\}$ converges to the global optimum. Second, the step size generated by evaluating the upper bound of the generalized second derivative may be too conservative. We describe the details of CMLS in Section 4.3.

While coordinate descent methods have been well studied in optimization, most convergence analyses assume that the one-variable sub-problem is exactly solved. We consider the result in (Grippo and Sciandrone, 1999), which establishes convergence by requiring only the following sufficient decrease condition:
$$D_i(z) - D_i(0) \le -\sigma z^2, \qquad (10)$$
where $\sigma < 1/2$ is a positive constant. We now explain that, in general, the Newton direction (6) satisfies this condition. If $D_i(z)$ is quadratic around $z = 0$, then
$$D_i(z) - D_i(0) = D_i'(0)\, z + \frac{1}{2} D_i''(0)\, z^2.$$
Using $D_i''(0) \ge 1$ from (8), the Newton step $z = -D_i'(0)/D_i''(0)$ leads to
$$D_i(z) - D_i(0) = -\frac{D_i'(0)^2}{2 D_i''(0)} \le -\sigma \frac{D_i'(0)^2}{D_i''(0)^2} = -\sigma z^2,$$
so (10) holds. Around $z = 0$, $D_i(z)$ is almost quadratic, so very probably the same situation occurs. If not, we can reduce the step size using the following theorem:

Theorem 1 Given $d$ as in (6), there is $\bar{\lambda} > 0$ such that $z = \lambda d$ satisfies (10) for all $0 < \lambda \le \bar{\lambda}$.

The proof is in Appendix A.1. Therefore, at each inner iteration of Algorithm 1, we take the Newton direction $d$ as in (6) and then sequentially check $\lambda = 1, \beta, \beta^2, \ldots$, where $\beta \in (0, 1)$, until $\lambda d$ satisfies (10). Algorithm 2 lists the details of the line search procedure. In contrast to CMLS, which uses a modified Newton direction, we have explained that our procedure in general takes the full Newton step (i.e., $\lambda = 1$ satisfies (10)). Hence, we expect our method to converge more quickly.

We now discuss the parameters of our algorithm. First, as $\lambda = 1$ is often successful, our algorithm is insensitive to $\beta$; we choose $\beta = 0.5$. Second, there is a parameter $\sigma$ in (10). A smaller $\sigma$ gives a looser sufficient decrease condition, which reduces the time spent on the line search but increases the number of outer iterations. A common choice of $\sigma$ is 0.01 in unconstrained optimization algorithms.

Algorithm 2 Solving the sub-problem using the Newton direction with the line search
1. Given $w^{k,i}$. Set $\beta = 1/2$.
2. Calculate the Newton direction $d = -D_i'(0)/D_i''(0)$.
3. Compute $\lambda = \max\{1, \beta, \beta^2, \ldots\}$ such that $z = \lambda d$ satisfies (10).

Regarding the convergence of our algorithm, with the framework in (Grippo and Sciandrone, 1999), we have

Theorem 2 If Algorithm 2 is used to approximately solve the sub-problem (4), then the sequence $\{w^k\}$ generated by Algorithm 1 converges to the global minimum of (3).

The proof is in Appendix A.2. We further discuss the convergence rate, which helps to understand how fast our algorithm converges. We prove that there is $\mu \in (0, 1)$ such that
$$f(w^{k+1}) - f(w^*) \le \mu\big(f(w^k) - f(w^*)\big), \quad \forall k, \qquad (11)$$
where $w^*$ is the global minimum. Therefore, we have the following theorem:

Theorem 3 Algorithm 1 converges linearly.

The proof is in Appendix A.3. By (11), $f(w^k) - f(w^*) \le \mu^k (f(w^0) - f(w^*))$, so our algorithm obtains an $\epsilon$-accurate solution in $O(\log(1/\epsilon))$ iterations.

Next, we investigate the computational complexity of Algorithm 1. The main cost comes from solving the sub-problem by Algorithm 2. In Step 2 of Algorithm 2, to evaluate $D_i'(0)$ and $D_i''(0)$, we need $b_j(w^{k,i})$ for all j. Here we consider sparse data instances. Assume the total number of nonzero values in $x_1, \ldots, x_l$ is #nz. Calculating $b_j(w)$, $j = 1, \ldots, l$ from scratch takes O(#nz) operations, which is expensive. However, one can use the following trick to save time:
$$b_j(w + z e_i) = b_j(w) - z y_j x_{ji}. \qquad (12)$$
If $b_j(w)$, $j = 1, \ldots, l$ are available, then obtaining $b_j(w + z e_i)$ involves only the nonzero $x_{ji}$'s of the $i$th feature. Assume that on average
$$m = \frac{\#nz}{n}$$
is the number of nonzero values per feature. Using (12), obtaining all $b_j(w + z e_i)$ costs O(m). To have $b_j(w^0)$, we can start with $w^0 = 0$, so $b_j(w^0) = 1$ for all j. With $b_j(w^{k,i})$ available, the cost of evaluating $D_i'(0)$ and $D_i''(0)$ is O(m).
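To make the preceding cost discussion concrete, here is a sketch (ours, not the paper's implementation) of one inner iteration of Algorithm 2: the Newton direction (6), the line search for the sufficient decrease condition (10) evaluated via (13), and the maintenance of $b_j$ by (12). The function name `solve_subproblem` and the dense column argument are our own choices.

```python
import numpy as np

def solve_subproblem(i, w, b, x_col, y, C, sigma=0.01, beta=0.5):
    """Approximately solve sub-problem (4) for coordinate i as in Algorithm 2.

    b holds b_j = 1 - y_j w^T x_j for all j and is updated in place via (12);
    x_col is the i-th feature column (only its nonzeros are touched).
    """
    nz = np.nonzero(x_col)[0]
    xv, yv, bv = x_col[nz], y[nz], b[nz]

    act = bv > 0
    d1 = w[i] - 2.0 * C * np.sum(yv[act] * xv[act] * bv[act])    # D_i'(0), eq. (7)
    d2 = 1.0 + 2.0 * C * np.sum(xv[act] ** 2)                    # D_i''(0), eq. (8)
    d = -d1 / d2                                                 # Newton direction (6)

    lam = 1.0
    while True:                                # try lambda = 1, beta, beta^2, ...
        z = lam * d
        b_new = bv - z * yv * xv               # eq. (12), restricted to nonzeros
        diff = 0.5 * ((w[i] + z) ** 2 - w[i] ** 2) + C * (
            np.sum(np.maximum(b_new, 0.0) ** 2) - np.sum(np.maximum(bv, 0.0) ** 2))
        if diff <= -sigma * z * z:             # sufficient decrease condition (10)
            break
        lam *= beta

    w[i] += z
    b[nz] = b_new                              # keep b consistent for the next coordinate
    return w, b
```

Sweeping $i = 1, \ldots, n$ with this routine, with b initialized to the all-ones vector (i.e., starting from $w^0 = 0$), gives one outer iteration of Algorithm 1.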

Then in Step 3 of Algorithm 2, we need several line search steps with $\lambda = 1, \beta, \beta^2, \ldots$. For each $\lambda$, the main cost is calculating
$$D_i(\lambda d) - D_i(0) = \frac{1}{2}\Big((w_i^{k,i} + \lambda d)^2 - (w_i^{k,i})^2\Big) + C \Big( \sum_{j \in I(w^{k,i} + \lambda d e_i)} b_j(w^{k,i} + \lambda d e_i)^2 - \sum_{j \in I(w^{k,i})} b_j(w^{k,i})^2 \Big). \qquad (13)$$
Note that from (12), if $x_{ji} = 0$, then $b_j(w^{k,i} + \lambda d e_i) = b_j(w^{k,i})$. Hence, (13) involves no more than O(m) operations. In summary, Algorithm 2 costs
$$O(m) \text{ for evaluating } D_i'(0) \text{ and } D_i''(0) \;+\; O(m) \times (\text{number of line search steps}).$$
From the explanation above and our experiments, the sufficient decrease condition in general holds when $\lambda = 1$. Then the cost of Algorithm 2 is about O(m).

3 Implementation Issues

In this section, we discuss some techniques for a fast implementation of Algorithm 1. First, we consider suitable data representations. Second, we derive a lower bound of $\bar{\lambda}$ in Theorem 1, so that the evaluation of $D_i(\lambda d)$ in the sufficient decrease condition (10) may be saved. Finally, we show that the order of sub-problems at each iteration can be any permutation of $\{1, \ldots, n\}$. Experiments in Section 5 indicate that using a random permutation is superior to using the fixed order $1, \ldots, n$.

3.1 Data Representation

For sparse data, we use a sparse matrix
$$X = \begin{bmatrix} x_1^T \\ \vdots \\ x_l^T \end{bmatrix}$$
to store the training instances.
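As an illustration of the storage choice discussed in the next paragraph, the following snippet (ours; it uses SciPy's `csc_matrix`, which the paper does not mention) shows how the column format exposes the nonzeros of one feature:

```python
import numpy as np
from scipy import sparse

# A tiny 3x3 example; real data would be loaded from a file.
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
X = sparse.csc_matrix((vals, (rows, cols)), shape=(3, 3))   # column (CSC) format

i = 2                                      # feature index
start, end = X.indptr[i], X.indptr[i + 1]
idx = X.indices[start:end]                 # instances j with x_ji != 0
val = X.data[start:end]                    # the corresponding values x_ji
print(idx, val)                            # -> [0 2] [2. 5.]
```

The per-feature sub-problems only need exactly these contiguous slices, which is why the column format fits Algorithm 1 well.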

There are several ways to implement a sparse matrix. Two common ones are the row format and the column format (Duff and Meurant, 1989). For data classification, the column (row) format allows us to easily access any particular feature (instance). In our case, as we decompose problem (3) into sub-problems over features, the column format is more suitable.

3.2 An Easy Check of the Sufficient Decrease Condition (10)

In Section 2, we apply a Newton step to approximately solve the sub-problem, and Algorithm 1 converges if the sufficient decrease condition (10) holds. Each line search step takes O(m) time to evaluate $D_i(\lambda d)$, where m is the average number of nonzero values per feature. To save the time for calculating $D_i(\lambda d)$, we derive a lower bound of $\bar{\lambda}$ in Theorem 1:

Theorem 4 Assume $d = -D_i'(0)/D_i''(0)$. If
$$\lambda \le \frac{D_i''(0)}{H_i/2 + \sigma}, \qquad (14)$$
where $H_i = 1 + 2C \sum_{j=1}^{l} x_{ji}^2$, then $z = \lambda d$ satisfies the sufficient decrease condition (10).

The proof is in Appendix A.4. $H_i$ is independent of w, so it can be precomputed before training. Furthermore, we already evaluate $D_i''(0)$ when computing the Newton step, so it takes only constant time to check (14). In Step 3 of Algorithm 2, we sequentially try $\lambda = 1, \beta, \beta^2, \ldots$. Before calculating (10) with a smaller $\lambda$, we check whether (14) holds. If it does, there is no need to evaluate the new $D_i(\lambda d)$. If $\lambda = 1$ already satisfies (14), the line search procedure is essentially waived, so the computational time is effectively reduced.

3.3 Random Permutation of Sub-problems

In Section 2, we propose a coordinate descent algorithm which solves the one-variable sub-problems in the order $w_1, \ldots, w_n$. One may instead use an arbitrary order. To prove convergence, we only require that each sub-problem be solved once per outer iteration. At the kth iteration, we construct a random permutation $\pi_k$ of $\{1, \ldots, n\}$ and sequentially minimize with respect to the variables $w_{\pi_k(1)}, w_{\pi_k(2)}, \ldots, w_{\pi_k(n)}$.
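A minimal sketch (ours) of one outer iteration with a random permutation $\pi_k$; for brevity each sub-problem here takes a single undamped Newton step, whereas the full method would use Algorithm 2 with its line search.

```python
import numpy as np

def outer_iteration_permuted(w, b, X, y, C, rng):
    """One outer iteration sweeping the coordinates in a random order pi_k.

    b holds 1 - y_j w^T x_j and is maintained via (12).
    """
    for i in rng.permutation(w.shape[0]):
        col = X[:, i]
        act = b > 0
        d1 = w[i] - 2.0 * C * np.sum(y[act] * col[act] * b[act])   # D_i'(0)
        d2 = 1.0 + 2.0 * C * np.sum(col[act] ** 2)                 # D_i''(0)
        z = -d1 / d2                                               # full Newton step
        w[i] += z
        b -= z * y * col
    return w, b

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5)); y = np.sign(rng.standard_normal(30))
w, b = np.zeros(5), np.ones(30)            # w^0 = 0 gives b_j = 1 for all j
for k in range(10):
    w, b = outer_iteration_permuted(w, b, X, y, C=1.0, rng=rng)
print(w)
```

Replacing `rng.permutation(...)` by a single randomly drawn index per update gives the fully stochastic variant (Algorithm 3) discussed below.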

Similar to Algorithm 1, the algorithm generates a sequence $\{w^{k,i}\}$ such that $w^{k,1} = w^k$, $w^{k,n+1} = w^{k+1,1}$, and
$$w_t^{k,i} = \begin{cases} w_t^{k+1} & \text{if } \pi_k^{-1}(t) < i, \\ w_t^{k} & \text{if } \pi_k^{-1}(t) \ge i. \end{cases} \qquad (15)$$
The update from $w^{k,i}$ to $w^{k,i+1}$ is by
$$w_t^{k,i+1} = w_t^{k,i} + \arg\min_z f(w^{k,i} + z e_{\pi_k(i)}) \quad \text{if } \pi_k^{-1}(t) = i,$$
and $w_t^{k,i+1} = w_t^{k,i}$ otherwise. Next we prove the convergence of the generalized algorithm:

Theorem 5 If Algorithm 2 is used to approximately solve the sub-problem (4), then the sequence $\{w^{k,i}\}$ generated by Algorithm 1 with random permutations $\pi_k$ converges to the global minimum of (3).

Theorem 6 Algorithm 1 with random permutations $\pi_k$ converges linearly.

The proofs are in Appendices A.5 and A.6. Note that Theorems 2 and 3 are special cases of these theorems. Experiments in Section 5 show that a random permutation of sub-problems leads to a faster algorithm.

If the number of features is very large, we may not need to go through all of $\{w_1, \ldots, w_n\}$ at each iteration. Instead, one can have a completely random setting by arbitrarily choosing one feature at a time. That is, from $w^k$ to $w^{k+1}$ we modify only one component. A description is in Algorithm 3.

Algorithm 3 Coordinate descent algorithm randomly selecting one feature at a time
1. Start with any initial $w^0$.
2. For $k = 0, 1, \ldots$
   (a) Randomly choose $i_k \in \{1, 2, \ldots, n\}$.
   (b) Fix $w_1^{k}, \ldots, w_{i_k - 1}^{k}, w_{i_k + 1}^{k}, \ldots, w_n^{k}$ and approximately solve the sub-problem (4) to obtain $w_{i_k}^{k+1}$.

The following theorem indicates the convergence rate in expectation:

Theorem 7 Let $\delta \in (0, 1)$. Algorithm 3 requires $O\big(\log\big(\frac{1}{\epsilon\delta}\big)\big)$ iterations to obtain an $\epsilon$-accurate solution with confidence $1 - \delta$.

The proof is in Appendix A.7.

4 Related Methods

In this section, we discuss three existing schemes for large-scale linear SVM. They will be compared in Section 5. The first one is Pegasos (Shalev-Shwartz et al., 2007), which is notable for its efficiency in training linear L1-SVM. The second one is a trust region Newton method (Lin et al., 2007); it is one of the fastest implementations for L2-SVM. The last one is CMLS, a coordinate descent method introduced in (Zhang and Oles, 2001).

4.1 Pegasos for L1-SVM

We briefly introduce the Pegasos algorithm (Shalev-Shwartz et al., 2007), an efficient method for solving the L1-SVM formulation (2). The algorithm has two parameters: the subsample size K and the penalty parameter C of SVM. It begins with an initial $w^0$ whose norm is at most $\sqrt{Cl}$. At each iteration k, it randomly selects a set $A_k \subset \{(x_j, y_j)\}_{j=1,\ldots,l}$ of size K as a subsample of training instances and sets the learning rate
$$\eta^k = \frac{1}{k}. \qquad (16)$$
Then it updates $w^k$ with the following rule:
$$w^{k+1} = \min\Big(1, \frac{\sqrt{Cl}}{\|w^{k+\frac{1}{2}}\|}\Big)\, w^{k+\frac{1}{2}}, \qquad
w^{k+\frac{1}{2}} = w^k - \eta^k \nabla^k, \qquad
\nabla^k = w^k - \frac{Cl}{K} \sum_{j \in A_k^+(w^k)} y_j x_j, \qquad
A_k^+(w) = \{j \in A_k \mid 1 - y_j w^T x_j > 0\}, \qquad (17)$$
where $\nabla^k$ is a sub-gradient of the approximate objective function
$$f_k(w) = \frac{1}{2} w^T w + \frac{Cl}{K} \sum_{j \in A_k} \max(1 - y_j w^T x_j, 0).$$
Here $w^{k+1/2}$ is obtained by the stochastic gradient descent step, and $w^{k+1}$ is the projection of $w^{k+1/2}$ onto the set $\{w \mid \|w\| \le \sqrt{Cl}\}$. Algorithm 4 lists the details of the Pegasos algorithm.

Algorithm 4 Pegasos algorithm for solving L1-SVM
1. Given C and $w^0$ with $\|w^0\| \le \sqrt{Cl}$.
2. For $k = 0, 1, \ldots$
   (a) Set the learning rate $\eta^k$ by (16).
   (b) Obtain $w^{k+1}$ by (17).

In subsequent experiments, we set the subsample size K to one, as suggested in (Shalev-Shwartz et al., 2007).
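A compact sketch (ours) of the Pegasos update (16)-(17) with subsample size K = 1; the iteration counter starts at 1 here simply to avoid dividing by zero in the learning rate.

```python
import numpy as np

def pegasos_l1svm(X, y, C, n_steps, rng):
    """Pegasos for L1-SVM (2) with K = 1: sub-gradient step, then projection
    onto the ball {w : ||w|| <= sqrt(C*l)}."""
    l, n = X.shape
    radius = np.sqrt(C * l)
    w = np.zeros(n)
    for t in range(1, n_steps + 1):
        j = rng.integers(l)                      # the sampled instance A_k
        grad = w.copy()                          # sub-gradient of f_k(w)
        if y[j] * (X[j] @ w) < 1.0:              # j is in A_k^+(w)
            grad -= C * l * y[j] * X[j]
        w -= (1.0 / t) * grad                    # learning rate (16)
        norm = np.linalg.norm(w)
        if norm > radius:                        # projection step of (17)
            w *= radius / norm
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)); y = np.sign(rng.standard_normal(50))
print(pegasos_l1svm(X, y, C=1.0, n_steps=2000, rng=rng))
```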

At each iteration, Pegasos chooses only one training instance to update w. Hence, the number of operations per iteration is proportional to the average number of non-zero elements per training instance. This method is related to our Algorithm 3: both have stochastic settings, but instead of using one instance per iteration, Algorithm 3 uses one feature per iteration. Theorem 7 indicates that Algorithm 3 requires $O(\log(1/\epsilon))$ iterations to obtain an $\epsilon$-accurate solution, while Shalev-Shwartz et al. (2007) prove that Pegasos requires a larger number, $O(1/\epsilon)$.

As Pegasos randomly selects a subset of training instances, it may be influenced by noisy data and hence give unstable performances. We elaborate on this point in Section 5.2. Regarding the stopping condition, since at each iteration Pegasos only takes one sample to update w, neither function nor gradient information is available. This keeps Pegasos from having a suitable stopping condition. Shalev-Shwartz et al. (2007) suggest setting a maximal number of iterations; however, deciding a suitable value may be difficult. We will discuss stopping conditions of Pegasos and other methods in Section 5.3.

4.2 Trust Region Newton Method (TRON) for L2-SVM

The optimization procedure of TRON (Lin et al., 2007) has two layers of iterations. At each outer iteration k, TRON sets a size $\Delta_k$ of the trust region and builds a quadratic model
$$q_k(s) = \nabla f(w^k)^T s + \frac{1}{2} s^T \nabla^2 f(w^k)\, s \qquad (18)$$
as an approximation of $f(w^k + s) - f(w^k)$, where f(w) is the objective function in (3) and $\nabla^2 f(w)$ is the generalized Hessian (Mangasarian, 2002) of f(w).

Then an inner conjugate gradient procedure approximately finds the Newton direction by minimizing
$$\min_s \; q_k(s) \quad \text{subject to } \|s\| \le \Delta_k. \qquad (19)$$
TRON updates $w^k$ and $\Delta_k$ by the following rules:
$$w^{k+1} = \begin{cases} w^k + s^k & \text{if } \rho_k > \eta_0, \\ w^k & \text{if } \rho_k \le \eta_0, \end{cases}
\qquad
\Delta_{k+1} \in \begin{cases} [\sigma_1 \min\{\|s^k\|, \Delta_k\},\; \sigma_2 \Delta_k] & \text{if } \rho_k \le \eta_1, \\ [\sigma_1 \Delta_k,\; \sigma_3 \Delta_k] & \text{if } \rho_k \in (\eta_1, \eta_2), \\ [\Delta_k,\; \sigma_3 \Delta_k] & \text{if } \rho_k \ge \eta_2, \end{cases}
\qquad
\rho_k = \frac{f(w^k + s^k) - f(w^k)}{q_k(s^k)}, \qquad (20)$$
where $\rho_k$ is the ratio of the actual reduction in the objective function to the reduction predicted by the model $q_k(s)$. Users pre-specify the parameters $\eta_0 > 0$, $1 > \eta_2 > \eta_1 > 0$, and $\sigma_3 > 1 > \sigma_2 > \sigma_1 > 0$. We use $\eta_0 = 10^{-4}$, $\eta_1 = 0.25$, $\eta_2 = 0.75$, $\sigma_1 = 0.25$, $\sigma_2 = 0.5$, $\sigma_3 = 4$, as suggested in (Lin et al., 2007). The procedure is listed in Algorithm 5.

Algorithm 5 Trust region Newton method for L2-SVM
1. Given $w^0$.
2. For $k = 0, 1, \ldots$
   (a) Find an approximate solution $s^k$ of the trust region sub-problem (19).
   (b) Update $w^k$ and $\Delta_k$ according to (20).

Regarding the computational complexity, if the total number of nonzero elements in the training instances is #nz, then the main cost per TRON iteration is
$$O(\#nz) \times (\text{number of conjugate gradient iterations}).$$
Compared to our approach or Pegasos, the cost per TRON iteration is high, which keeps TRON from quickly obtaining a usable model. However, when w gets close to the minimum, TRON takes Newton steps to achieve fast convergence. We give more observations in the experiment section.
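The rule (20) only constrains the new radius to an interval, so an implementation must pick a point inside it. The sketch below (ours, not the code of TRON/LIBLINEAR) shows one such bookkeeping step; the conjugate gradient solver for (19) is omitted.

```python
import numpy as np

def trust_region_update(w, s, f_w, f_ws, q_s, delta,
                        eta0=1e-4, eta1=0.25, eta2=0.75,
                        sigma1=0.25, sigma3=4.0):
    """Accept or reject the step s and resize the radius following (20).

    f_w, f_ws: f(w) and f(w + s); q_s: the model value q_k(s); delta: radius.
    """
    rho = (f_ws - f_w) / q_s                     # actual vs. predicted reduction
    if rho > eta0:
        w = w + s                                # accept the step
    if rho <= eta1:                              # poor agreement: shrink
        delta = sigma1 * min(np.linalg.norm(s), delta)
    elif rho < eta2:                             # acceptable: keep the radius
        pass                                     # (any value in [sigma1*delta, sigma3*delta] is allowed)
    else:                                        # very good agreement: enlarge
        delta = sigma3 * delta
    return w, delta
```

Here the lower endpoint is chosen when shrinking and the upper endpoint when enlarging; an actual TRON implementation is free to pick any value inside the intervals of (20).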

4.3 CMLS: A Coordinate Descent Method for L2-SVM

In Sections 2 and 3, we introduced our coordinate descent method for solving L2-SVM. Here, we discuss the earlier work (Zhang and Oles, 2001), which also applies the coordinate descent technique. Zhang and Oles refer to their method as CMLS. At each outer iteration k, it sequentially minimizes the sub-problems (5), updating one variable of (3) at a time. In solving the sub-problem, Zhang and Oles (2001) mention that using line searches may result in small step sizes. Hence, CMLS applies a technique similar to the trust region method: it sets a size $\Delta_{k,i}$ of the trust region, evaluates the first derivative (7) of (5), and calculates an upper bound of the generalized second derivative subject to $|z| \le \Delta_{k,i}$:
$$U_i(z) = 1 + \sum_{j=1}^{l} \beta_j(w^{k,i} + z e_i), \qquad
\beta_j(w) = \begin{cases} 2C x_{ji}^2 & \text{if } y_j w^T x_j \le 1 + \Delta_{k,i} |x_{ji}|, \\ 0 & \text{otherwise.} \end{cases}$$
Then the step z is obtained as
$$z = \min\Big(\max\Big(-\frac{D_i'(0)}{U_i(0)}, -\Delta_{k,i}\Big), \Delta_{k,i}\Big). \qquad (21)$$
The updating rule of $\Delta$ is
$$\Delta_{k+1,i} = 2|z| + \epsilon, \qquad (22)$$
where $\epsilon$ is a small constant.
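To contrast with Algorithm 2, here is a sketch (ours) of one CMLS coordinate step implementing (21)-(22); the smoothing parameter $c_k$ introduced next is ignored (i.e., $c_k = 0$), and the form of the bound $U_i$ follows the description above.

```python
import numpy as np

def cmls_coordinate_step(i, w, X, y, C, delta_i, eps=1e-10):
    """One CMLS update for coordinate i: clipped step (21) and radius update (22)."""
    col = X[:, i]
    scores = y * (X @ w)
    b = 1.0 - scores
    act = b > 0
    d1 = w[i] - 2.0 * C * np.sum(y[act] * col[act] * b[act])      # D_i'(0), eq. (7)
    # Upper bound of the generalized second derivative over |z| <= delta_i.
    could_act = scores <= 1.0 + delta_i * np.abs(col)
    U = 1.0 + 2.0 * C * np.sum(col[could_act] ** 2)
    z = float(np.clip(-d1 / U, -delta_i, delta_i))                # eq. (21)
    new_delta = 2.0 * abs(z) + eps                                # eq. (22)
    return z, new_delta
```

Because U is at least $D_i''(0)$, this step is never longer than the full Newton step of Algorithm 2, which is the conservativeness discussed in Section 5.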

In order to speed up the process, Zhang and Oles (2001) smooth the objective function of the sub-problems with a parameter $c_k \in [0, 1]$: the sub-problem objective (5) is replaced by
$$\frac{1}{2}(w^{k,i} + z e_i)^T (w^{k,i} + z e_i) + C \sum_{j=1}^{l} \bar{b}_j^2(w^{k,i} + z e_i), \qquad (23)$$
where
$$\bar{b}_j(w) = \begin{cases} 1 - y_j w^T x_j & \text{if } 1 - y_j w^T x_j > 0, \\ c_k (1 - y_j w^T x_j) & \text{otherwise.} \end{cases}$$
Following the setting in (Zhang and Oles, 2001), we choose
$$c_k = \max(0, 1 - k/50), \qquad (24)$$
and set the initial $\Delta_{0,i} = 10$, $\forall i$. We find that the results are insensitive to these parameters. The details of the CMLS algorithm are listed in Algorithm 6.

Algorithm 6 CMLS algorithm for solving L2-SVM
1. Given $w^0 = 0$ and $\Delta_{0,i} = 10$, $\forall i$.
2. For $k = 0, 1, \ldots$
   (a) Set $c_k$ by (24). Let $w^{k,1} = w^k$.
   (b) For $i = 1, 2, \ldots, n$
       i. Evaluate the step z by (21).
       ii. $w^{k,i+1} = w^{k,i} + z e_i$.
       iii. Update $\Delta$ by (22).
   (c) Let $w^{k+1} = w^{k,n+1}$.

Zhang and Oles (2001) prove that if $c_k = 0$, $\forall k$, then the objective function of (3) decreases after each inner iteration. However, such a property does not imply that Algorithm 6 converges to the minimum. In addition, CMLS updates w by (21), which is more conservative than Newton steps. In Section 5, we show that CMLS takes more time and more iterations than our method to obtain a solution.

5 Experiments and Analysis

In this section, we compare our proposed coordinate descent algorithm with the state-of-the-art linear SVM solvers TRON, Pegasos, SVM^perf and CMLS, in terms of stability and efficiency. We also discuss the stopping conditions of these methods.

5.1 Data Sets

Table 1 lists the number of instances (l), features (n), and non-zero elements (# nonzeros) of seven datasets. Except a9a, all of them come from document classification. Past studies show that for such data, linear SVM performs as well as kernelized SVM. A detailed description of astro-physic is in (Joachims, 2006), while the others are described in (Lin et al., 2007). To examine the testing accuracy, we use a stratified selection to split each set into 4/5 training and 1/5 testing.

Table 1: Dataset statistics: l is the number of instances and n is the number of features.

Problem       | l       | n         | # nonzeros
a9a           | 32,561  | 123       | 451,592
astro-physic  | 62,369  | 99,757    | 4,834,550
real-sim      | 72,309  | 20,958    | 3,709,083
news20        | 19,996  | 1,355,191 | 9,097,916
yahoo-japan   | 176,203 | 832,026   | 23,506,415
rcv1          | 677,399 | 47,236    | 49,556,258
yahoo-korea   | 460,554 | 3,052,939 | 156,436,656

5.2 Comparisons

We compare the following six implementations:

1. CD: the coordinate descent method described in Section 2. We choose $\sigma$ in (10) as 0.01.

2. CDPER: the method modified from CD by permuting the sub-problems in each outer iteration. See the discussion in Section 3.3.

3. Pegasos: the primal estimated sub-gradient solver for L1-SVM (Shalev-Shwartz et al., 2007). See the discussion in Section 4.1. The source code is available online.

4. TRON: the trust region Newton method of (Lin et al., 2007). See the discussion in Section 4.2. We use the L2-loss linear SVM implementation in the software LIBLINEAR (version 1.21 with option -s 3).

5. CMLS: a coordinate descent method for L2-SVM (Algorithm 3 of (Zhang and Oles, 2001)). It is discussed in Section 4.3.

6. SVM^perf: version 2.1, based on version 3.0 of SVM^struct. The source code is available online.

Notice that Pegasos and SVM^perf solve L1-SVM, while the others are for L2-SVM. We do not include the bias term in any of the solvers, since Shalev-Shwartz et al. (2007) and Joachims (2006) do not either. We choose the penalty parameter C = 1 for all experiments.²

All the above algorithms are implemented in C++ with double-precision floating-point numbers. Using single precision (e.g., (Bottou, 2007)) may reduce the computational time in some situations, but it may cause numerical inaccuracy. We conduct experiments on an Intel 2.66 GHz processor with 8 GB of main memory under Linux.

To see how quickly these methods decrease the objective value, we check the CPU time each needs to reduce the objective value to within 1% of the optimum, and present the results in Table 2. To obtain reference solutions for L1-SVM and L2-SVM, we run TRON with the stopping condition $\|\nabla f(w^k)\| \le 0.01$ and Pegasos with $10^9$ steps. Since the objective values are stable under such strict stopping conditions, these solutions are seen to be very close to the optima. As the optimal function values of L1-SVM and L2-SVM may be quite different, we use the relative difference of 1% rather than an absolute difference (e.g., the CPU time to reach 0.01 above the optimum).

Overall, our proposed algorithms CDPER and CD perform well on all the datasets. For the large datasets (rcv1, yahoo-japan, yahoo-korea), CD is competitive with Pegasos and significantly better than TRON, SVM^perf and CMLS. With the permutation of sub-problems, CDPER is even better than CD.

To show more detailed comparisons, Figure 1 presents time versus the relative difference of the function value to the optimum:
$$\frac{f(w^k) - f(w^*)}{|f(w^*)|}.$$
As a reference, we draw a horizontal dotted line to indicate the relative difference 1%. Consistent with the observation in Table 2, CDPER is more efficient and stable than the others.

The next comparison checks the relationship between training time and testing accuracy; that is, we investigate which method achieves a reasonable testing accuracy more quickly. Using the reference solutions described above, we obtain the final testing accuracy for L1-SVM and L2-SVM. Then in Figure 2, we plot the testing accuracy difference between the current model and the reference model along the training time. As an accurate solution of the SVM optimization problem does not imply the best testing accuracy, some implementations achieve higher accuracy before reaching the minimal function value.

² The objective function of Pegasos and (2) are equivalent by setting $\lambda = 1/(C \times \text{number of instances})$, where $\lambda$ is the penalty parameter used in (Shalev-Shwartz et al., 2007). The penalty parameter in SVM^perf is equivalent to ours by setting $C^{perf} = 0.01Cl$.
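For reproducing the comparison, a tiny helper (ours) that applies the parameter relations from the footnote above:

```python
def convert_penalty(C, l):
    """Map the C of (1) to the Pegasos lambda and the SVM-perf parameter,
    using lambda = 1/(C*l) and C_perf = 0.01*C*l from the footnote."""
    return {"pegasos_lambda": 1.0 / (C * l), "svmperf_C": 0.01 * C * l}

print(convert_penalty(C=1.0, l=100000))   # for a hypothetical training set of 100,000 instances
```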

Table 2: The training time for a solver to reduce the objective value to within 1% of the optimal value. Time is in seconds. Columns: CDPER, CD, Pegasos, TRON, CMLS, SVM^perf; rows: the datasets of Table 1. The approach with the shortest running time is boldfaced.

For the implementations solving L2-SVM, as the gradient is available, we are interested in how fast these methods decrease the norm of the gradient. Figure 3 shows the result. Overall, CDPER converges faster in the beginning, while TRON is the best for final convergence.

We do not include CD in Figures 1 and 2, because CDPER is better than it in almost all situations. Moreover, the result of SVM^perf is omitted to avoid too many curves in the same figure. We also modified CD to randomly select only one feature at each step (Algorithm 3); the result is similar to CDPER. To fit the subfigures on one page, we do not include the results of the real-sim dataset.

Below we give some observations from the experiments.

The only difference between CD and CDPER is that CD solves the sub-problems in a fixed order, while CDPER does not. Therefore, the cost per CD iteration is slightly lower than CDPER's, but CDPER requires fewer iterations to achieve a similar accuracy value. Moreover, CDPER decreases the norm of the gradient more quickly in Figure 3. This result seems to indicate that if the order of sub-problems is fixed, the update of variables becomes slower. However, as CD sequentially accesses features, it has better data locality in the computer memory hierarchy. An example is news20 in Table 2: as the number of features is much larger than the number of instances, two adjacent sub-problems of CDPER may access two features that are far apart in memory. Then the cost per CD iteration is only about 1/5 of CDPER's, so CD is better in this case.

From the experimental results, CD converges faster than CMLS. Both are coordinate descent methods, and the cost per iteration is similar. However, CMLS suffers from lengthy iterations because its modified Newton method takes a conservative step size. In Figure 2d, its testing accuracy does not even reach a reasonable value after 30 seconds. Conversely, CD usually uses full Newton steps, so it converges faster.

Figure 1 (panels: a9a, astro-physic, news20, rcv1, yahoo-japan, yahoo-korea): Time versus the relative difference of the objective value to the minimum. The dotted line indicates the relative difference 0.01. We show the training time for each solver to reach this ratio in Table 2. Time is in seconds.

Figure 2 (panels: a9a, astro-physic, news20, rcv1, yahoo-japan, yahoo-korea): The difference of testing accuracy between the current model and the reference model (obtained using strict stopping conditions) as the training time increases. Time is in seconds.

For example, when solving rcv1, CD takes the full Newton step in almost all inner iterations (we check up to 5.96 seconds). We also tried permuting the sub-problems in CMLS as CDPER does; however, the result is worse than both CD and CDPER.

Though Pegasos is efficient for several datasets, its testing accuracy is sometimes unstable (see Figures 2a and 2c). As Pegasos only subsamples a few training data to update $w^k$, it is easily influenced by noisy data. The instability is also due to the slow final convergence of Pegasos. Though it quickly obtains a solution within 1% of the optimal function value, Figure 1 shows that it then decreases the function value only slowly toward the end. This slow convergence makes the selection of stopping conditions (the maximal number of iterations for Pegasos) more difficult.

Finally, compared to TRON, CDPER is more efficient at yielding good testing accuracy (see Table 2 and Figure 2); however, if we check the value $\|\nabla f(w^k)\|$, Figure 3 shows that TRON converges faster in the end. This result is consistent with the discussion in Section 1 on distinguishing various optimization methods: a Newton method such as TRON has fast final convergence. Unfortunately, since the cost per TRON iteration is high and the Newton direction is not effective in the beginning, TRON is less efficient in the early stage of the optimization procedure.

5.3 Stopping Conditions

In this section, we discuss stopping conditions of our algorithm and of other existing methods. In solving a strictly convex optimization problem, the norm of the gradient is often used in the stopping condition, because
$$\nabla f(w) = 0 \iff w \text{ is the global minimum.}$$
For example, TRON checks whether the norm of the gradient is small enough before stopping. However, as our coordinate descent method updates one component of w at each inner iteration, we have only
$$D_i'(0) = \nabla f(w^{k,i})_i, \qquad i = 1, \ldots, n.$$
A direct extension of Theorem 2 shows that $w^{k,i} \to w^*$, so we have
$$D_i'(0) = \nabla f(w^{k,i})_i \to 0, \qquad \forall i. \qquad (25)$$
Therefore, by storing $D_i'(0)$, $\forall i$, at the end of the kth iteration, one can check whether $\sum_{i=1}^{n} D_i'(0)^2$ or $\max_i |D_i'(0)|$ is small enough. For Pegasos, we mentioned in Section 4.1 that one may need a maximal number of iterations as the stopping condition, due to the lack of function/gradient information.
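A minimal sketch (ours) of the stopping check suggested above, applied to the $D_i'(0)$ values stored during the kth outer iteration:

```python
import numpy as np

def should_stop(d1_values, tol=1e-3, use_max=True):
    """Stop when max_i |D_i'(0)| (or the 2-norm of the stored values) is small."""
    d1 = np.asarray(d1_values, dtype=float)
    measure = np.max(np.abs(d1)) if use_max else np.sqrt(np.sum(d1 ** 2))
    return measure <= tol

print(should_stop([1e-4, -2e-4, 5e-5]))   # -> True with tol = 1e-3
```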

Figure 3 (panels: a9a, astro-physic, news20, rcv1, yahoo-japan, yahoo-korea): The two-norm of the gradient versus the training time. Time is in seconds.

6 Conclusions

We propose and analyze a coordinate descent method for large-scale linear L2-SVM. The new method possesses sound optimization properties. Experiments indicate that our method is more stable and efficient than existing algorithms. We plan to extend our work to other challenging problems such as structured SVM.

References

Dimitri P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, MA, second edition, 1999.

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM Press, 1992.

Leon Bottou. Stochastic gradient descent examples, 2007. http://leon.bottou.org/projects/sgd.

Leon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

I. S. Duff and G. A. Meurant. The effect of ordering on preconditioned conjugate gradients. BIT, 29, 1989.

Luigi Grippo and Marco Sciandrone. Globally convergent block-coordinate techniques for unconstrained optimization. Optimization Methods and Software, 10, 1999.

Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2006.

S. Sathiya Keerthi and Dennis DeCoste. A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 2005.

Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for large-scale logistic regression. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.

Olvi L. Mangasarian. A finite Newton method for classification. Optimization Methods and Software, 17(5), 2002.

S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

Alex J. Smola, S. V. N. Vishwanathan, and Quoc Le. Bundle methods for machine learning. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.

Tong Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

Tong Zhang and Frank J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4(1):5-31, 2001.

A Appendix

In this appendix, we prove the theorems stated in the paper. To begin, we define an upper bound of $D_i''(z)$:
$$H_i = 1 + 2C \sum_{j=1}^{l} y_j^2 x_{ji}^2 = 1 + 2C \sum_{j=1}^{l} x_{ji}^2, \qquad (26)$$
and derive several lemmas. Lemma 1 below is related to a one-dimensional version of the generalized mean value theorem (Mangasarian, 2002, Proposition 3.2) for functions with Lipschitz continuous gradients.

Lemma 1 For any $i \in \{1, \ldots, n\}$ and any $d \in R$,
$$D_i'(z)\, d + \frac{1}{2} H_i d^2 \;\ge\; D_i(z + d) - D_i(z) \;\ge\; D_i'(z)\, d + \frac{1}{2} d^2. \qquad (27)$$

Proof. For simplicity, define $b_j = b_j(w^{k,i} + z e_i)$, where $w^{k,i}$ is the current solution. By (12), $b_j(w^{k,i} + (z+d) e_i) = b_j - d y_j x_{ji}$, so
$$D_i(z + d) - D_i(z) = (w_i^{k,i} + z)\, d + \frac{d^2}{2} + C\Big( \sum_{j:\, b_j - d y_j x_{ji} > 0} (b_j - d y_j x_{ji})^2 - \sum_{j:\, b_j > 0} b_j^2 \Big)$$
$$= (w_i^{k,i} + z)\, d + \frac{d^2}{2} + C \sum_{\substack{j:\, b_j - d y_j x_{ji} \le 0 \\ b_j > 0}} (-b_j^2) + C \sum_{\substack{j:\, b_j - d y_j x_{ji} > 0 \\ b_j \le 0}} (b_j - d y_j x_{ji})^2 + C \sum_{\substack{j:\, b_j - d y_j x_{ji} > 0 \\ b_j > 0}} \big(-2 d y_j x_{ji} b_j + d^2 x_{ji}^2\big). \qquad (28)$$
We first prove the first inequality of (27). For the terms of (28) with $b_j > 0$ and $b_j - d y_j x_{ji} \le 0$, we have
$$-b_j^2 \le -2 d y_j x_{ji} b_j + d^2 x_{ji}^2, \qquad (29)$$
since $-b_j^2 + 2 d y_j x_{ji} b_j - d^2 x_{ji}^2 = -(b_j - d y_j x_{ji})^2 \le 0$. For the terms with $b_j \le 0$ and $b_j - d y_j x_{ji} > 0$,
$$(b_j - d y_j x_{ji})^2 \le d^2 x_{ji}^2, \qquad (30)$$
because $0 < b_j - d y_j x_{ji} \le -d y_j x_{ji}$. Combining (29) and (30), we have
$$(28) \le (w_i^{k,i} + z)\, d + \frac{d^2}{2} + C \sum_{j:\, b_j > 0} \big(-2 d y_j x_{ji} b_j + d^2 x_{ji}^2\big) + C \sum_{j:\, b_j \le 0} d^2 x_{ji}^2 = (w_i^{k,i} + z)\, d - 2C \sum_{j:\, b_j > 0} d y_j x_{ji} b_j + \frac{d^2}{2} + C \sum_{j=1}^{l} x_{ji}^2 d^2 = D_i'(z)\, d + \frac{1}{2} H_i d^2.$$
Next, we prove the second inequality of (27). For the terms of (28) with $b_j > 0$ and $b_j - d y_j x_{ji} \le 0$, we have $0 < b_j \le d y_j x_{ji}$ and hence
$$-b_j^2 \ge -2 d y_j x_{ji} b_j.$$
The remaining sums of (28) are bounded below by $-2 d y_j x_{ji} b_j$ (for $b_j > 0$, $b_j - d y_j x_{ji} > 0$, since $d^2 x_{ji}^2 \ge 0$) and by zero (for $b_j \le 0$). Therefore,
$$(28) \ge (w_i^{k,i} + z)\, d + \frac{d^2}{2} - 2C \sum_{j:\, b_j > 0} d y_j x_{ji} b_j = D_i'(z)\, d + \frac{1}{2} d^2.$$

Lemma 2 For any direction d, if
$$\frac{D_i'(z)}{d} + \frac{H_i}{2} \le -\sigma, \qquad (31)$$
then $D_i(z + d) - D_i(z) \le -\sigma d^2$.

Proof. From Lemma 1 and (31),
$$D_i(z + d) - D_i(z) \le D_i'(z)\, d + \frac{1}{2} H_i d^2 \le -\sigma d^2.$$

Lemma 3 The sequence $\{w^{k,i}\}$ lies in a bounded set.

Proof. Since Algorithm 1 ensures the sufficient decrease condition (10), it suffices to prove that $A = \{w \mid f(w) \le f(w^0)\}$ is bounded. If this were not true, there would be a sequence $\{w^k\} \subset A$ with $\|w^k\| \to \infty$; then $f(w^k) \ge \frac{1}{2}\|w^k\|^2 \to \infty$, which contradicts $f(w^k) \le f(w^0)$. Hence the set is bounded.

A.1 Proof of Theorem 1

Proof. By Lemma 1, with $d = -D_i'(0)/D_i''(0)$,
$$D_i(\lambda d) - D_i(0) + \sigma \lambda^2 d^2 \le D_i'(0)\,\lambda d + \frac{1}{2} H_i \lambda^2 d^2 + \sigma \lambda^2 d^2 = -\lambda \frac{D_i'(0)^2}{D_i''(0)} + \frac{H_i}{2}\, \lambda^2 \frac{D_i'(0)^2}{D_i''(0)^2} + \sigma \lambda^2 \frac{D_i'(0)^2}{D_i''(0)^2} = \lambda\, \frac{D_i'(0)^2}{D_i''(0)} \left( \lambda\, \frac{H_i/2 + \sigma}{D_i''(0)} - 1 \right).$$
If we choose $\bar{\lambda} = \frac{D_i''(0)}{H_i/2 + \sigma}$, then (10) is satisfied for all $0 < \lambda \le \bar{\lambda}$.

A.2 Proof of Theorem 2

Proof. By setting $\pi_k(i) = i$, this theorem is a special case of Theorem 5.

A.3 Proof of Theorem 3

Proof. By setting $\pi_k(i) = i$, this theorem is a special case of Theorem 6.

A.4 Proof of Theorem 4

Proof. If $\lambda \le \frac{D_i''(0)}{H_i/2 + \sigma}$, then $-\frac{D_i''(0)}{\lambda} + \frac{H_i}{2} \le -\sigma$. Since $d = -D_i'(0)/D_i''(0)$, this is the same as
$$\frac{D_i'(0)}{\lambda d} + \frac{H_i}{2} \le -\sigma.$$
By Lemma 2, $z = \lambda d$ satisfies the sufficient decrease condition (10).

A.5 Proof of Theorem 5

To prove this theorem, we need the following lemma, which is also used for Theorem 6.

Lemma 4 The value $\lambda$ selected by Algorithm 2 satisfies $\lambda \ge \beta \bar{\lambda}$, where $\bar{\lambda} = \frac{D_i''(0)}{H_i/2 + \sigma}$.

Proof. By Theorem 1, any $\lambda \in [\beta\bar{\lambda}, \bar{\lambda}]$ satisfies the sufficient decrease condition (10). Since Algorithm 2 selects $\lambda$ by trying $\{1, \beta, \beta^2, \ldots\}$, the selected value is at least $\beta\bar{\lambda}$.

Proof of Theorem 5. We prove that the sequence $\{w^{k,i}\}$ globally converges to the unique minimum of (3). If this result were wrong, there would be a subsequence whose points are bounded away from the optimum. Since $\{w^{k,i}\}$ lies in a bounded set (by Lemma 3), from this subsequence there are an index s and an infinite index set K such that
$$\lim_{k \in K} w^{k,s} = \bar{w} \quad \text{and} \quad \bar{w} \text{ is not optimal.} \qquad (32)$$
Since $\{w^k\}_{k \in K}$ is an infinite sequence and the number of choices of $(\pi_k, \pi_{k+1})$ is finite (only $(n!)^2$), there are permutations $\bar{\pi}$ and $\bar{\pi}^{+1}$ and an infinite subset $\hat{K} \subset K$ such that for all $k \in \hat{K}$, $\pi_k = \bar{\pi}$ and $\pi_{k+1} = \bar{\pi}^{+1}$. For convenience, we define
$$w^{k,n+i} = w^{k+1,i}, \qquad \bar{\pi}(n+i) = \bar{\pi}^{+1}(i), \qquad \text{for all } i = 1, \ldots, n.$$
Since (3) is convex, (32) implies that $\nabla f(\bar{w}) \neq 0$. Since
$$\{1, \ldots, n\} \subseteq \{\bar{\pi}(s), \ldots, \bar{\pi}(n), \bar{\pi}^{+1}(1) = \bar{\pi}(n+1), \ldots, \bar{\pi}^{+1}(n) = \bar{\pi}(2n)\},$$
we can find p to be the first element with $\nabla f(\bar{w})_{\bar{\pi}(p)} \neq 0$. We consider the update from $w^{k,s}$ to $w^{k,s+1}$. Without loss of generality, we assume $\bar{\pi}(s) \neq \bar{\pi}(p)$, so
$$\nabla f(\bar{w})_{\bar{\pi}(s)} = 0. \qquad (33)$$
In Section 2 we used the notation $D_i(z)$ and omitted $w^{k,i}$, even though $D_i(z) = f(w^{k,i} + z e_i)$. Here we also need $f(\bar{w} + z e_i)$. To clearly distinguish the two, we use the definitions
$$D_i(z; w^{k,s}) = f(w^{k,s} + z e_i) \quad \text{and} \quad D_i(z; \bar{w}) = f(\bar{w} + z e_i).$$
From Algorithm 2 and $\pi_k = \bar{\pi}$,
$$w_i^{k,s+1} = \begin{cases} w_i^{k,s} & \text{if } i \neq \bar{\pi}(s), \\ w_i^{k,s} - \alpha^{k,s}\, \dfrac{D_{\bar{\pi}(s)}'(0; w^{k,s})}{D_{\bar{\pi}(s)}''(0; w^{k,s})} & \text{if } i = \bar{\pi}(s), \end{cases} \qquad (34)$$
where $\alpha^{k,s} \in (0, 1]$ is the line search factor $\lambda$. By the definition of the generalized second derivative in (8), $D_i''(0; w) \ge 1$ for all i, so we have
$$\frac{\alpha^{k,s}}{D_{\bar{\pi}(s)}''(0; w^{k,s})} \le 1. \qquad (35)$$
From (33),
$$\lim_{k \in \hat{K}} D_{\bar{\pi}(s)}'(0; w^{k,s}) = \lim_{k \in \hat{K}} \nabla f(w^{k,s})_{\bar{\pi}(s)} = 0. \qquad (36)$$
Using (34), (35) and (36),
$$\lim_{k \in \hat{K}} w^{k,s} = \lim_{k \in \hat{K}} w^{k,s+1} = \bar{w}.$$
By a similar derivation,
$$\lim_{k \in \hat{K}} w^{k,s} = \lim_{k \in \hat{K}} w^{k,s+1} = \cdots = \lim_{k \in \hat{K}} w^{k,p} = \bar{w}.$$
We define
$$\bar{\alpha} = \min\left( \frac{\beta}{\sigma + \frac{1}{2}H_{\bar{\pi}(p)}},\; \frac{\sigma\beta^2}{\big(\sigma + \frac{1}{2}H_{\bar{\pi}(p)}\big)^2} \right), \qquad (37)$$
where $\beta < 1$ is the line search ratio used in Algorithm 2. By Lemma 2 and the assumption $\nabla f(\bar{w})_{\bar{\pi}(p)} \neq 0$, we have
$$D_{\bar{\pi}(p)}\big(-\bar{\alpha}\, D_{\bar{\pi}(p)}'(0; \bar{w});\; \bar{w}\big) < D_{\bar{\pi}(p)}(0; \bar{w}). \qquad (38)$$
By Lemma 4,
$$\alpha^{k,p} \ge \frac{\beta\, D_{\bar{\pi}(p)}''(0; w^{k,p})}{H_{\bar{\pi}(p)}/2 + \sigma}, \qquad (39)$$
so
$$
\begin{aligned}
D_{\bar{\pi}(p)}\big(-\bar{\alpha}\, D_{\bar{\pi}(p)}'(0; w^{k,p});\; w^{k,p}\big)
&\ge D_{\bar{\pi}(p)}(0; w^{k,p}) - \bar{\alpha}\,\big(D_{\bar{\pi}(p)}'(0; w^{k,p})\big)^2 + \frac{1}{2}\bar{\alpha}^2 \big(D_{\bar{\pi}(p)}'(0; w^{k,p})\big)^2 && \text{(by Lemma 1)} \\
&\ge D_{\bar{\pi}(p)}(0; w^{k,p}) - \frac{\sigma\beta^2}{\big(\sigma + \frac{1}{2}H_{\bar{\pi}(p)}\big)^2}\,\big(D_{\bar{\pi}(p)}'(0; w^{k,p})\big)^2 && \text{(from (37))} \\
&\ge D_{\bar{\pi}(p)}(0; w^{k,p}) - \sigma\,(\alpha^{k,p})^2\, \frac{\big(D_{\bar{\pi}(p)}'(0; w^{k,p})\big)^2}{\big(D_{\bar{\pi}(p)}''(0; w^{k,p})\big)^2}. && \text{(from (39))}
\end{aligned}
$$
With (10),
$$D_{\bar{\pi}(p)}(0; w^{k,p+1}) = D_{\bar{\pi}(p)}\Big(-\alpha^{k,p}\, \frac{D_{\bar{\pi}(p)}'(0; w^{k,p})}{D_{\bar{\pi}(p)}''(0; w^{k,p})};\; w^{k,p}\Big) \le D_{\bar{\pi}(p)}(0; w^{k,p}) - \sigma\,(\alpha^{k,p})^2\, \frac{\big(D_{\bar{\pi}(p)}'(0; w^{k,p})\big)^2}{\big(D_{\bar{\pi}(p)}''(0; w^{k,p})\big)^2} \le D_{\bar{\pi}(p)}\big(-\bar{\alpha}\, D_{\bar{\pi}(p)}'(0; w^{k,p});\; w^{k,p}\big). \qquad (40)$$
Taking the limit of (40) over $k \in \hat{K}$,
$$D_{\bar{\pi}(p)}\big(-\bar{\alpha}\, D_{\bar{\pi}(p)}'(0; \bar{w});\; \bar{w}\big) \ge D_{\bar{\pi}(p)}(0; \bar{w}), \qquad (41)$$
which contradicts (38).

A.6 Proof of Theorem 6

Proof. To begin, we define the 1-norm and the 2-norm of a vector $w \in R^n$:
$$\|w\|_1 = \sum_{i=1}^{n} |w_i|, \qquad \|w\|_2 = \sqrt{\sum_{i=1}^{n} w_i^2}.$$
The following inequality is useful for proving this theorem:
$$\|w\|_2 \le \|w\|_1 \le \sqrt{n}\, \|w\|_2, \qquad \forall w \in R^n.$$
Lemma 4 implies that the step size $|\lambda d|$ in Algorithm 2 is no less than $\beta\, |D_{\pi_k(i)}'(0)| / (H_{\pi_k(i)}/2 + \sigma)$. Let $H = \max(H_1, \ldots, H_n)$ and $\gamma = \frac{\beta}{\sigma + H/2}$. We have
$$\big| w_{\pi_k(i)}^{k,i+1} - w_{\pi_k(i)}^{k,i} \big| \ge \gamma\, \big| \nabla f(w^{k,i})_{\pi_k(i)} \big|, \qquad (42)$$
where we use the fact $D_{\pi_k(i)}'(0) = \nabla f(w^{k,i})_{\pi_k(i)}$.

Taking the summation of (42) from $i = 1$ to n, we have
$$\|w^{k+1} - w^k\|_1 \ge \gamma \sum_{i=1}^{n} \big| \nabla f(w^{k,i})_{\pi_k(i)} \big|. \qquad (43)$$
From (Mangasarian, 2002, Lemma 3.4), there is a positive constant K such that
$$\|\nabla f(x) - \nabla f(z)\|_2 \le K \|x - z\|_2, \qquad \forall x, z \in R^n.$$
Therefore,
$$\|\nabla f(w^{k,i}) - \nabla f(w^{k,1})\|_2 \le K \|w^{k,i} - w^{k,1}\|_2. \qquad (44)$$
By the definition of $w^{k,i}$ in (15),
$$\|w^{k,i} - w^{k,1}\|_2 = \sqrt{\sum_{t=1}^{i-1} \big(w_{\pi_k(t)}^{k,t+1} - w_{\pi_k(t)}^{k,t}\big)^2} \le \|w^{k+1} - w^k\|_2. \qquad (45)$$
From (44) and (45),
$$\|\nabla f(w^{k,i}) - \nabla f(w^{k,1})\|_2 \le K \|w^{k+1} - w^k\|_2.$$
Hence,
$$\big|\nabla f(w^{k,i})_{\pi_k(i)}\big| \ge \big|\nabla f(w^{k,1})_{\pi_k(i)}\big| - K \|w^{k+1} - w^k\|_2 \ge \big|\nabla f(w^{k,1})_{\pi_k(i)}\big| - K \|w^{k+1} - w^k\|_1.$$
With (43), we have
$$\|w^{k+1} - w^k\|_1 \ge \gamma \sum_{i=1}^{n} \Big( \big|\nabla f(w^{k,1})_{\pi_k(i)}\big| - K \|w^{k+1} - w^k\|_1 \Big) = \gamma\big( \|\nabla f(w^k)\|_1 - nK \|w^{k+1} - w^k\|_1 \big).$$
Therefore,
$$\|w^{k+1} - w^k\|_2 \ge \frac{1}{\sqrt{n}} \|w^{k+1} - w^k\|_1 \ge \frac{\gamma}{\sqrt{n}(1 + n\gamma K)} \|\nabla f(w^k)\|_1 \ge \frac{\gamma}{\sqrt{n}(1 + n\gamma K)} \|\nabla f(w^k)\|_2. \qquad (46)$$
From (Mangasarian, 2002, Lemma 3.4),
$$\|w^k - w^*\|_2 \le \|\nabla f(w^k) - \nabla f(w^*)\|_2 = \|\nabla f(w^k)\|_2. \qquad (47)$$
With (46),
$$\|w^{k+1} - w^k\|_2 \ge \tau \|w^k - w^*\|_2, \qquad \text{where } \tau = \frac{\gamma}{\sqrt{n}(1 + n\gamma K)}.$$
From (10),
$$f(w^k) - f(w^{k+1}) = \sum_{i=1}^{n} \big( f(w^{k,i}) - f(w^{k,i+1}) \big) \ge \sum_{i=1}^{n} \sigma \big(w_{\pi_k(i)}^{k,i+1} - w_{\pi_k(i)}^{k,i}\big)^2 = \sigma \|w^{k+1} - w^k\|_2^2 \ge \sigma \tau^2 \|w^k - w^*\|_2^2.$$
By (Mangasarian, 2002, Lemma 3.3), there is $\kappa > 0$ such that
$$f(w^k) - f(w^*) \le \kappa \|w^k - w^*\|_2^2. \qquad (48)$$
Therefore, we have
$$f(w^k) - f(w^{k+1}) \ge \frac{\sigma \tau^2}{\kappa}\, \big( f(w^k) - f(w^*) \big).$$
This is equivalent to
$$\big( f(w^k) - f(w^*) \big) + \big( f(w^*) - f(w^{k+1}) \big) \ge \frac{\sigma \tau^2}{\kappa}\, \big( f(w^k) - f(w^*) \big).$$
Finally, we have
$$f(w^{k+1}) - f(w^*) \le \Big(1 - \frac{\sigma \tau^2}{\kappa}\Big)\, \big( f(w^k) - f(w^*) \big).$$
Notice that $\kappa$ can be chosen large enough to ensure $\frac{\sigma \tau^2}{\kappa} < 1$, so Algorithm 1 converges linearly.

A.7 Proof of Theorem 7

Proof. First, we denote the expectation of a function g of a random variable y by
$$E_y(g(y)) = \sum_{y} P(y)\, g(y).$$
Then for any vector $x \in R^n$ and a random variable I with $P(I = i) = \frac{1}{n}$, $i \in \{1, \ldots, n\}$, we have
$$E_I(x_I^2) = \sum_{i=1}^{n} \frac{x_i^2}{n} = \frac{1}{n}\|x\|^2. \qquad (49)$$
At each iteration k ($k = 0, 1, \ldots$) of Algorithm 3, we randomly choose one index $i_k$ and update $w^k$ to $w^{k+1}$. The expected function value after iteration k can be represented as $E_{i_0, \ldots, i_{k-1}, i_k}\big(f(w^{k+1})\big)$. From (10), (42), (49), (47), and (48), we have
$$
\begin{aligned}
E_{i_0, \ldots, i_{k-1}} E_{i_k}\big( f(w^{k+1}) - f(w^k) \big)
&\le -\sigma\, E_{i_0, \ldots, i_{k-1}} E_{i_k}\big( |w_{i_k}^{k+1} - w_{i_k}^{k}|^2 \big) \\
&\le -\sigma \gamma^2\, E_{i_0, \ldots, i_{k-1}} E_{i_k}\big( |\nabla f(w^k)_{i_k}|^2 \big) \\
&= -\frac{\sigma \gamma^2}{n}\, E_{i_0, \ldots, i_{k-1}}\big( \|\nabla f(w^k)\|_2^2 \big) \\
&\le -\frac{\sigma \gamma^2}{n}\, E_{i_0, \ldots, i_{k-1}}\big( \|w^k - w^*\|_2^2 \big) \\
&\le -\frac{\sigma \gamma^2}{\kappa n}\, E_{i_0, \ldots, i_{k-1}}\big( f(w^k) - f(w^*) \big).
\end{aligned}
$$
This is equivalent to
$$E_{i_0, \ldots, i_k}\big(f(w^{k+1})\big) - f(w^*) \le \Big(1 - \frac{\sigma \gamma^2}{\kappa n}\Big)\, \Big( E_{i_0, \ldots, i_{k-1}}\big(f(w^k)\big) - f(w^*) \Big).$$
From the Markov inequality $P(|Z| \ge a) \le E(|Z|)/a$ and the fact $f(w^k) \ge f(w^*)$, we have
$$P\big( f(w^k) - f(w^*) \ge \epsilon \big) \le E\big( f(w^k) - f(w^*) \big)/\epsilon. \qquad (50)$$
To achieve an $\epsilon$-accurate solution with confidence $1 - \delta$, we need the right-hand side of (50) to be less than $\delta$. It therefore suffices that the iteration number k satisfies
$$E_{i_0, \ldots, i_{k-1}}\big(f(w^k)\big) - f(w^*) \le \big(f(w^0) - f(w^*)\big)\Big(1 - \frac{\sigma \gamma^2}{\kappa n}\Big)^{k} \le \epsilon\delta.$$
This is equivalent to
$$k \ge \frac{\log(\epsilon\delta) - \log\big(f(w^0) - f(w^*)\big)}{\log\Big(1 - \frac{\sigma \gamma^2}{\kappa n}\Big)} = O\Big(\log\Big(\frac{1}{\epsilon\delta}\Big)\Big).$$

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Journal of Machine Learning Research 9 (2008) 1369-1398 Submitted 1/08; Revised 4/08; Published 7/08 Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Kai-Wei Chang Cho-Jui

More information

A Dual Coordinate Descent Method for Large-scale Linear SVM

A Dual Coordinate Descent Method for Large-scale Linear SVM Cho-Jui Hsieh b92085@csie.ntu.edu.tw Kai-Wei Chang b92084@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National Taiwan University, Taipei 106, Taiwan S. Sathiya Keerthi

More information

A Dual Coordinate Descent Method for Large-scale Linear SVM

A Dual Coordinate Descent Method for Large-scale Linear SVM Cho-Jui Hsieh b92085@csie.ntu.edu.tw Kai-Wei Chang b92084@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National Taiwan University, Taipei 06, Taiwan S. Sathiya Keerthi

More information

Training Support Vector Machines: Status and Challenges

Training Support Vector Machines: Status and Challenges ICML Workshop on Large Scale Learning Challenge July 9, 2008 Chih-Jen Lin (National Taiwan Univ.) 1 / 34 Training Support Vector Machines: Status and Challenges Chih-Jen Lin Department of Computer Science

More information

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification

A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification JMLR: Workshop and Conference Proceedings 1 16 A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification Chih-Yang Hsia r04922021@ntu.edu.tw Dept. of Computer Science,

More information

Subsampled Hessian Newton Methods for Supervised

Subsampled Hessian Newton Methods for Supervised 1 Subsampled Hessian Newton Methods for Supervised Learning Chien-Chih Wang, Chun-Heng Huang, Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 10617, Taiwan Keywords: Subsampled

More information

Preconditioned Conjugate Gradient Methods in Truncated Newton Frameworks for Large-scale Linear Classification

Preconditioned Conjugate Gradient Methods in Truncated Newton Frameworks for Large-scale Linear Classification Proceedings of Machine Learning Research 1 15, 2018 ACML 2018 Preconditioned Conjugate Gradient Methods in Truncated Newton Frameworks for Large-scale Linear Classification Chih-Yang Hsia Department of

More information

An Improved GLMNET for L1-regularized Logistic Regression

An Improved GLMNET for L1-regularized Logistic Regression Journal of Machine Learning Research 13 (2012) 1999-2030 Submitted 5/11; Revised 1/12; Published 6/12 An Improved GLMNET for L1-regularized Logistic Regression Guo-Xun Yuan Chia-Hua Ho Chih-Jen Lin Department

More information

Trust Region Newton Method for Large-Scale Logistic Regression

Trust Region Newton Method for Large-Scale Logistic Regression Journal of Machine Learning Research 9 (2008) 627-650 Submitted 4/07; Revised 12/07; Published 4/08 Trust Region Newton Method for Large-Scale Logistic Regression Chih-Jen Lin Department of Computer Science

More information

The Common-directions Method for Regularized Empirical Risk Minimization

The Common-directions Method for Regularized Empirical Risk Minimization The Common-directions Method for Regularized Empirical Risk Minimization Po-Wei Wang Department of Computer Science National Taiwan University Taipei 106, Taiwan Ching-pei Lee Department of Computer Sciences

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Incremental and Decremental Training for Linear Classification

Incremental and Decremental Training for Linear Classification Incremental and Decremental Training for Linear Classification Authors: Cheng-Hao Tsai, Chieh-Yen Lin, and Chih-Jen Lin Department of Computer Science National Taiwan University Presenter: Ching-Pei Lee

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

An Improved GLMNET for L1-regularized Logistic Regression

An Improved GLMNET for L1-regularized Logistic Regression An Improved GLMNET for L1-regularized Logistic Regression Guo-Xun Yuan Dept. of Computer Science National Taiwan University Taipei 106, Taiwan r96042@csie.ntu.edu.tw Chia-Hua Ho Dept. of Computer Science

More information

Change point method: an exact line search method for SVMs

Change point method: an exact line search method for SVMs Erasmus University Rotterdam Bachelor Thesis Econometrics & Operations Research Change point method: an exact line search method for SVMs Author: Yegor Troyan Student number: 386332 Supervisor: Dr. P.J.F.

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification

A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification Journal of Machine Learning Research () 383-334 Submitted /9; Revised 7/; Published / A Comparison of Optimization Methods and Software for Large-scale L-regularized Linear Classification Guo-Xun Yuan

More information

Large Scale Semi-supervised Linear SVMs. University of Chicago

Large Scale Semi-supervised Linear SVMs. University of Chicago Large Scale Semi-supervised Linear SVMs Vikas Sindhwani and Sathiya Keerthi University of Chicago SIGIR 2006 Semi-supervised Learning (SSL) Motivation Setting Categorize x-billion documents into commercial/non-commercial.

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Advanced Topics in Machine Learning

Advanced Topics in Machine Learning Advanced Topics in Machine Learning 1. Learning SVMs / Primal Methods Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim, Germany 1 / 16 Outline 10. Linearization

More information

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM

Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments

More information

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

LIBLINEAR: A Library for Large Linear Classification

LIBLINEAR: A Library for Large Linear Classification Journal of Machine Learning Research 9 (2008) 1871-1874 Submitted 5/08; Published 8/08 LIBLINEAR: A Library for Large Linear Classification Rong-En Fan Kai-Wei Chang Cho-Jui Hsieh Xiang-Rui Wang Chih-Jen

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Machine Learning Lecture 6 Note

Machine Learning Lecture 6 Note Machine Learning Lecture 6 Note Compiled by Abhi Ashutosh, Daniel Chen, and Yijun Xiao February 16, 2016 1 Pegasos Algorithm The Pegasos Algorithm looks very similar to the Perceptron Algorithm. In fact,

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning

More information

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,

More information

Large-scale Linear RankSVM

Large-scale Linear RankSVM Large-scale Linear RankSVM Ching-Pei Lee Department of Computer Science National Taiwan University Joint work with Chih-Jen Lin Ching-Pei Lee (National Taiwan Univ.) 1 / 41 Outline 1 Introduction 2 Our

More information

Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM

Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM Supplement: Distributed Box-constrained Quadratic Optimization for Dual Linear SVM Ching-pei Lee LEECHINGPEI@GMAIL.COM Dan Roth DANR@ILLINOIS.EDU University of Illinois at Urbana-Champaign, 201 N. Goodwin

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

Iteration Complexity of Feasible Descent Methods for Convex Optimization

Iteration Complexity of Feasible Descent Methods for Convex Optimization Journal of Machine Learning Research 15 (2014) 1523-1548 Submitted 5/13; Revised 2/14; Published 4/14 Iteration Complexity of Feasible Descent Methods for Convex Optimization Po-Wei Wang Chih-Jen Lin Department

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Perceptron Revisited: Linear Separators. Support Vector Machines

Perceptron Revisited: Linear Separators. Support Vector Machines Support Vector Machines Perceptron Revisited: Linear Separators Binary classification can be viewed as the task of separating classes in feature space: w T x + b > 0 w T x + b = 0 w T x + b < 0 Department

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines vs for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines Ding Ma Michael Saunders Working paper, January 5 Introduction In machine learning,

More information

Large Scale Machine Learning with Stochastic Gradient Descent

Large Scale Machine Learning with Stochastic Gradient Descent Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June) Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning.

More information

A Revisit to Support Vector Data Description (SVDD)

A Revisit to Support Vector Data Description (SVDD) A Revisit to Support Vector Data Description (SVDD Wei-Cheng Chang b99902019@csie.ntu.edu.tw Ching-Pei Lee r00922098@csie.ntu.edu.tw Chih-Jen Lin cjlin@csie.ntu.edu.tw Department of Computer Science, National

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines

Nonlinear Support Vector Machines through Iterative Majorization and I-Splines Nonlinear Support Vector Machines through Iterative Majorization and I-Splines P.J.F. Groenen G. Nalbantov J.C. Bioch July 9, 26 Econometric Institute Report EI 26-25 Abstract To minimize the primal support

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Improvements to Platt s SMO Algorithm for SVM Classifier Design

Improvements to Platt s SMO Algorithm for SVM Classifier Design LETTER Communicated by John Platt Improvements to Platt s SMO Algorithm for SVM Classifier Design S. S. Keerthi Department of Mechanical and Production Engineering, National University of Singapore, Singapore-119260

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Chih-Jen Lin Department of Computer Science National Taiwan University Talk at International Workshop on Recent Trends in Learning, Computation, and Finance,

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Comments on the Core Vector Machines: Fast SVM Training on Very Large Data Sets

Comments on the Core Vector Machines: Fast SVM Training on Very Large Data Sets Journal of Machine Learning Research 8 (27) 291-31 Submitted 1/6; Revised 7/6; Published 2/7 Comments on the Core Vector Machines: Fast SVM Training on Very Large Data Sets Gaëlle Loosli Stéphane Canu

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Support Vector Machine I

Support Vector Machine I Support Vector Machine I Statistical Data Analysis with Positive Definite Kernels Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University for Advanced

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function

More information

Numerical Optimization Techniques

Numerical Optimization Techniques Numerical Optimization Techniques Léon Bottou NEC Labs America COS 424 3/2/2010 Today s Agenda Goals Representation Capacity Control Operational Considerations Computational Considerations Classification,

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Classification Logistic Regression

Classification Logistic Regression Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham

More information

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan

Support'Vector'Machines. Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan Support'Vector'Machines Machine(Learning(Spring(2018 March(5(2018 Kasthuri Kannan kasthuri.kannan@nyumc.org Overview Support Vector Machines for Classification Linear Discrimination Nonlinear Discrimination

More information

Midterm exam CS 189/289, Fall 2015

Midterm exam CS 189/289, Fall 2015 Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

A simpler unified analysis of Budget Perceptrons

A simpler unified analysis of Budget Perceptrons Ilya Sutskever University of Toronto, 6 King s College Rd., Toronto, Ontario, M5S 3G4, Canada ILYA@CS.UTORONTO.CA Abstract The kernel Perceptron is an appealing online learning algorithm that has a drawback:

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Large-scale machine learning Stochastic gradient descent

Large-scale machine learning Stochastic gradient descent Stochastic gradient descent IRKM Lab April 22, 2010 Introduction Very large amounts of data being generated quicker than we know what do with it ( 08 stats): NYSE generates 1 terabyte/day of new trade

More information

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong

A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES. Wei Chu, S. Sathiya Keerthi, Chong Jin Ong A GENERAL FORMULATION FOR SUPPORT VECTOR MACHINES Wei Chu, S. Sathiya Keerthi, Chong Jin Ong Control Division, Department of Mechanical Engineering, National University of Singapore 0 Kent Ridge Crescent,

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Multiclass Classification-1

Multiclass Classification-1 CS 446 Machine Learning Fall 2016 Oct 27, 2016 Multiclass Classification Professor: Dan Roth Scribe: C. Cheng Overview Binary to multiclass Multiclass SVM Constraint classification 1 Introduction Multiclass

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming

Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Iterative Reweighted Minimization Methods for l p Regularized Unconstrained Nonlinear Programming Zhaosong Lu October 5, 2012 (Revised: June 3, 2013; September 17, 2013) Abstract In this paper we study

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Statistical Inference with Reproducing Kernel Hilbert Space Kenji Fukumizu Institute of Statistical Mathematics, ROIS Department of Statistical Science, Graduate University for

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Support Vector Machine via Nonlinear Rescaling Method

Support Vector Machine via Nonlinear Rescaling Method Manuscript Click here to download Manuscript: svm-nrm_3.tex Support Vector Machine via Nonlinear Rescaling Method Roman Polyak Department of SEOR and Department of Mathematical Sciences George Mason University

More information

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge

An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge An Ensemble Ranking Solution for the Yahoo! Learning to Rank Challenge Ming-Feng Tsai Department of Computer Science University of Singapore 13 Computing Drive, Singapore Shang-Tse Chen Yao-Nan Chen Chun-Sung

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal

IPAM Summer School Optimization methods for machine learning. Jorge Nocedal IPAM Summer School 2012 Tutorial on Optimization methods for machine learning Jorge Nocedal Northwestern University Overview 1. We discuss some characteristics of optimization problems arising in deep

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information

Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization

Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization Ching-pei Lee Department of Computer Sciences University of Wisconsin-Madison Madison, WI 53706-1613, USA Kai-Wei

More information

Distributed Newton Methods for Deep Neural Networks

Distributed Newton Methods for Deep Neural Networks Distributed Newton Methods for Deep Neural Networks Chien-Chih Wang, 1 Kent Loong Tan, 1 Chun-Ting Chen, 1 Yu-Hsiang Lin, 2 S. Sathiya Keerthi, 3 Dhruv Mahaan, 4 S. Sundararaan, 3 Chih-Jen Lin 1 1 Department

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information