Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization


Ching-pei Lee
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA

Kai-Wei Chang
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA

Editor: Michael Mahoney

Abstract

In recent years, designing distributed optimization algorithms for empirical risk minimization (ERM) has become an active research topic, mainly because of the practical need to deal with the huge volume of data. In this paper, we propose a general framework for training an ERM model by solving its dual problem in parallel over multiple machines. Viewed as special cases of our framework, several existing methods can be better understood. Our method provides a versatile approach for many large-scale machine learning problems, including linear binary/multiclass classification, regression, and structured prediction. We show that our method, compared with existing approaches, enjoys global linear convergence for a broader class of problems and achieves faster empirical performance.

1. Introduction

With the rapid growth of data volume and model complexity, designing efficient learning algorithms has become increasingly important. Distributed optimization techniques, which decompose a large optimization problem into sub-problems and distribute the computational burden across multiple machines, have gained a great amount of interest. This type of approach is particularly useful when the optimization problem involves massive computation and/or the data set is stored on multiple machines because it cannot fit in the capacity of a single node. However, the communication cost and the asynchronous nature of the process challenge the design of efficient optimization algorithms in the distributed environment.

In this paper, we study distributed optimization algorithms for training machine learning models that can be represented under the regularized empirical risk minimization (ERM) framework. These models include binary/multi-class classification, regression, and structured prediction models, covering a variety of learning tasks. We specifically focus on linear models, which have been shown successful in dealing with large-scale data thanks to their efficiency and interpretability.^1 Given a set of training data $\{X_i\}_{i=1}^l$, $X_i \in \mathbb{R}^{n \times c_i}$, $c_i \in \mathbb{N}$, where $l, n > 0$ are the number of instances and the dimension of the model respectively, regularized ERM models solve the following optimization problem:

$$ \min_{w \in \mathbb{R}^n} \; f^P(w) := g(w) + \sum_{i=1}^{l} \xi_i\left(X_i^T w\right). \qquad (1) $$

^1. Linear models allow us to interpret the value of each feature from the model.
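To make the notation concrete, here is a minimal sketch (not from the paper; the function names and the squared-$\ell_2$ default regularizer are assumptions for illustration) of evaluating the primal objective (1):

```python
import numpy as np

def primal_objective(w, Xs, losses, reg=lambda w: 0.5 * w @ w):
    """Evaluate f^P(w) = g(w) + sum_i xi_i(X_i^T w) from Eq. (1).

    w:      model vector in R^n
    Xs:     list of l matrices, X_i of shape (n, c_i)
    losses: list of l callables, xi_i mapping R^{c_i} -> R
    reg:    the regularizer g; squared-l2 here as an assumed default
    """
    return reg(w) + sum(xi(X.T @ w) for X, xi in zip(Xs, losses))
```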

In the literature, $g$ and $\xi_i$ are usually called the regularization term and the loss term, respectively. We assume that $f^P$ is a proper, closed convex function. The definition in problem (1) is general and covers a variety of learning problems. To unify the definitions of different learning problems, we encode the true labels (i.e., $y_i \in \mathcal{Y}_i$) in the loss term $\xi_i$ and the input data $X_i$. For some learning problems, the space of $X_i$ is spanned by a set of variables whose size may vary for different $i$. Therefore, we represent $X_i$ as an $n \times c_i$ matrix. For example, in the part-of-speech tagging task, where each input sentence consists of several words, $c_i$ represents the number of words in the $i$-th sentence. We provide a detailed discussion of the loss terms for different learning problems in Section 6. Regarding the regularization term, common choices of $g$ include the squared-$\ell_2$ norm, the $\ell_1$ norm, and the elastic net (Zou and Hastie, 2005) that combines both.

In many applications, it is preferable to solve the Lagrange dual problem of problem (1), because the dual problem has better mathematical properties, making optimization easier. The Lagrange dual problem of (1) is

$$ \min_{\alpha \in \Omega} \; f(\alpha) := g^*(X\alpha) + \sum_{i=1}^{l} \xi_i^*(-\alpha_i), \qquad (2) $$

where

$$ X := [X_1, \dots, X_l], \quad \alpha := \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_l \end{pmatrix}, $$

$\alpha_i \in \mathbb{R}^{c_i}$ is the dual variable associated with $X_i$, and for any function $h(\cdot)$, $h^*(\cdot)$ is its convex conjugate:

$$ h^*(w) := \max_{z \in \operatorname{dom}(h)} \; z^T w - h(z), \quad \forall w. $$

The domain is defined to be $\Omega := \prod_{i=1}^{l} \operatorname{dom}(\xi_i^*) \subseteq \mathbb{R}^{\sum_{i=1}^{l} c_i}$. We use the following notations to simplify our description:

$$ \xi(X^T w) := \sum_{i=1}^{l} \xi_i(X_i^T w), \quad \xi^*(-\alpha) := \sum_{i=1}^{l} \xi_i^*(-\alpha_i), \quad G^*(\alpha) := g^*(X\alpha). $$

In the single-core case, optimization methods for the dual ERM problem (2) have been widely studied (see, for example, Yuan et al. (2012) for a review). However, most of the state-of-the-art single-core algorithms for the dual ERM problem are inherently sequential and hard to parallelize. Moreover, in multi-machine environments, the cost of communication is relatively high; therefore, it is usually the bottleneck for a distributed optimizer. As a result, adapting these single-core methods to distributed environments for solving (2) is nontrivial. In particular, it is important to reduce the communication cost to make the optimization efficient. Consequently, algorithms with faster convergence rates are desirable for distributed optimization, because fewer iterations imply fewer communication rounds and thus lower communication cost. We observe that without careful consideration of this issue, existing distributed dual solvers do not achieve satisfactory training speed.

Existing distributed algorithms like Chen et al. (2014); Zhuang et al. (2015); Lin et al. (2014); Zhang and Lin (2015); Lee et al. (2015b, 2017) often focus on using the gradient and/or the Hessian information to solve the primal form (1) of ERM, whose computation either can be parallelized naturally or requires only some modification in the implementation to adapt to distributed environments. However, in this paper, we show that with a careful design, dual approaches can be competitive with primal approaches and enjoy sound theoretical properties. Some methods focusing on theoretical communication efficiency reduce the number of communication rounds significantly but are impractical to use, because they either require additional assumptions on the distribution of data points or have excessively high computational cost per iteration.
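As a quick illustration of these conjugates (a standard worked example, not taken from the paper), the squared-$\ell_2$ regularizer and the hinge loss with $c_i = 1$ give

$$ g(w) = \frac{1}{2}\|w\|^2 \;\Rightarrow\; g^*(v) = \max_z \; z^T v - \frac{1}{2}\|z\|^2 = \frac{1}{2}\|v\|^2, \qquad \nabla g^*(v) = v, $$

$$ \xi_i(z) = \max(0, 1 - z) \;\Rightarrow\; \xi_i^*(-\alpha_i) = -\alpha_i, \quad \text{with } -\alpha_i \in \operatorname{dom}(\xi_i^*) \text{ iff } \alpha_i \in [0, 1], $$

so that (2) recovers the familiar (up to scaling) box-constrained quadratic dual of the SVM.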

On the contrary, our focus in this work is designing a practical algorithm with strong empirical performance in terms of overall running time for practitioners. In particular, it is well known that first-order methods can achieve better theoretical iteration complexity, but second-order methods like truncated Newton or (limited-memory) BFGS tend to need fewer iterations in practice, leading to relatively short running time. We therefore design a practically efficient approach that is similar to the second-order methods but without requiring lengthy rounds of communication for computing an (approximate) Hessian at each iteration.

Our distributed learning algorithm aims to solve (2). At each iteration, our framework minimizes a problem consisting of a second-order approximation of the non-separable $g^*(X\alpha)$ term in (2), with the quadratic term being a positive semidefinite block-diagonal matrix, plus the original separable loss conjugate $\xi^*(-\alpha)$. We consider the setting where the training instances are distributed across $K$ machines, and the instances on machine $k$ are $\{X_i\}_{i \in J_k}$. In our setting, the $J_k$ are disjoint index sets such that

$$ \bigcup_{k=1}^{K} J_k = \{1, \dots, l\}, \quad J_i \cap J_k = \emptyset, \; \forall i \neq k. \qquad (3) $$

No further assumption on how those instances are distributed across machines is imposed. Therefore, the data distributions on different machines can be totally different. Without loss of generality, we assume that

$$ j_0 := 0, \quad j_K := l, \quad J_k := \{j_{k-1} + 1, \dots, j_k\}, \quad k = 1, \dots, K. $$

For any vector $v \in \mathbb{R}^l$, $v_{J_k}$ denotes the sub-vector in $\mathbb{R}^{|J_k|}$ that contains the coordinates of $v$ indexed by $J_k$. When not specified, $\|\cdot\|$ always denotes the Euclidean norm.

We discuss how to choose the approximation to make our method efficient in terms of empirical convergence and communication. After solving the approximated problem, we conduct a line search that requires only negligible computational cost and $O(1)$ communication to ensure sufficient decrease of the function value. With the line search procedure, our algorithm achieves faster empirical performance. By utilizing a relaxed condition, our method is able to achieve global linear convergence for many problems whose dual objective is non-strongly convex, including the popular support vector machine (SVM) model proposed by Boser et al. (1992); Vapnik (1995), and the structured SVM (SSVM) problem (Tsochantaridis et al., 2005; Taskar et al., 2004). In other words, our algorithm requires only $O(\log(1/\epsilon))$ iterations, or equivalently rounds of communication, to obtain an $\epsilon$-accurate solution for (2). The theoretical analysis further shows that obtaining an $\epsilon$-accurate solution for the original ERM problem (1) likewise takes only $O(\log(1/\epsilon))$ iterations. We also discuss the main differences between our approach and existing distributed algorithms for (2). Experiments demonstrate that our algorithm is significantly faster than existing distributed solvers for (2).

We note that Zheng et al. (2017) recently proposed an accelerated method for solving the dual ERM problem in a distributed setting. Their acceleration technique from Shalev-Shwartz and Zhang (2016) is similar to the catalyst framework for convex optimization (Lin et al., 2015). In essence, at every iteration, they add a term $(\kappa/2)\|w - z\|_2^2$ to (1) and approximately solve the dual problem of the modified primal problem by existing distributed methods for (2). The solution is then used to generate $z$ for the next iteration.
Like the catalyst framework, which can be combined with any convex optimization algorithm, the acceleration technique in Zheng et al. (2017) can also be applied on top of our framework easily by letting the solver of the modified dual problem be our algorithm. Therefore, we focus our comparison on methods that directly solve the original optimization problem (2), since methods that are faster for the original problem are expected to perform better as well when combined with the acceleration technique. Special cases of our framework were published earlier as conference and workshop papers (Lee and Roth, 2015; Lee et al., 2015a). In this work we unify the results and extend the previous work to a general setting. We also provide more thorough theoretical and empirical analysis. The remainder of the paper is organized as follows. We give an overview of the algorithm in Section 2.

The implementation details for distributed environments are shown in Section 3. In Section 4, we analyze the convergence properties of our algorithm. Section 5 discusses related works for distributed ERM optimization. The application of our algorithm to different ERM problems is described in Section 6. Numerical experiments are provided in Section 7. Some possible extensions and limitations are discussed in Section 8. We then make some concluding remarks in Section 9. The program used in this work is available at

2. Algorithm

Throughout this work, we make the following assumption.

Assumption 2.1 There exists $\sigma > 0$ such that the regularizer $g$ in the primal problem (1) is $\sigma$-strongly convex. Namely,

$$ g(\alpha w_1 + (1-\alpha) w_2) \le \alpha g(w_1) + (1-\alpha) g(w_2) - \frac{\sigma \alpha (1-\alpha)}{2} \|w_1 - w_2\|^2, \quad \forall w_1, w_2 \in \mathbb{R}^n, \; \alpha \in [0, 1]. \qquad (4) $$

Note that the goal of solving the dual problem is to obtain a solution to the original primal problem (1). It can easily be shown that strong duality between (1) and (2) holds, which means that any pair of primal and dual optimal solutions $(w^*, \alpha^*)$ satisfies the following relation:

$$ f^P(w^*) = -f(\alpha^*). $$

By (Hiriart-Urruty and Lemaréchal, 2001, Part E, Theorem 4.2.2), since $g$ is $\sigma$-strongly convex, $g^*$ is differentiable and has $(1/\sigma)$-Lipschitz continuous gradient everywhere. This also indicates that even if $g$ is extended-valued, $g^*$ is still finite everywhere; hence the only constraint on the feasible region is $\alpha \in \Omega$. From the KKT optimality conditions, we further have

$$ w^* = \nabla g^*(X \alpha^*). \qquad (5) $$

Although (5) only holds at the optimum, we can still define $w(\alpha)$ as the primal solution associated with $\alpha$ by the same relation for any $\alpha$ feasible for (2):

$$ w(\alpha) := \nabla g^*(X \alpha). \qquad (6) $$

Our algorithm is an iterative descent method for solving (2). Starting with an arbitrary feasible $\alpha^0$, it generates a sequence of iterates $\{\alpha^1, \alpha^2, \dots\}$ with the property that $f(\alpha^i) \le f(\alpha^j)$ if $i > j$. The iterates are updated by a direction $\Delta\alpha^t$ and a step size $\eta_t \ge 0$:

$$ \alpha^{t+1} = \alpha^t + \eta_t \Delta\alpha^t, \quad \forall t \ge 0. \qquad (7) $$

In the objective function of (2), the elements in $\xi^*$ usually can be computed separately. Therefore, this term can be optimized directly in a coordinate-wise manner. However, $g^*$ tends to be complex and hard to optimize. Thus, we use a second-order approximation that results in an easier-to-solve problem as a surrogate. As we mentioned above, $g^*$ and thus $G^*$ are differentiable and the gradients are Lipschitz continuous. Given the current iterate $\alpha^t$, we solve

$$ \Delta\alpha^t := \arg\min_{\Delta\alpha} \; Q^{\alpha^t}_{B_t}(\Delta\alpha) := \nabla G^*(\alpha^t)^T \Delta\alpha + \frac{1}{2} (\Delta\alpha)^T B_t \Delta\alpha + \xi^*(-\alpha^t - \Delta\alpha) \qquad (8) $$

for some symmetric $B_t$, and then take $\Delta\alpha^t$ as the update direction. The matrix $B_t$ can vary over iterations and has a wide range of choices, depending on the goal. One usually wants to pick $B_t$ such that (8) is easy to solve, to reduce the cost per iteration, or such that $Q^{\alpha^t}_{B_t}(\Delta\alpha^t)$ is a tighter approximation of $f(\alpha^t + \Delta\alpha^t) - f(\alpha^t)$, so that it takes fewer iterations to obtain a good solution. The extreme case of the former is to use a diagonal $B_t$, while the other extreme for the latter is using $B_t$ as the Hessian. We will show in Section 4 that as long as the whole objective of (8) is strongly convex and $B_t$ is positive semidefinite, the obtained $\Delta\alpha^t$ will be a descent direction.
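The overall scheme of (7)-(8) can be summarized by the following minimal single-machine sketch (an illustration only; `solve_subproblem` and `line_search` are hypothetical placeholders for the pieces developed in Sections 3.1-3.3):

```python
def block_diagonal_dual_method(alpha0, grad_Gstar, solve_subproblem,
                               line_search, iterations=100):
    """Sketch of the iterative scheme: direction from (8), update by (7).

    grad_Gstar(alpha)             -> nabla G*(alpha) = X^T nabla g*(X alpha)
    solve_subproblem(alpha, g)    -> minimizer of the quadratic model (8)
    line_search(alpha, direction) -> step size satisfying the Armijo rule (9)
    """
    alpha = alpha0
    for t in range(iterations):
        g = grad_Gstar(alpha)
        d_alpha = solve_subproblem(alpha, g)   # update direction from (8)
        eta = line_search(alpha, d_alpha)      # step size eta_t
        alpha = alpha + eta * d_alpha          # update (7)
    return alpha
```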

We then consider two different line search possibilities to obtain the step size $\eta_t$. The first is the exact line search:

$$ \eta_t = \arg\min_{\eta \in \mathbb{R}} \; f(\alpha^t + \eta \Delta\alpha^t). $$

However, this approach can be conducted only when the objective function has a specific structure that allows it. In the general case, we consider a backtracking line search with the Armijo rule. Given $\beta \in (0, 1)$, $\tau \in (0, 1)$, this procedure finds the smallest $i \ge 0$ such that $\eta = \beta^i$ satisfies

$$ f(\alpha^t + \eta \Delta\alpha^t) \le f(\alpha^t) + \eta \tau \Delta_t, \qquad (9) $$

where

$$ \Delta_t := \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t), \qquad (10) $$

and take $\eta_t = \eta$.

3. Distributed Implementation for Dual ERM

We now discuss how to adapt the algorithmic framework in Section 2 to make distributed optimization for the dual ERM problem efficient. In particular, we will discuss the choice of $B_t$ in (8), the trick of making line search efficient, and how to solve (8) with the minimum amount of communication. For the ease of algorithm description, we will use the following representation of $X$ and $\alpha$:

$$ X = [x_1, \dots, x_N], \quad \alpha = (\alpha_1, \dots, \alpha_N)^T, \qquad (11) $$

where $N := \sum_{i=1}^{l} c_i$ is the number of columns of $X$ as well as the dimension of $\alpha$. The index sets for the representation (11) corresponding to $J_k$ are denoted by $\bar{J}_k \subseteq \{1, \dots, N\}$, $k = 1, \dots, K$. We define $\pi(i) = k$ if $i \in \bar{J}_k$.

3.1 Update Direction

In this section, we discuss how to select $B_t$ such that the objective of (8) is strongly convex, the sub-problem can be solved efficiently without communication, and the resulting problem is a good approximation of (2). Notice that we distribute the data to multiple machines, and the $k$-th machine only stores and handles $X_i$ and the corresponding $\alpha_i$ for $i \in J_k$. Therefore, in order to reduce the communication cost, we need to pick $B_t$ in a way that (8) can be decomposed into independent sub-problems, and each sub-problem only involves a subset of the data. In this way, each sub-problem can be solved locally on one machine without communicating with others. According to the partition (3), we should consider a block-diagonal $B_t$ such that

$$ (B_t)_{i,j} = 0, \quad \text{if } \pi(i) \neq \pi(j). \qquad (12) $$

Any $B_t$ satisfying (12) falls in the block-diagonal approximation framework of our method. In most cases, we would like to incorporate higher-order information as much as possible to obtain fast convergence, which will then reduce the rounds of expensive communication. A natural choice would then be the Hessian matrix

$$ H_{\alpha^t} := \nabla^2 G^*(\alpha^t) = X^T \nabla^2 g^*(X \alpha^t) X. \qquad (13) $$

However, the Hessian matrix is usually dense and does not satisfy (12). Therefore, we consider its block-diagonal approximation $\tilde{H}_{\alpha^t}$:

$$ (\tilde{H}_{\alpha^t})_{i,j} = \begin{cases} (H_{\alpha^t})_{i,j} & \text{if } \pi(i) = \pi(j), \\ 0 & \text{otherwise}. \end{cases} \qquad (14) $$

When the function $g$ is complicated such that $\nabla^2 g^*$ cannot be calculated easily, one can use the identity matrix, or some diagonal approximation, as a substitute in (13). Note that since $(\nabla^2 G^*(\alpha^t))_{i,j} = x_i^T \nabla^2 g^*(X\alpha^t) x_j$, if each machine maintains the information of $X\alpha^t$, the entries of (14) can be decomposed into parts each of which only requires information from the data enumerated in one index set $\bar{J}_k$. Thus the sub-problems can indeed be solved independently on the corresponding machines without communication.

Another concern with using the Hessian matrix is that it might be only positive semidefinite, so the requirement that the objective of (8) be strongly convex is not satisfied when $\xi^*(-\alpha)$ is non-strongly convex. In this case, we can add a damping term to $B_t$ to ensure strong convexity. Therefore, our choice of $B_t$ can be generalized to the following formulation:

$$ B_t = a_1^t \tilde{H}_{\alpha^t} + a_2^t I, \quad \text{for some } a_1^t, a_2^t \ge 0. \qquad (15) $$

The choices of $a_1^t$ and $a_2^t$ might depend on the problem structure and the application. In most cases, $a_2^t = 0$ should be considered, especially when it is known that either $\xi^*(-\alpha)$ is already strongly convex or $\tilde{H}_{\alpha^t}$ is positive definite. For $a_1^t$, practical results (Pechyony et al., 2011; Yang, 2013) suggest that $a_1^t \in [1, K]$ leads to fast convergence, while we prefer $a_1^t \equiv 1$ as it is a closer approximation of the Hessian matrix. One special case worth noticing is that when $a_1^t = 0$, our framework reduces to proximal gradient.

In solving (8) with our choice (15) and $a_1^t \neq 0$, each machine needs the information of $X\alpha^t$ to calculate both $(B_t)_{\bar{J}_k, \bar{J}_k}$ and

$$ \left( \nabla G^*(\alpha^t) \right)_{\bar{J}_k} = X_{\bar{J}_k}^T \nabla g^*(X \alpha^t). \qquad (16) $$

Therefore, after each iteration of updating $\alpha^t$, we need to synchronize the information

$$ v^t := X \alpha^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \alpha^t_j $$

through inter-machine communication. Communicating this $O(n)$ vector is more suitable than communicating either the Hessian or the whole $X$ together with $\alpha^t$. However, because we need the update direction to conduct line search, the better approach is to synchronize the following vector instead:

$$ \Delta v^t := X \Delta\alpha^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \Delta\alpha^t_j, \qquad (17) $$

and then update $v^{t+1} = v^t + \eta_t \Delta v^t$ locally after the step size $\eta_t$ is determined. More explanation of communicating this vector will be provided in the next section when we discuss the details of conducting line search.
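The synchronization of (17) is a single allreduce of an $n$-dimensional vector. A minimal sketch (assuming mpi4py and NumPy; names are illustrative, not from the paper's released code):

```python
import numpy as np
from mpi4py import MPI

def sync_delta_v(X_local, d_alpha_local, comm=MPI.COMM_WORLD):
    """Form the local part of Delta v = X Delta alpha from the columns
    stored on this machine, then allreduce so every machine holds the
    full O(n) vector of Eq. (17)."""
    dv = X_local @ d_alpha_local                   # local partial sum
    comm.Allreduce(MPI.IN_PLACE, dv, op=MPI.SUM)   # one O(n) communication
    return dv

# After the step size eta_t is chosen, v is updated locally:
# v += eta_t * dv, with no further communication.
```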

3.2 How to Conduct Line Search Efficiently

After the update direction $\Delta\alpha$ is decided by solving (8), we need to conduct a line search to ensure sufficient decrease of the function value. For the backtracking variant, on the right-hand side of (9), the first term is already available from the previous iteration, and we thus only need to calculate (10). From (16), this value can be calculated by

$$ \Delta_t = \nabla g^*(v^t)^T \Delta v^t + \left( \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \right). \qquad (18) $$

To obtain the terms related to $\xi^*$, we need an $O(1)$ communication, and no additional computation is needed because the information of $\xi^*(-\alpha^t - \Delta\alpha^t)$ can easily be maintained in the process of solving (8). For the first term in (18), because each machine has full information of $v^t$, $\Delta v^t$, and hence $\nabla g^*(v^t)$, the calculation can be conducted distributedly. Thus we can combine the local partial sums of all terms in (18) into one scalar to communicate. One can also see from this calculation that synchronizing $\Delta v^t$ is inevitable for computing the required value efficiently.

For the left-hand side of (9), we use the form (2) to illustrate. The $\xi_i^*$ terms are calculated distributedly in nature, where normally each term only costs $O(1)$. For the $g^*(v)$ term, if it is separable, then the computation is also parallelizable. Further, in some special cases we may even have a closed-form solution to calculate it in $O(1)$ time for different $\eta$ if some values are precomputed. For example, when $g^*(v) = \frac{1}{2}\|v\|^2$, we have

$$ g^*(v + \eta \Delta v) = \frac{1}{2} \left( \|v\|^2 + \eta^2 \|\Delta v\|^2 + 2 \eta v^T \Delta v \right). \qquad (19) $$

In this case we can compute $\|\Delta v\|^2$ and $v^T \Delta v$ distributedly in advance; then the calculation of (19) with different $\eta$ requires only $O(1)$ computation and does not need any communication. For the general case, though the computation might not be this low, by maintaining both $v$ and $\Delta v$, the calculation of $g^*(v + \eta \Delta v)$ requires no more communication but only at most $O(n)$ computation locally, and this cost is negligible as other parts of the algorithm already involve more expensive computations. The line search procedure is summarized in Algorithm 1.

The exact line search variant is possible only when $\partial f(\alpha + \eta \Delta\alpha) / \partial \eta = 0$ has an analytic solution, which requires both $g^*$ and $\xi^*$ to be differentiable at least in some open set. For example, when the objective $f$ is quadratic, the optimal step size can be obtained by

$$ \frac{\partial f(\alpha + \eta \Delta\alpha)}{\partial \eta} = 0 \quad \Rightarrow \quad \eta = -\frac{\nabla f(\alpha)^T \Delta\alpha}{\Delta\alpha^T \nabla^2 f(\alpha) \Delta\alpha}, \qquad (20) $$

and we then project back to the largest $\eta \in [0, 1]$ such that $\alpha + \eta \Delta\alpha \in \Omega$. Another way to approximate the exact line search is to consider a bisection method if the above assumption for an analytic solution does not hold. In this case, we can utilize the trick of maintaining both $v$ and $\Delta v$ mentioned above to reduce the communication cost of re-evaluating the objective value.

Algorithm 1: Distributed backtracking line search
  Input: $\alpha, \Delta\alpha \in \mathbb{R}^N$, $\beta \in (0,1)$, $\tau \in (0,1)$, $f(\alpha) \in \mathbb{R}$, $v = X\alpha$, $\Delta v = X\Delta\alpha$
  Form a partition $\{\hat{J}_k\}_{k=1}^{K}$ of $\{1, \dots, n\}$
  Distributedly calculate  // $O(1)$ communication
    $\Delta_t = \sum_{k=1}^{K} \left( \nabla g^*(v^t)_{\hat{J}_k}^T (\Delta v^t)_{\hat{J}_k} + \sum_{j \in \bar{J}_k} \left( \xi_j^*(-(\alpha_j + \Delta\alpha_j)) - \xi_j^*(-\alpha_j) \right) \right)$
  $\eta \leftarrow 1$
  Calculate $f(\alpha + \eta \Delta\alpha)$ using $v$ and $\eta \Delta v$  // $O(1)$ communication
  while $f(\alpha + \eta \Delta\alpha) > f(\alpha) + \eta \tau \Delta_t$ do
    $\eta \leftarrow \eta \beta$
    Calculate $f(\alpha + \eta \Delta\alpha)$ using $v$ and $\eta \Delta v$  // $O(1)$ communication
  end
  Output: $\eta$, $f(\alpha + \eta \Delta\alpha)$
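For the special case $g^*(v) = \frac{1}{2}\|v\|^2$, the backtracking loop of Algorithm 1 can be written with $O(1)$ work per trial step size, as the following minimal sketch shows ($v$ and $dv$ are NumPy vectors; the helper `xi_term` stands in for the distributed $\xi^*$ sums and is a hypothetical name, not an excerpt of the paper's code):

```python
def backtracking_step(f_alpha, Delta_t, xi_term, v, dv, beta=0.5, tau=0.1):
    """Find eta = beta^i satisfying the Armijo rule (9), using the
    precomputed scalars of Eq. (19) so each trial costs O(1).

    Delta_t : the quantity (10)/(18); negative for a descent direction
    xi_term : eta -> sum_i xi_i*(-(alpha + eta * d_alpha)_i) (hypothetical)
    """
    vv, dvdv, vdv = v @ v, dv @ dv, v @ dv      # computed once, Eq. (19)
    eta = 1.0
    while True:
        g_star = 0.5 * (vv + eta * eta * dvdv + 2.0 * eta * vdv)  # g*(v + eta dv)
        if g_star + xi_term(eta) <= f_alpha + eta * tau * Delta_t:
            return eta
        eta *= beta
```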

3.3 Local Solver at Each Machine

When we pick $B_t$ satisfying (12), problem (8) can be decomposed into $K$ independent sub-problems of the following form:

$$ \min_{\Delta\alpha_{\bar{J}_k} \in \mathbb{R}^{|\bar{J}_k|}} \; \nabla G^*(\alpha^t)_{\bar{J}_k}^T \Delta\alpha_{\bar{J}_k} + \frac{1}{2} \Delta\alpha_{\bar{J}_k}^T (B_t)_{\bar{J}_k, \bar{J}_k} \Delta\alpha_{\bar{J}_k} + \sum_{i \in J_k} \xi_i^*\left( -\alpha^t_i - \Delta\alpha_i \right). \qquad (21) $$

Note that since all the information needed in (21) is available on machine $k$, the sub-problems can be solved without any inter-machine communication. Our framework does not limit the solver for (8). Instead, one can take an arbitrary local solver that is suitable for the specific problem. For example, we can consider (block) coordinate descent, (accelerated) proximal methods, and so on. Because each sub-problem is very close to the original dual problem of regularized ERM shown in (2), the cost is also similar, and one can consider methods that are efficient in the single-machine setting. In our experiments, we will use random-permuted cyclic coordinate descent for the dual ERM problem (Hsieh et al., 2008; Yu et al., 2011; Chang and Yih, 2013) as our local solver, since this approach has proven empirically efficient in the single-core setting. Note that although the theoretical convergence rate of cyclic coordinate descent can be up to $O(l^2)$ times worse than that of its randomized, non-cyclic counterpart (Sun and Ye, 2016), its random-permuted version is known to behave similarly to the randomized, non-cyclic version, while preserving the deterministic convergence guarantee of cyclic coordinate descent. Other options can be adopted at one's discretion for specific problems or data sets, but such discussion is beyond the scope of this work.
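As an illustration of such a local solver, the following is a minimal random-permuted cyclic coordinate descent sketch for (21), specialized (as an assumption, e.g., for the SVM dual) to a box-constrained quadratic $\min_d \; p^T d + \frac{1}{2} d^T Q d$ subject to $\mathrm{lo} \le \alpha + d \le \mathrm{hi}$, where $Q$ is the local diagonal block of $B_t$ and $p$ the local gradient block:

```python
import numpy as np

def local_cd(Q, p, alpha, lo, hi, epochs=10, seed=0):
    """Random-permuted cyclic CD for a box-constrained quadratic
    sub-problem; a sketch of the local-solver style, not the paper's code."""
    rng = np.random.default_rng(seed)
    d = np.zeros_like(alpha)
    grad = p.copy()                         # gradient of the model: p + Q d
    for _ in range(epochs):
        for i in rng.permutation(len(d)):   # a fresh permutation per pass
            if Q[i, i] <= 0.0:
                continue
            # single-variable minimization, projected onto the box
            new_di = np.clip(d[i] - grad[i] / Q[i, i],
                             lo[i] - alpha[i], hi[i] - alpha[i])
            grad += Q[:, i] * (new_di - d[i])   # keep grad = p + Q d
            d[i] = new_di
    return d
```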

3.4 Output the Best Primal Solution

The proposed algorithm is a descent method for the dual problem (2). In other words, it is guaranteed that $f(\alpha^{t_1}) < f(\alpha^{t_2})$ provided $t_1 > t_2$. However, when we obtain the primal iterates $w^t$ from (6), there is no guarantee that the corresponding primal objective is strictly decreasing. Although we are able to prove that the primal objective converges R-linearly in Section 4, we still cannot guarantee that the latest primal iterate is the one with the lowest primal objective. This situation happens for all methods that solve the dual problem. To deal with this problem, we keep track of the primal objective of each iterate, and when the algorithm is terminated, we take the iterate with the lowest primal objective as the final model. This is known as the pocket approach in the literature of the perceptron algorithm (Gallant, 1990).

3.5 Stopping Criterion

If we consider only the sub-optimality of $f(\alpha^t)$, then we can use some simple indicators of sub-optimality, such as the norm of the update direction or the size of $\Delta_t$, whose values are zero at the optima, as a practical stopping condition. However, since we have already computed the primal objective at each iteration as discussed above, we can directly use it as the stopping criterion with no additional cost, and this criterion should be a more direct indicator of the model quality than others. Therefore, we consider the following stopping criterion:

$$ f(\alpha^t) + f^P(w(\alpha^t)) \le \epsilon \left( f(\alpha^0) + f^P(w(\alpha^0)) \right), $$

with some user-specified $\epsilon \ge 0$. The overall procedure for distributedly optimizing (2) discussed in this section is described in Algorithm 2.

Algorithm 2: Distributed block-diagonal approximation method for the dual ERM problem (2)
  Input: $a_1, a_2 \ge 0$ but not both $0$, a feasible $\alpha^0$ for (2), $\epsilon \ge 0$
  $\bar{f} \leftarrow \infty$, $\bar{w} \leftarrow 0$
  Compute $v^0 = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \alpha^0_j$ and $\xi^*(-\alpha^0)$  // $O(n)$ communication
  Compute $f(\alpha^0)$ from $v^0$ and $\xi^*(-\alpha^0)$
  for $t = 0, 1, 2, \dots$ do
    Compute $f^P(w(\alpha^t))$ by (6)  // $O(1)$ communication
    if $f^P(w(\alpha^t)) < \bar{f}$ then
      $\bar{f} \leftarrow f^P(w(\alpha^t))$, $\bar{w} \leftarrow w(\alpha^t)$
    end
    if $f(\alpha^t) + f^P(w(\alpha^t)) \le \epsilon \left( f(\alpha^0) + f^P(w(\alpha^0)) \right)$ then
      Output $\bar{w}$ and terminate
    end
    Each machine obtains $\Delta\alpha^t_{\bar{J}_k}$ by solving the corresponding block of (8) independently and in parallel using the local data, with $B_t$ decided by (15)
    Communicate $\Delta v^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \Delta\alpha^t_j$  // $O(n)$ communication
    Variant I: conduct line search through Algorithm 1 to obtain $\eta_t$
    Variant II: solve $\eta_t \leftarrow \arg\min_{\eta} f(\alpha^t + \eta \Delta\alpha^t)$
    Each machine conducts in parallel: $\alpha^{t+1}_{\bar{J}_k} \leftarrow \alpha^t_{\bar{J}_k} + \eta_t \Delta\alpha^t_{\bar{J}_k}$, $v^{t+1} \leftarrow v^t + \eta_t \Delta v^t$
  end
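The pocket strategy of Section 3.4 and the stopping rule of Section 3.5 amount to the following bookkeeping around the main loop (a schematic sketch with hypothetical helpers, not the released implementation):

```python
import math

def track_and_stop(iterates, primal_obj, dual_obj, eps):
    """iterates yields (alpha^t, w(alpha^t)); primal_obj/dual_obj return
    f^P(w) and f(alpha). Keeps the best primal iterate (the 'pocket') and
    stops when the duality gap falls below eps times the initial gap."""
    f_best, w_best, gap0 = math.inf, None, None
    for alpha, w in iterates:
        fP, fD = primal_obj(w), dual_obj(alpha)
        if fP < f_best:                 # pocket: remember the best model
            f_best, w_best = fP, w
        gap = fD + fP                   # duality gap, since f^P(w*) = -f(alpha*)
        gap0 = gap if gap0 is None else gap0
        if gap <= eps * gap0:           # criterion of Section 3.5
            break
    return w_best
```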

3.6 Cost per Iteration

Here we summarize the cost per iteration of our algorithm. We will individually discuss the time spent on forming the quadratic problem using (15), solving (8), conducting the line search, and maintaining the best primal solution, and then summarize them together. For the ease of description and analysis, we assume that the data entries are distributed evenly across machines. Namely, we assume that the number of columns of $X$ on each machine is $O(N/K)$, and each machine has $O(\#\mathrm{nnz}/K)$ nonzero entries, where $\#\mathrm{nnz}$ is the number of non-zero elements in $X$. We will separate the costs for computation and for communication in our discussion.

In our cost analysis below, we do not make the assumptions of low computational cost for $f$ and $g^*$ that result from special cases such as $g(w) = \|w\|^2/2$ discussed above. Instead, we only assume that the computation of $g^*(v)$ and $\nabla g^*(v)$ costs $O(n)$, as further acceleration is a case-by-case possibility depending on the specific function structure. Therefore, as we do not assume any further structure of the problem, we only discuss the backtracking line search variant. The cost of the $\nabla^2 g^*(v)$ part is assumed to be negligible, for otherwise we simply replace it with a multiple of the identity matrix, as discussed in Section 3.1. We also assume that the cost of evaluating one $\xi_i^*$ is proportional to the dimension of its domain, namely $O(c_i)$, so the evaluation of $\xi^*$ costs $O(N/K)$ on each machine.

We first check the cost of forming the problem (8). Note that we do not explicitly compute the values of $B_t$ and $\nabla G^*(\alpha^t)$. Instead, we only compute $\nabla G^*(\alpha^t)^T \Delta\alpha^t$ through $\nabla g^*(v^t)^T \Delta v^t$, and the part $(\Delta\alpha)^T B_t \Delta\alpha$ under the choice (15) is obtained through $\|\Delta\alpha\|^2$ and $(\Delta v^t)^T \nabla^2 g^*(v^t) \Delta v^t$. Therefore, we only need to compute $\nabla g^*(v^t)$ in this step, which costs $O(n)$ under our assumption, given that $v^t$ is already available on all machines. Calculating $\nabla g^*(v^t)$ from $v^t$ costs $O(n)$, and by our assumption it takes the same effort to get $\nabla^2 g^*(v^t)$. Thus forming the problem (8) costs $O(n)$ in computation, and no communication is involved.

Next, the cost of solving (8) is $O(\#\mathrm{nnz}/K)$, as noted in most state-of-the-art single-core optimization methods for the dual ERM problem, for example, Hsieh et al. (2008); Yu et al. (2011), which is the cost of passing through the local data a constant number of times. This part also involves no communication.

For the line search part, as discussed in Section 3.2, we first need to make $\Delta v^t$ available on all machines. The calculation of $\Delta v^t$ through (17) costs $O(\#\mathrm{nnz}/K)$ computation time, and since the vector is of size $n$, it also takes $O(n)$ communication cost to gather information from all machines. Note that usually the value $\Delta v^t$ is also maintained throughout the process of solving the sub-problem (8), so it might not require additional cost to recompute it. Since here the asymptotic complexity is the same as that of solving the sub-problem, whether we count it or not does not affect the result. Now since $\Delta v^t$ is available on all machines, and we have $\nabla g^*(v^t)$ from the step of forming the sub-problem, we can calculate the first term of (18) in $O(n/K)$ computational cost and $O(1)$ communication cost. The term related to $\xi^*$ is a sum over $l$ individual functions and therefore costs $O(l/K)$, but it usually requires no additional computation as it can easily be maintained in the process of solving the sub-problems. Then summing them up requires an $O(1)$ communication that can be combined with the communication for obtaining $\nabla g^*(v^t)^T \Delta v^t$. Given $v^t$ and $\Delta v^t$, for each evaluation of $f$ under a different $\eta$, it takes only $O(n)$ to compute $v^t + \eta \Delta v^t$ and to evaluate the corresponding $g^*$, while the $\xi^*$ part costs only $O(N/K)$ in computation as it is a sum over $l$ individual functions, and the required communication is $O(1)$ to sum the function values up. Therefore, each backtracking line search iteration costs $O(n + N/K)$ computation and $O(1)$ communication.

Finally, from (6), the vector $w(\alpha^t)$ is the same as the gradient vector we need in (8), so there is no additional cost to obtain the primal iterate, and evaluating the primal objective costs $O(n)$ for $g(w(\alpha^t))$ and $O(\#\mathrm{nnz}/K)$ for $X^T w(\alpha^t)$ in computation. Thus the cost of the primal objective computation is $O(\#\mathrm{nnz}/K + n)$. It also takes $O(1)$ communication to gather the summation over the $\xi_i$.

By assuming that each row and each column of $X$ has at least one non-zero entry (for otherwise we can simply remove that row or column), we have $n + N = O(\#\mathrm{nnz})$. Thus in summary, each iteration of Algorithm 2 costs

$$ O\left( \frac{\#\mathrm{nnz}}{K} + n + \left( \frac{N}{K} + n \right) \cdot \#(\text{line search steps}) \right) $$

in computation and

$$ O\left( n + \#(\text{line search steps}) \right) $$

in communication. We will show in the next section that the number of line search steps is upper-bounded by a constant, so the overall cost per iteration is $O(\#\mathrm{nnz}/K + n)$ in computation and $O(n)$ in communication.
4. Analysis

In this section, we provide theoretical worst-case guarantees for our algorithm. Note that these worst cases barely happen empirically, and the analysis only serves as a certificate that even in these cases our algorithm converges with a certain rate, not as a suggestion that our algorithm always converges as slowly.

The focus of this work is not to tune the choice of $B_t$ to obtain the best theoretical results, which would reduce the algorithm to a simple proximal gradient method, but to propose a practical algorithm that works well empirically and has a worst-case guarantee. Therefore, the theoretical results may seem no better than those of the proximal gradient method, but we will show in Section 7 that the practical behavior of our algorithm is better than the state of the art. The reason is that existing iteration complexity analyses are all determined by the one iteration that has the worst possible progress in theory over the whole optimization procedure, and do not depict the overall behavior of an optimization method, which barely encounters the worst case in reality.

We start by showing that the update direction is indeed a descent one, and that the backtracking line search procedure terminates within a bounded number of steps.

Lemma 4.1 Given the current iterate $\alpha^t$, if $B_t$ is chosen so that $Q^{\alpha^t}_{B_t}$ is $c_2$-strongly convex for some $c_2 > 0$ and $c_1 + c_2 > 0$, where $c_1$ is the smallest eigenvalue of $B_t$, then the solution $\Delta\alpha^t$ obtained by solving (8) is a descent direction for $f$ at $\alpha^t$, namely $\Delta_t < 0$. In particular,

$$ \Delta_t \le -\frac{c_1 + c_2}{2} \|\Delta\alpha^t\|^2 \le 0. \qquad (22) $$

Moreover, the backtracking line search procedure in Algorithm 1 generates a step size satisfying (9) within $\max\left(0, \log_{\beta}\left( (1-\tau)\sigma(c_1 + c_2) / \|X^T X\| \right)\right)$ iterations, and the step size is lower bounded by

$$ \eta_t \ge \min\left( 1, \frac{\beta (1-\tau) \sigma (c_1 + c_2)}{\|X^T X\|} \right). \qquad (23) $$

Proof For any $\lambda \in [0, 1]$, we have

$$ Q^{\alpha^t}_{B_t}(\Delta\alpha^t) \le Q^{\alpha^t}_{B_t}(\lambda \Delta\alpha^t) \le \lambda Q^{\alpha^t}_{B_t}(\Delta\alpha^t) + (1 - \lambda) Q^{\alpha^t}_{B_t}(0) - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2 $$
$$ = \lambda \left( \nabla G^*(\alpha^t)^T \Delta\alpha^t + \frac{1}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) \right) + (1 - \lambda) \xi^*(-\alpha^t) - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2, $$

where in the first inequality we used that $\Delta\alpha^t$ is the minimizer of (8), and in the second one we used the strong convexity of $Q^{\alpha^t}_{B_t}$ (see (4)). Rearranging this inequality gives us

$$ (1 - \lambda) \left( \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \right) \le -\frac{1 - \lambda}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2. $$

Dividing both sides by $(1 - \lambda)$, we get

$$ \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \le -\frac{1}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t - \frac{c_2 \lambda}{2} \|\Delta\alpha^t\|^2 \le -\frac{c_1 + \lambda c_2}{2} \|\Delta\alpha^t\|^2. $$

Letting $\lambda \to 1$, we obtain (22). Note that in the last inequality of (22), equality holds if and only if $\Delta\alpha^t = 0$. This implies $\nabla G^*(\alpha^t)^T d + \xi^*(-\alpha^t - d) - \xi^*(-\alpha^t) > 0$ for all $d \neq 0$, meaning that $\alpha^t$ is the optimum of (2). Therefore, (22) indicates that $\Delta\alpha^t$ is a descent direction when the current iterate is not optimal, because from the Taylor expansion of $G^*$ we have

$$ f(\alpha^t + \eta \Delta\alpha^t) - f(\alpha^t) = \eta \nabla G^*(\alpha^t)^T \Delta\alpha^t + O(\eta^2) + \xi^*(-(\alpha^t + \eta \Delta\alpha^t)) - \xi^*(-\alpha^t) $$
$$ \le \eta \nabla G^*(\alpha^t)^T \Delta\alpha^t + O(\eta^2) + \eta \left( \xi^*(-(\alpha^t + \Delta\alpha^t)) - \xi^*(-\alpha^t) \right) = \eta \Delta_t + O(\eta^2), $$

which implies that when $\eta$ is small enough and positive, $f(\alpha^t + \eta \Delta\alpha^t)$ will be smaller than $f(\alpha^t)$.

Regarding the backtracking termination, from the $(1/\sigma)$-Lipschitz continuity of $\nabla g^*$ mentioned in Section 2 and convexity, we have that for any $\eta \in [0, 1]$ and any vector $d$,

$$ f(\alpha^t + \eta d) - f(\alpha^t) = G^*(\alpha^t + \eta d) - G^*(\alpha^t) + \xi^*(-\alpha^t - \eta d) - \xi^*(-\alpha^t) $$
$$ \le \eta \left( \nabla G^*(\alpha^t)^T d + \xi^*(-\alpha^t - d) - \xi^*(-\alpha^t) \right) + \frac{\eta^2}{2\sigma} \|X d\|^2. \qquad (24) $$

Taking $d = \Delta\alpha^t$ in (24), we see from (22) that it suffices to have $\eta \in [0, 1]$ such that

$$ \eta \Delta_t + \frac{\eta^2 \|X^T X\|}{2\sigma} \|\Delta\alpha^t\|^2 \le \eta \Delta_t - \frac{\eta^2 \|X^T X\|}{\sigma (c_1 + c_2)} \Delta_t \le \eta \tau \Delta_t \qquad (25) $$

to satisfy the Armijo rule (9). We directly see that (25) holds for any $\eta \in [0, 1]$ satisfying

$$ \eta \le \frac{(1 - \tau) \sigma (c_1 + c_2)}{\|X^T X\|}. $$

Therefore, under Algorithm 1, the generated step size is guaranteed to satisfy

$$ \eta_t \ge \min\left( 1, \frac{\beta (1 - \tau) \sigma (c_1 + c_2)}{\|X^T X\|} \right). $$

The lower bound on the step size stated in Lemma 4.1 is proportional to the inverse of the Lipschitz constant for $f$ and can be very small in theory. However, the reader should note that this is a worst-case guarantee that may be much smaller than the real step sizes obtained by Algorithm 1 on real-world problems. Later, in Section 7.3, we show empirically that the obtained step sizes are indeed mostly close to 1. In addition, as discussed in Section 3.2, the cost of backtracking line search is negligible. Therefore, even if the worst case happened, the time spent on backtracking line search would not affect the overall efficiency of our algorithm. This is verified empirically in Section 7.3 as well.

We now establish convergence for the dual problem (2), and will use this result to deduce the convergence of the primal iterates obtained by (6) for (1), which is the kind of convergence we really care about when solving ERM problems. We consider the case that, in addition to Assumption 2.1, either of the following assumptions holds.

Assumption 4.2 The function $\xi$ is differentiable and its gradient is $\rho$-Lipschitz continuous for some $\rho > 0$. That is,

$$ \|\nabla \xi(z_1) - \nabla \xi(z_2)\| \le \rho \|z_1 - z_2\|, \quad \forall z_1, z_2. $$

Assumption 4.3 The loss function $\xi$ is $L$-Lipschitz continuous for some $L > 0$:

$$ |\xi(z_1) - \xi(z_2)| \le L \|z_1 - z_2\|, \quad \forall z_1, z_2. $$

Moreover, the dual problem satisfies the Kurdyka-Łojasiewicz inequality with exponent $1/2$ (Łojasiewicz, 1963, 1993; Kurdyka, 1998) for some $\mu > 0$. That is,

$$ f(\alpha) - f^* \le \min_{\hat{s} \in \partial f(\alpha)} \frac{\|\hat{s}\|^2}{2\mu} = \min_{s \in \partial \xi^*(-\alpha)} \frac{\|\nabla G^*(\alpha) + s\|^2}{2\mu}, \quad \forall \alpha \in \Omega, \qquad (26) $$

where $f^*$ is the optimal objective value of the dual problem (2), and $\partial \xi^*(-\alpha)$ is the subdifferential of $\xi^*$ at $-\alpha$.

These assumptions, made on the sum $\xi$, are less strict than demanding that each $\xi_i$ satisfy certain properties. Note that from (Hiriart-Urruty and Lemaréchal, 2001, Part E, Theorem 4.2.1), when Assumption 4.2 holds, $\xi^*$ is $(1/\rho)$-strongly convex. We note that the conditions in Lemma 4.1 are weaker than those of most proximal Newton-type methods such as Tseng and Yun (2009); Lee et al. (2014), as we do not need $B_t$ to be positive definite, in which case $Q^{\alpha^t}_{B_t}$ is still strongly convex when Assumption 4.2 holds. In this situation we can have a broader choice of $B_t$. In Lemma 4.1, consider the choice of $B_t$ in (15): since $\tilde{H}_{\alpha^t}$ is positive semidefinite, we have $c_1 = a_2^t$. For $c_2$, if Assumption 4.2 holds, then since $\xi^*$ is $(1/\rho)$-strongly convex, we have $c_2 = c_1 + 1/\rho$, and otherwise $c_2 = c_1$.

We first show that Assumption 4.2 implies that (26) holds for the dual problem, by noting that, as discussed above, this assumption implies that the dual objective is strongly convex.

Lemma 4.4 If $f$ is $\mu$-strongly convex as defined in (4), then it satisfies (26) with the same $\mu$.

Proof For any $\lambda \in (0, 1]$ and any $\alpha_1, \alpha_2$, from the definition of strong convexity we have

$$ f(\alpha_1) - f(\alpha_2) - \frac{\mu (1 - \lambda)}{2} \|\alpha_1 - \alpha_2\|^2 \ge \frac{f(\lambda \alpha_1 + (1 - \lambda) \alpha_2) - f(\alpha_2)}{\lambda} \ge s^T (\alpha_1 - \alpha_2) $$

for all $s \in \partial f(\alpha_2)$, where the last inequality is from the convexity of $f$. Letting $\lambda \to 0^+$, we have

$$ f(\alpha_1) - f(\alpha_2) \ge s^T (\alpha_1 - \alpha_2) + \frac{\mu}{2} \|\alpha_1 - \alpha_2\|^2, \quad \forall s \in \partial f(\alpha_2). \qquad (27) $$

Now we fix $\alpha_2 = \alpha \in \Omega$ and minimize both sides of (27) with respect to $\alpha_1$ to get

$$ f^* - f(\alpha) \ge -\frac{\|s\|^2}{2\mu}, \quad \forall s \in \partial f(\alpha). $$

Since this holds for all $s \in \partial f(\alpha)$, we have proven (26) with parameter $\mu$, because $\partial f(\alpha) = \nabla G^*(\alpha) + \partial \xi^*(-\alpha)$.

We need the following definition in our convergence discussion.

Definition 4.5 Given any optimization problem

$$ \min_{x \in \mathcal{X}} f(x) \qquad (28) $$

whose minimum is attainable and denoted by $f^*$, we say that $x \in \mathcal{X}$ is an $\epsilon$-accurate solution for (28) if $f(x) - f^* \le \epsilon$.

Now we are ready to show the global linear convergence of our algorithm for solving (2).

Theorem 4.6 If (26) holds with $\mu > 0$ for the objective of (2), there exists $c_3 > 0$ such that $\|B_t\| \le c_3$ for all $t$, and the conditions in Lemma 4.1 are also satisfied for all iterations with some $c_1 \ge -c_3$ and $c_2$, then the iterates generated by Algorithm 2 converge Q-linearly, and to obtain an $\epsilon$-accurate solution for (2), it takes at most

$$ \max\left\{ O\left( \frac{ \left( \|X^T X\| / \sigma \right)^2 + c_3^2 }{ \mu (c_1 + c_2) \tau } \log\frac{1}{\epsilon} \right), \; O\left( \frac{ \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) \|X^T X\| }{ \sigma \mu (c_1 + c_2)^2 \beta \tau (1 - \tau) } \log\frac{1}{\epsilon} \right) \right\} $$

iterations.

Proof We first show the result for the backtracking line search variant. From (7), (22), and (9), we have

$$ f(\alpha^{t+1}) - f(\alpha^t) \le -\eta_t \tau \frac{c_1 + c_2}{2} \|\Delta\alpha^t\|^2. \qquad (29) $$

From the optimality of $\Delta\alpha^t$ in (8), we get

$$ \nabla G^*(\alpha^t) + B_t \Delta\alpha^t + s^{t+1} = 0 \qquad (30) $$

for some $s^{t+1} \in \partial \xi^*(-\alpha^t - \Delta\alpha^t)$. By convexity, the fact that the step size is in $[0, 1]$, and the condition (26), we have

$$ f(\alpha^{t+1}) - f^* \le \eta_t \left( f(\alpha^t + \Delta\alpha^t) - f^* \right) + (1 - \eta_t) \left( f(\alpha^t) - f^* \right) $$
$$ \le \eta_t \frac{\|\nabla G^*(\alpha^t + \Delta\alpha^t) + s^{t+1}\|^2}{2\mu} + (1 - \eta_t) \left( f(\alpha^t) - f^* \right). \qquad (31) $$

Now, to relate the first term to the decrease, we use (30) to get

$$ \|\nabla G^*(\alpha^t + \Delta\alpha^t) + s^{t+1}\|^2 = \|\nabla G^*(\alpha^t + \Delta\alpha^t) - \nabla G^*(\alpha^t) + \nabla G^*(\alpha^t) + s^{t+1}\|^2 $$
$$ \le 2 \|\nabla G^*(\alpha^t + \Delta\alpha^t) - \nabla G^*(\alpha^t)\|^2 + 2 \|B_t \Delta\alpha^t\|^2 \le 2 \left( \left( \frac{\|X^T X\|}{\sigma} \right)^2 + \|B_t\|^2 \right) \|\Delta\alpha^t\|^2, \qquad (32) $$

where in the second inequality we used $(a + b)^2 \le 2(a^2 + b^2)$ for all $a, b$, and in the last inequality we used the Lipschitz continuity of $\nabla G^*$. We therefore get the following by combining (31), (32), and (29):

$$ f(\alpha^{t+1}) - f^* \le \frac{\eta_t}{\mu} \left( \left( \frac{\|X^T X\|}{\sigma} \right)^2 + c_3^2 \right) \|\Delta\alpha^t\|^2 + (1 - \eta_t) \left( f(\alpha^t) - f^* \right) $$
$$ \le \frac{ 2 \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) }{ \mu (c_1 + c_2) \tau } \left( f(\alpha^t) - f(\alpha^{t+1}) \right) + (1 - \eta_t) \left( f(\alpha^t) - f^* \right). \qquad (33) $$

Let us define

$$ c_4 := \frac{ 2 \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) }{ \mu (c_1 + c_2) \tau }; $$

then rearranging (33) gives

$$ f(\alpha^{t+1}) - f^* \le \frac{1 - \eta_t + c_4}{1 + c_4} \left( f(\alpha^t) - f^* \right). \qquad (34) $$

Combining the above result with the lower bound on $\eta_t$ from (23) proves the global Q-linear convergence, and the number of iterations needed to get an $\epsilon$-accurate solution can be directly deduced from (34). Now consider the exact line search variant. This variant results in an objective no larger than the left-hand side of (34), so all the results hold for the exact line search variant as well, finishing the proof.

From Lemma 4.4, both Assumption 4.2 and Assumption 4.3 satisfy the conditions required in Theorem 4.6. We note that in the final expression (34), a larger step size suggests faster convergence for a fixed $c_4$. Therefore, it makes sense to conduct a line search to try to find a larger $\eta_t$ instead of using a fixed and conservative one that guarantees (9) can be satisfied.
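To see how (34) yields the iteration bound in Theorem 4.6, one can unroll the recursion with the constant step-size lower bound $\bar{\eta} := \min\{1, \beta(1-\tau)\sigma(c_1 + c_2)/\|X^T X\|\}$ from (23) (a standard calculation, spelled out here for clarity):

$$ f(\alpha^T) - f^* \le \left( 1 - \frac{\bar{\eta}}{1 + c_4} \right)^T \left( f(\alpha^0) - f^* \right) \le e^{-T \bar{\eta} / (1 + c_4)} \left( f(\alpha^0) - f^* \right), $$

so $T \ge \frac{1 + c_4}{\bar{\eta}} \log \frac{f(\alpha^0) - f^*}{\epsilon}$ iterations suffice; substituting the two cases of $\bar{\eta}$ and the definition of $c_4$ gives the two terms in the max of Theorem 4.6.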

On the other hand, one can tune the value of $c_4$ by selecting different $B_t$, but we should note that a $B_t$ that makes $c_1$ and $c_2$ larger may also make $c_3$ larger, and the contribution of $c_3$ to $c_4$ is quadratic, so the choice should be made cautiously.

However, the reader should note that the results in Theorem 4.6 are just worst-case iteration complexities, and empirically we often observe much faster convergence. In particular, methods of this type, which consider a quadratic approximation of the smooth term, normally have global iteration complexities worse even than that of the non-accelerated gradient method, as the proof techniques (implicitly) utilize the relation between the obtained update direction and the direction used in the gradient method. Moreover, this sort of iteration complexity result is controlled by the worst-case function value decrease in a single iteration, so it is not a good predictor of the overall behavior. For example, the global iteration complexities of the Newton method and of quasi-Newton methods are all worse than those of first-order methods in theory, as they only have better local convergence rates, but in practice they are much faster because the worst case is extremely rare in real-world problems. The same holds true for distributed optimization methods for (2). For example, the iteration complexity shown in Ma et al. (2017) is no better than that of a plain distributed implementation of the proximal gradient method, but the empirical performance is much better. Similarly, if we want to optimize $c_4$, then the best choice for (15) is $a_1^t = 0$, $a_2^t = O(\|X^T X\| / \sigma)$, which reduces our algorithm to the proximal gradient method. However, as one can expect, if the choice of $B_t$ is closer to the real Hessian of $G^*$, then the empirical convergence should be faster, as the approximation better models the real function-value decrease.

The next results link the above linear convergence result for the dual problem to the convergence rate of the primal problem (1).

Theorem 4.7 If Assumption 4.2 holds, then for any $\epsilon > 0$ and any $\epsilon$-accurate solution $\alpha$ for (2), the corresponding $w$ obtained through (6) is $\left( \epsilon (1 + \rho \|X^T X\| / \sigma) \right)$-accurate. If Assumption 4.3 holds, then for any $\epsilon > 0$ and any $\epsilon$-accurate solution $\alpha$ for (2), the corresponding $w$ obtained through (6) is $\left( \min\left\{ 4 \|X^T X\| L^2 / \sigma, \sqrt{8 \epsilon \|X^T X\| L^2 / \sigma} \right\} \right)$-accurate.

Proof Our proof consists of using $\alpha$ as the initial point, applying one step of some primal-dual algorithm, and then utilizing the algorithm-specific relation between the decrease in one iteration and the duality gap to obtain the bound. Finally, we notice that the decrease of the dual objective in one iteration of any algorithm is upper-bounded by the distance from the current objective to the optimum, and that the primal sub-optimality is upper-bounded by the duality gap. Therefore we obtain an algorithm-independent result from algorithm-specific results.

When Assumption 4.2 holds, the primal problem is of the type considered in Shalev-Shwartz and Zhang (2012), and we have that $\xi^*$ is $(1/\rho)$-strongly convex. If we take $\alpha$ as the initial point and apply one step of their method to obtain the next iterate $\alpha^+$, from (Shalev-Shwartz and Zhang, 2012, Lemma 1) we get

$$ \epsilon \ge f(\alpha) - f(\alpha^*) \ge f(\alpha) - f(\alpha^+) \ge s \left( f^P(w(\alpha)) + f(\alpha) \right) - \frac{s^2 G_s}{2\sigma} \ge s \left( f^P(w(\alpha)) - f^P(w^*) \right) - \frac{s^2 G_s}{2\sigma}, \quad \forall s \in [0, 1], \qquad (35) $$

where $w^*$ is the optimal solution of (1),

$$ G_s := \left( \|X^T X\| - \frac{\sigma (1 - s)}{s \rho} \right) \|u - \alpha\|^2, $$

and $u_i := \nabla \xi_i(X_i^T w(\alpha))$.

To remove the second term in (35), we set

$$ \|X^T X\| - \frac{\sigma (1 - s)}{s \rho} = 0 \quad \Rightarrow \quad s = \frac{\sigma}{\sigma + \rho \|X^T X\|} \in [0, 1]. $$

This then gives

$$ \left( 1 + \frac{\rho \|X^T X\|}{\sigma} \right) \epsilon \ge f^P(w(\alpha)) - f^P(w^*). $$

Note that in (Shalev-Shwartz and Zhang, 2012, Lemma 1), the result is for the expected value of the dual objective decrease at the current iteration and the expected duality gap of the previous iteration. However, for the initial point, the expected duality gap is a constant, and the expected function decrease cannot exceed the distance from the current objective to the optimum.

When Assumption 4.3 holds, the primal problem falls into the type of problems discussed in Bach (2015). If we take $\alpha$ as the initial point and apply one step of their method to obtain the next iterate $\alpha^+$, from the last inequality in the proof of Proposition 4.2 in Bach (2015) and weak duality,

$$ \epsilon \ge s \left( f^P(w(\alpha)) - f^P(w^*) \right) - \frac{(s R)^2}{2\sigma}, \quad \forall s \in [0, 1], \qquad (36) $$

where

$$ R^2 = \max_{\alpha, \hat{\alpha} \in \Omega} \|X (\alpha - \hat{\alpha})\|^2 \le \|X^T X\| \max_{\alpha, \hat{\alpha} \in \Omega} \|\alpha - \hat{\alpha}\|^2 = 4 \|X^T X\| L^2. \qquad (37) $$

In the last equality we used a corollary of Rockafellar (1970) stating that if $\phi(\cdot)$ is $L$-Lipschitz continuous, then the radius of $\operatorname{dom}(\phi^*)$ is no larger than $L$. The right-hand side of (36) is concave with respect to $s$; hence we can obtain its maximum by setting the partial derivative with respect to $s$ to zero. Denoting the maximizer by $\hat{s}$, this gives

$$ \hat{s} = \arg\max_{s \in [0, 1]} \; s \left( f^P(w) - f^P(w^*) \right) - \frac{(s R)^2}{2\sigma} = \min\left\{ 1, \frac{\sigma \left( f^P(w) - f^P(w^*) \right)}{R^2} \right\}. $$

If $\hat{s} = 1$, we have

$$ R^2 \le \sigma \left( f^P(w) - f^P(w^*) \right), $$

and thus

$$ f^P(w) - f^P(w^*) \le \frac{R^2}{\sigma}. $$

On the other hand, if $\hat{s} \neq 1$, we get

$$ 2 R^2 \epsilon \ge \sigma \left( f^P(w) - f^P(w^*) \right)^2. $$

These conditions and (37) indicate that

$$ f^P(w) - f^P(w^*) \le \min\left\{ \frac{R^2}{\sigma}, \sqrt{\frac{2 \epsilon R^2}{\sigma}} \right\} \le \min\left\{ \frac{4 \|X^T X\| L^2}{\sigma}, \sqrt{\frac{8 \epsilon \|X^T X\| L^2}{\sigma}} \right\}. $$

Corollary 4.8 If we apply Algorithm 2 to solve a regularized ERM problem that satisfies either Assumption 4.2 or Assumption 4.3, then the primal iterates $w^t$ obtained from the dual iterates $\alpha^t$ via (6) converge R-linearly.

Proof If Assumption 4.2 holds, Theorem 4.6 shows that the dual objective converges Q-linearly. Therefore, given any $\epsilon > 0$, from Theorem 4.7 it suffices to have an $\left( \epsilon / (1 + \rho \|X^T X\| / \sigma) \right)$-accurate dual solution to obtain an $\epsilon$-accurate primal solution, which takes

$$ O\left( \log \frac{\sigma + \rho \|X\|^2}{\sigma \epsilon} \right) = O\left( \log \frac{1}{\epsilon} \right) $$

iterations, showing the R-linear convergence of the primal iterates. Note that here we omit all factors independent of $\epsilon$ to show the linear rate with respect to $\epsilon$, while the exact number of iterations needed can be calculated by combining Theorems 4.6 and 4.7. On the other hand, if Assumption 4.3 holds, we assume $\epsilon < 4 \|X^T X\| L^2 / \sigma$, for otherwise it does not take any iteration to get an $\epsilon$-accurate solution for the primal problem, since $\alpha = 0$ already gives the required accuracy. Therefore, from Theorem 4.7 we need an $O(\epsilon^2)$-accurate dual solution to make the corresponding primal solution $\epsilon$-accurate. Thus by Theorem 4.6, we need

$$ O\left( \log \frac{1}{\epsilon^2} \right) = O\left( 2 \log \frac{1}{\epsilon} \right) = O\left( \log \frac{1}{\epsilon} \right) $$

iterations.

The results in Theorem 4.7 and Corollary 4.8 are implied by existing works (Bach, 2015; Shalev-Shwartz and Zhang, 2012), and the calculations take very little effort. It is also not too difficult to obtain sublinear rates following similar techniques for problems not satisfying (26), but we omit them to simplify and focus our description.

Our analysis above discusses the situation of exactly minimizing (8) at each round, while finding the exact solution when $B_t$ is not diagonal may be impractical. However, most algorithms that one would use to solve (8) attain global linear convergence provided that the objective function is strongly convex. Therefore, by the convergence rate of the algorithm applied to solve (8), we can ensure that the problem is solved at least to the extent that the objective is smaller than a certain negative value depending on the optimal objective value of (8). One can then ensure that in this case $\Delta_t$ is still bounded by some quantity related to the norm of the obtained update direction, and thus the approximate solution is still a descent direction. Hence we can find another matrix $B_t$ such that the update direction is the solution of (8) with this matrix, because the degree of freedom of the matrix is large enough; therefore the conditions in Lemma 4.1 and Theorem 4.6 are still satisfied. The key requirement here is that we need to control how exactly or how loosely (8) is solved at different iterations, in order to have $B_t$ lower- and upper-bounded over iterations so that we can find $c_1, c_2, c_3$. We omit the proof details of these cases, as the convergence theorems are just to provide a linear convergence guarantee regardless of the parameters in the worst case, and we also observe from the experiments that the empirical convergence speed is much faster than the theoretical worst-case one.

5. Related Works

Our algorithm can be viewed from two different aspects. If we simply consider solving (8), then it is similar to proximal (quasi-)Newton methods, with some specific choice of the second-order approximation. A generalization of it appears as the block coordinate descent method (Tseng and Yun, 2009), of which the proximal quasi-Newton method is the special case with only one block of variables. One thing worth noticing is that Tseng and Yun (2009) require the matrix in (8) to be positive definite, with a positive lower bound on the smallest eigenvalue over all iterations.
We relax this condition to allow $B_t$ to be indefinite or positive semidefinite when the $\xi^*$ term is strongly convex. This relaxation is used when Assumption 4.2 holds, and in this case we do not need to add a damping term to our second-order approximation; namely, we can set $a_2^t \equiv 0$ in (15).

On the other hand, our focus is on how to devise a good approximation of the Hessian matrix of the smooth term that works efficiently for distributed optimization. Works focusing on this direction for dual ERM problems include Pechyony et al. (2011); Yang (2013); Ma et al. (2017).

The work of Pechyony et al. (2011) discusses how to solve the SVM dual problem distributedly. This problem is a special case of (2); see Section 6.1 for more details. They proposed a method called DSVM-AVE that iteratively solves (8), with $B_t$ defined by (15) with $a_1^t \equiv 1$, $a_2^t \equiv 0$, to obtain the update direction, while the step size $\eta_t$ is fixed to $1/K$. Though they did not provide a theoretical convergence guarantee in Pechyony et al. (2011), we can see the reasoning behind this choice from (24). First, in the case of SVM, the objective is quadratic, with $\nabla^2 g^*(v) \equiv I$. Thus one can easily see that

$$ \|X d\|^2 \le K d^T \tilde{H}_{\alpha} d, \quad \forall d, \qquad (38) $$

with equality holding when $x_i$, $i = 1, \dots, N$, are identical and $d$ is a multiple of the vector of ones. Therefore, taking $\eta = 1/K$ in (24) and plugging in the bound (38), we can see that, since $\sigma = 1$ in the SVM case, minimizing (8) with a step size of $1/K$ leads to a decrease of the objective value.

In Yang (2013), the algorithm DisDCA is proposed to solve (2) under the same assumption that $g$ is strongly convex. They consider the case $c_i \equiv 1$ for all $i$, but the algorithm can be directly generalized to $c_i > 1$ easily. This method uses a specific algorithm, stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz and Zhang, 2013), to solve the local sub-problems, while the choice of $B_t$ is determined by the algorithm parameters. To solve the sub-problem on machine $k$, each time SDCA samples one entry $i_k$ from $J_k$ with replacement and minimizes the local objective with respect to $\alpha_{i_k}$. If each time machine $k$ samples $m_k$ entries, and we denote $m := \sum_{k=1}^{K} m_k$, then the first variant of DisDCA, called the basic variant in Yang (2013), picks $B_t$ in (8) as

$$ (B_t)_{i,j} = \begin{cases} \frac{m}{\sigma} x_i^T x_j & \text{if } x_i, x_j \text{ are from the same } X_k \text{ for some } k \text{ that is sampled}, \\ 0 & \text{else}, \end{cases} $$

and the step size is fixed to $1$. In this case, it is equivalent to splitting the data into $l$ blocks, where the minimization is conducted only with respect to the blocks sampled. If we let $I$ be the indices not sampled, then following the same reasoning as for (38), we have

$$ \|X d\|^2 \le d^T B_t d, \quad \forall d \text{ such that } d_I = 0, \qquad (39) $$

where equality holds when all $x_i$ are identical and $|I| = l - m$. Therefore, by combining (39) and (24), it is not hard to see that in this case minimizing $Q^{\alpha}_{B_t}$ directly results in a certain amount of function value decrease. The analysis in Yang (2013) then shows that the primal iterates $\{w^t\}$ obtained by substituting the dual iterates $\{\alpha^t\}$ into (6) converge linearly to the optimum when all $\xi_i$ have Lipschitz continuous gradients, and converge with an $O(1/\epsilon)$ iteration complexity when all $\xi_i$ are Lipschitz continuous, using proof techniques similar to those in Shalev-Shwartz and Zhang (2012). As we noted in Section 4, this is actually the same as showing the convergence rate of the dual objective and then relating it to the primal objective. The key difference from our analysis is that they do not assume the additional structure (26) of the dual problem when $\xi^*$ is not strongly convex, hence the sublinear rate.

The second approach in Yang (2013), called the practical variant, considers

$$ (B_t)_{i,j} = \begin{cases} \frac{K}{\sigma} x_i^T x_j & \text{if } \pi(i) = \pi(j), \\ 0 & \text{else}. \end{cases} $$


More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Homework 3. Convex Optimization /36-725

Homework 3. Convex Optimization /36-725 Homework 3 Convex Optimization 10-725/36-725 Due Friday October 14 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Journal of Machine Learning Research 9 (2008) 1369-1398 Submitted 1/08; Revised 4/08; Published 7/08 Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Kai-Wei Chang Cho-Jui

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Distributed Optimization with Arbitrary Local Solvers

Distributed Optimization with Arbitrary Local Solvers Industrial and Systems Engineering Distributed Optimization with Arbitrary Local Solvers Chenxin Ma, Jakub Konečný 2, Martin Jaggi 3, Virginia Smith 4, Michael I. Jordan 4, Peter Richtárik 2, and Martin

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

1 Kernel methods & optimization

1 Kernel methods & optimization Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Training Support Vector Machines: Status and Challenges

Training Support Vector Machines: Status and Challenges ICML Workshop on Large Scale Learning Challenge July 9, 2008 Chih-Jen Lin (National Taiwan Univ.) 1 / 34 Training Support Vector Machines: Status and Challenges Chih-Jen Lin Department of Computer Science

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a

More information

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines vs for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines Ding Ma Michael Saunders Working paper, January 5 Introduction In machine learning,

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Iteration Complexity of Feasible Descent Methods for Convex Optimization

Iteration Complexity of Feasible Descent Methods for Convex Optimization Journal of Machine Learning Research 15 (2014) 1523-1548 Submitted 5/13; Revised 2/14; Published 4/14 Iteration Complexity of Feasible Descent Methods for Convex Optimization Po-Wei Wang Chih-Jen Lin Department

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

The Common-directions Method for Regularized Empirical Risk Minimization

The Common-directions Method for Regularized Empirical Risk Minimization The Common-directions Method for Regularized Empirical Risk Minimization Po-Wei Wang Department of Computer Science National Taiwan University Taipei 106, Taiwan Ching-pei Lee Department of Computer Sciences

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

A Second-Order Method for Strongly Convex l 1 -Regularization Problems

A Second-Order Method for Strongly Convex l 1 -Regularization Problems Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO-13-11 June, 13 Abstract

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Solving the SVM Optimization Problem

Solving the SVM Optimization Problem Solving the SVM Optimization Problem Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN No. 4 5 EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS y E. den oef, D. den Hertog May 4 ISSN 94-785 Efficient Line Searching for Convex Functions Edgar den oef Dick den Hertog 3 Philips Research Laboratories,

More information

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information