Distributed Block-diagonal Approximation Methods for Regularized Empirical Risk Minimization


Ching-pei Lee
Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA

Kai-Wei Chang
Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA

Editor: Michael Mahoney

Abstract

In recent years, designing distributed optimization algorithms for empirical risk minimization (ERM) has become an active research topic, mainly because of the practical need to deal with the huge volume of data. In this paper, we propose a general framework for training an ERM model by solving its dual problem in parallel over multiple machines. Viewed as special cases of our framework, several existing methods can be better understood. Our method provides a versatile approach for many large-scale machine learning problems, including linear binary/multiclass classification, regression, and structured prediction. We show that our method, compared with existing approaches, enjoys global linear convergence for a broader class of problems and achieves faster empirical performance.

1. Introduction

With the rapid growth of data volume and model complexity, designing efficient learning algorithms has become increasingly important. Distributed optimization techniques, which decompose a large optimization problem into sub-problems and distribute the computational burden across multiple machines, have gained a great amount of interest. This type of approach is particularly useful when the optimization problem involves massive computation and/or the data set is stored on multiple machines because it cannot fit in the capacity of a single node. However, the communication cost and the asynchronous nature of the process challenge the design of efficient optimization algorithms in the distributed environment.

In this paper, we study distributed optimization algorithms for training machine learning models that can be represented under the regularized empirical risk minimization (ERM) framework. These models include binary/multi-class classification, regression, and structured prediction models, covering a variety of learning tasks. We specifically focus on linear models, which have been shown successful in dealing with large-scale data thanks to their efficiency and interpretability.^1 Given a set of training data $\{X_i\}_{i=1}^l$, $X_i \in \mathbb{R}^{n \times c_i}$, $c_i \in \mathbb{N}$, where $l, n > 0$ are the number of instances and the dimension of the model respectively, regularized ERM models solve the following optimization problem:

$$ \min_{w \in \mathbb{R}^n} \; f^P(w) := g(w) + \sum_{i=1}^{l} \xi_i\left(X_i^T w\right). \qquad (1) $$

^1. Linear models allow us to interpret the value of each feature from the model.
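To make the notation concrete, here is a minimal sketch (not from the paper; the function names and the squared-$\ell_2$ default regularizer are assumptions for illustration) of evaluating the primal objective (1):

```python
import numpy as np

def primal_objective(w, Xs, losses, reg=lambda w: 0.5 * w @ w):
    """Evaluate f^P(w) = g(w) + sum_i xi_i(X_i^T w) from Eq. (1).

    w:      model vector in R^n
    Xs:     list of l matrices, X_i of shape (n, c_i)
    losses: list of l callables, xi_i mapping R^{c_i} -> R
    reg:    the regularizer g; squared-l2 here as an assumed default
    """
    return reg(w) + sum(xi(X.T @ w) for X, xi in zip(Xs, losses))
```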

In the literature, $g$ and $\xi_i$ are usually called the regularization term and the loss term, respectively. We assume that $f^P$ is a proper, closed convex function. The definition in problem (1) is general and covers a variety of learning problems. To unify the definitions of different learning problems, we encode the true labels (i.e., $y_i \in \mathcal{Y}_i$) in the loss term $\xi_i$ and the input data $X_i$. For some learning problems, the space of $X_i$ is spanned by a set of variables whose size may vary for different $i$. Therefore, we represent $X_i$ as an $n \times c_i$ matrix. For example, in the part-of-speech tagging task, where each input sentence consists of several words, $c_i$ represents the number of words in the $i$-th sentence. We provide a detailed discussion of the loss terms for different learning problems in Section 6. Regarding the regularization term, common choices of $g$ include the squared-$\ell_2$ norm, the $\ell_1$ norm, and the elastic net (Zou and Hastie, 2005) that combines both.

In many applications, it is preferable to solve the Lagrange dual problem of problem (1), because the dual problem has better mathematical properties, making optimization easier. The Lagrange dual problem of (1) is

$$ \min_{\alpha \in \Omega} \; f(\alpha) := g^*(X\alpha) + \sum_{i=1}^{l} \xi_i^*(-\alpha_i), \qquad (2) $$

where

$$ X := [X_1, \dots, X_l], \quad \alpha := \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_l \end{pmatrix}, $$

$\alpha_i \in \mathbb{R}^{c_i}$ is the dual variable associated with $X_i$, and for any function $h(\cdot)$, $h^*(\cdot)$ is its convex conjugate:

$$ h^*(w) := \max_{z \in \operatorname{dom}(h)} \; z^T w - h(z), \quad \forall w. $$

The domain is defined to be $\Omega := \prod_{i=1}^{l} \operatorname{dom}(\xi_i^*) \subseteq \mathbb{R}^{\sum_{i=1}^{l} c_i}$. We use the following notations to simplify our description:

$$ \xi(X^T w) := \sum_{i=1}^{l} \xi_i(X_i^T w), \quad \xi^*(-\alpha) := \sum_{i=1}^{l} \xi_i^*(-\alpha_i), \quad G^*(\alpha) := g^*(X\alpha). $$

In the single-core case, optimization methods for the dual ERM problem (2) have been widely studied (see, for example, Yuan et al. (2012) for a review). However, most of the state-of-the-art single-core algorithms for the dual ERM problem are inherently sequential and hard to parallelize. Moreover, in multi-machine environments, the cost of communication is relatively high; therefore, it is usually the bottleneck for a distributed optimizer. As a result, adapting these single-core methods to distributed environments for solving (2) is nontrivial. In particular, it is important to reduce the communication cost to make the optimization efficient. Consequently, algorithms with faster convergence rates are desirable for distributed optimization, because fewer iterations imply fewer communication rounds and thus lower communication cost. We observe that without careful consideration of this issue, existing distributed dual solvers do not achieve satisfactory training speed.

Existing distributed algorithms like Chen et al. (2014); Zhuang et al. (2015); Lin et al. (2014); Zhang and Lin (2015); Lee et al. (2015b, 2017) often focus on using the gradient and/or the Hessian information to solve the primal form (1) of ERM, whose computation either can be parallelized naturally or requires only some modification in the implementation to adapt to distributed environments. However, in this paper, we show that with a careful design, dual approaches can be competitive with primal approaches and enjoy sound theoretical properties. Some methods focusing on theoretical communication efficiency reduce the number of communication rounds significantly but are impractical to use, because they either require additional assumptions on the distribution of data points or have excessively high computational cost per iteration.
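As a quick illustration of these conjugates (a standard worked example, not taken from the paper), the squared-$\ell_2$ regularizer and the hinge loss with $c_i = 1$ give

$$ g(w) = \frac{1}{2}\|w\|^2 \;\Rightarrow\; g^*(v) = \max_z \; z^T v - \frac{1}{2}\|z\|^2 = \frac{1}{2}\|v\|^2, \qquad \nabla g^*(v) = v, $$

$$ \xi_i(z) = \max(0, 1 - z) \;\Rightarrow\; \xi_i^*(-\alpha_i) = -\alpha_i, \quad \text{with } -\alpha_i \in \operatorname{dom}(\xi_i^*) \text{ iff } \alpha_i \in [0, 1], $$

so that (2) recovers the familiar (up to scaling) box-constrained quadratic dual of the SVM.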

On the contrary, our focus in this work is designing a practical algorithm with strong empirical performance in terms of overall running time for practitioners. In particular, it is well known that first-order methods can achieve better theoretical iteration complexity, but second-order methods like truncated Newton or (limited-memory) BFGS tend to need fewer iterations in practice, leading to relatively short running time. We therefore design a practically efficient approach that is similar to the second-order methods but without requiring lengthy rounds of communication for computing an (approximate) Hessian at each iteration.

Our distributed learning algorithm aims to solve (2). At each iteration, our framework minimizes a problem consisting of a second-order approximation of the non-separable $g^*(X\alpha)$ term in (2), with the quadratic term being a positive semidefinite block-diagonal matrix, plus the original separable loss conjugate $\xi^*(-\alpha)$. We consider the setting where the training instances are distributed across $K$ machines, and the instances on machine $k$ are $\{X_i\}_{i \in J_k}$. In our setting, the $J_k$ are disjoint index sets such that

$$ \bigcup_{k=1}^{K} J_k = \{1, \dots, l\}, \quad J_i \cap J_k = \emptyset, \; \forall i \neq k. \qquad (3) $$

No further assumption on how those instances are distributed across machines is imposed. Therefore, the data distributions on different machines can be totally different. Without loss of generality, we assume that

$$ j_0 := 0, \quad j_K := l, \quad J_k := \{j_{k-1} + 1, \dots, j_k\}, \quad k = 1, \dots, K. $$

For any vector $v \in \mathbb{R}^l$, $v_{J_k}$ denotes the sub-vector in $\mathbb{R}^{|J_k|}$ that contains the coordinates of $v$ indexed by $J_k$. When not specified, $\|\cdot\|$ always denotes the Euclidean norm.

We discuss how to choose the approximation to make our method efficient in terms of empirical convergence and communication. After solving the approximated problem, we conduct a line search that requires only negligible computational cost and $O(1)$ communication to ensure sufficient decrease of the function value. With the line search procedure, our algorithm achieves faster empirical performance. By utilizing a relaxed condition, our method is able to achieve global linear convergence for many problems whose dual objective is non-strongly convex, including the popular support vector machine (SVM) model proposed by Boser et al. (1992); Vapnik (1995), and the structured SVM (SSVM) problem (Tsochantaridis et al., 2005; Taskar et al., 2004). In other words, our algorithm requires only $O(\log(1/\epsilon))$ iterations, or equivalently rounds of communication, to obtain an $\epsilon$-accurate solution for (2). The theoretical analysis further shows that obtaining an $\epsilon$-accurate solution for the original ERM problem (1) likewise takes only $O(\log(1/\epsilon))$ iterations. We also discuss the main differences between our approach and existing distributed algorithms for (2). Experiments demonstrate that our algorithm is significantly faster than existing distributed solvers for (2).

We note that Zheng et al. (2017) recently proposed an accelerated method for solving the dual ERM problem in a distributed setting. Their acceleration technique from Shalev-Shwartz and Zhang (2016) is similar to the catalyst framework for convex optimization (Lin et al., 2015). In essence, at every iteration, they add a term $(\kappa/2)\|w - z\|_2^2$ to (1) and approximately solve the dual problem of the modified primal problem by existing distributed methods for (2). The solution is then used to generate $z$ for the next iteration.
Like the catalyst framework, which can be combined with any convex optimization algorithm, the acceleration technique in Zheng et al. (2017) can also be applied on top of our framework easily by letting the solver of the modified dual problem be our algorithm. Therefore, we focus our comparison on methods that directly solve the original optimization problem (2), since methods that are faster for the original problem are expected to perform better as well when combined with the acceleration technique. Special cases of our framework were published earlier as conference and workshop papers (Lee and Roth, 2015; Lee et al., 2015a). In this work we unify the results and extend the previous work to a general setting. We also provide more thorough theoretical and empirical analysis. The remainder of the paper is organized as follows. We give an overview of the algorithm in Section 2.

The implementation details for distributed environments are shown in Section 3. In Section 4, we analyze the convergence properties of our algorithm. Section 5 discusses related works for distributed ERM optimization. The application of our algorithm to different ERM problems is described in Section 6. Numerical experiments are provided in Section 7. Some possible extensions and limitations are discussed in Section 8. We then make some concluding remarks in Section 9. The program used in this work is available at

2. Algorithm

Throughout this work, we make the following assumption.

Assumption 2.1 There exists $\sigma > 0$ such that the regularizer $g$ in the primal problem (1) is $\sigma$-strongly convex. Namely,

$$ g(\alpha w_1 + (1-\alpha) w_2) \le \alpha g(w_1) + (1-\alpha) g(w_2) - \frac{\sigma \alpha (1-\alpha)}{2} \|w_1 - w_2\|^2, \quad \forall w_1, w_2 \in \mathbb{R}^n, \; \alpha \in [0, 1]. \qquad (4) $$

Note that the goal of solving the dual problem is to obtain a solution to the original primal problem (1). It can easily be shown that strong duality between (1) and (2) holds, which means that any pair of primal and dual optimal solutions $(w^*, \alpha^*)$ satisfies the following relation:

$$ f^P(w^*) = -f(\alpha^*). $$

By (Hiriart-Urruty and Lemaréchal, 2001, Part E, Theorem 4.2.2), since $g$ is $\sigma$-strongly convex, $g^*$ is differentiable and has $(1/\sigma)$-Lipschitz continuous gradient everywhere. This also indicates that even if $g$ is extended-valued, $g^*$ is still finite everywhere; hence the only constraint on the feasible region is $\alpha \in \Omega$. From the KKT optimality conditions, we further have

$$ w^* = \nabla g^*(X \alpha^*). \qquad (5) $$

Although (5) only holds at the optimum, we can still define $w(\alpha)$ as the primal solution associated with $\alpha$ by the same relation for any $\alpha$ feasible for (2):

$$ w(\alpha) := \nabla g^*(X \alpha). \qquad (6) $$

Our algorithm is an iterative descent method for solving (2). Starting with an arbitrary feasible $\alpha^0$, it generates a sequence of iterates $\{\alpha^1, \alpha^2, \dots\}$ with the property that $f(\alpha^i) \le f(\alpha^j)$ if $i > j$. The iterates are updated by a direction $\Delta\alpha^t$ and a step size $\eta_t \ge 0$:

$$ \alpha^{t+1} = \alpha^t + \eta_t \Delta\alpha^t, \quad \forall t \ge 0. \qquad (7) $$

In the objective function of (2), the elements in $\xi^*$ usually can be computed separately. Therefore, this term can be optimized directly in a coordinate-wise manner. However, $g^*$ tends to be complex and hard to optimize. Thus, we use a second-order approximation that results in an easier-to-solve problem as a surrogate. As we mentioned above, $g^*$ and thus $G^*$ are differentiable and the gradients are Lipschitz continuous. Given the current iterate $\alpha^t$, we solve

$$ \Delta\alpha^t := \arg\min_{\Delta\alpha} \; Q^{\alpha^t}_{B_t}(\Delta\alpha) := \nabla G^*(\alpha^t)^T \Delta\alpha + \frac{1}{2} (\Delta\alpha)^T B_t \Delta\alpha + \xi^*(-\alpha^t - \Delta\alpha) \qquad (8) $$

for some symmetric $B_t$, and then take $\Delta\alpha^t$ as the update direction. The matrix $B_t$ can vary over iterations and has a wide range of choices, depending on the goal. One usually wants to pick $B_t$ such that (8) is easy to solve, to reduce the cost per iteration, or such that $Q^{\alpha^t}_{B_t}(\Delta\alpha^t)$ is a tighter approximation of $f(\alpha^t + \Delta\alpha^t) - f(\alpha^t)$, so that it takes fewer iterations to obtain a good solution. The extreme case of the former is to use a diagonal $B_t$, while the other extreme for the latter is using $B_t$ as the Hessian. We will show in Section 4 that as long as the whole objective of (8) is strongly convex and $B_t$ is positive semidefinite, the obtained $\Delta\alpha^t$ will be a descent direction.
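The overall scheme of (7)-(8) can be summarized by the following minimal single-machine sketch (an illustration only; `solve_subproblem` and `line_search` are hypothetical placeholders for the pieces developed in Sections 3.1-3.3):

```python
def block_diagonal_dual_method(alpha0, grad_Gstar, solve_subproblem,
                               line_search, iterations=100):
    """Sketch of the iterative scheme: direction from (8), update by (7).

    grad_Gstar(alpha)             -> nabla G*(alpha) = X^T nabla g*(X alpha)
    solve_subproblem(alpha, g)    -> minimizer of the quadratic model (8)
    line_search(alpha, direction) -> step size satisfying the Armijo rule (9)
    """
    alpha = alpha0
    for t in range(iterations):
        g = grad_Gstar(alpha)
        d_alpha = solve_subproblem(alpha, g)   # update direction from (8)
        eta = line_search(alpha, d_alpha)      # step size eta_t
        alpha = alpha + eta * d_alpha          # update (7)
    return alpha
```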

We then consider two different line search possibilities to obtain the step size $\eta_t$. The first is the exact line search:

$$ \eta_t = \arg\min_{\eta \in \mathbb{R}} \; f(\alpha^t + \eta \Delta\alpha^t). $$

However, this approach can be conducted only when the objective function has a specific structure that allows it. In the general case, we consider a backtracking line search with the Armijo rule. Given $\beta \in (0, 1)$, $\tau \in (0, 1)$, this procedure finds the smallest $i \ge 0$ such that $\eta = \beta^i$ satisfies

$$ f(\alpha^t + \eta \Delta\alpha^t) \le f(\alpha^t) + \eta \tau \Delta_t, \qquad (9) $$

where

$$ \Delta_t := \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t), \qquad (10) $$

and take $\eta_t = \eta$.

3. Distributed Implementation for Dual ERM

We now discuss how to adapt the algorithmic framework in Section 2 to make distributed optimization for the dual ERM problem efficient. In particular, we will discuss the choice of $B_t$ in (8), the trick of making line search efficient, and how to solve (8) with the minimum amount of communication. For the ease of algorithm description, we will use the following representation of $X$ and $\alpha$:

$$ X = [x_1, \dots, x_N], \quad \alpha = (\alpha_1, \dots, \alpha_N)^T, \qquad (11) $$

where $N := \sum_{i=1}^{l} c_i$ is the number of columns of $X$ as well as the dimension of $\alpha$. The index sets for the representation (11) corresponding to $J_k$ are denoted by $\bar{J}_k \subseteq \{1, \dots, N\}$, $k = 1, \dots, K$. We define $\pi(i) = k$ if $i \in \bar{J}_k$.

3.1 Update Direction

In this section, we discuss how to select $B_t$ such that the objective of (8) is strongly convex, the sub-problem can be solved efficiently without communication, and the resulting problem is a good approximation of (2). Notice that we distribute the data to multiple machines, and the $k$-th machine only stores and handles $X_i$ and the corresponding $\alpha_i$ for $i \in J_k$. Therefore, in order to reduce the communication cost, we need to pick $B_t$ in a way that (8) can be decomposed into independent sub-problems, and each sub-problem only involves a subset of the data. In this way, each sub-problem can be solved locally on one machine without communicating with others. According to the partition (3), we should consider a block-diagonal $B_t$ such that

$$ (B_t)_{i,j} = 0, \quad \text{if } \pi(i) \neq \pi(j). \qquad (12) $$

Any $B_t$ satisfying (12) falls in the block-diagonal approximation framework of our method. In most cases, we would like to incorporate higher-order information as much as possible to obtain fast convergence, which will then reduce the rounds of expensive communication. A natural choice would then be the Hessian matrix

$$ H_{\alpha^t} := \nabla^2 G^*(\alpha^t) = X^T \nabla^2 g^*(X \alpha^t) X. \qquad (13) $$

However, the Hessian matrix is usually dense and does not satisfy (12). Therefore, we consider its block-diagonal approximation $\tilde{H}_{\alpha^t}$:

$$ (\tilde{H}_{\alpha^t})_{i,j} = \begin{cases} (H_{\alpha^t})_{i,j} & \text{if } \pi(i) = \pi(j), \\ 0 & \text{otherwise}. \end{cases} \qquad (14) $$

When the function $g$ is complicated such that $\nabla^2 g^*$ cannot be calculated easily, one can use the identity matrix, or some diagonal approximation, as a substitute in (13). Note that since $(\nabla^2 G^*(\alpha^t))_{i,j} = x_i^T \nabla^2 g^*(X\alpha^t) x_j$, if each machine maintains the information of $X\alpha^t$, the entries of (14) can be decomposed into parts each of which only requires information from the data enumerated in one index set $\bar{J}_k$. Thus the sub-problems can indeed be solved independently on the corresponding machines without communication.

Another concern with using the Hessian matrix is that it might be only positive semidefinite, so the requirement that the objective of (8) be strongly convex is not satisfied when $\xi^*(-\alpha)$ is non-strongly convex. In this case, we can add a damping term to $B_t$ to ensure strong convexity. Therefore, our choice of $B_t$ can be generalized to the following formulation:

$$ B_t = a_1^t \tilde{H}_{\alpha^t} + a_2^t I, \quad \text{for some } a_1^t, a_2^t \ge 0. \qquad (15) $$

The choices of $a_1^t$ and $a_2^t$ might depend on the problem structure and the application. In most cases, $a_2^t = 0$ should be considered, especially when it is known that either $\xi^*(-\alpha)$ is already strongly convex or $\tilde{H}_{\alpha^t}$ is positive definite. For $a_1^t$, practical results (Pechyony et al., 2011; Yang, 2013) suggest that $a_1^t \in [1, K]$ leads to fast convergence, while we prefer $a_1^t \equiv 1$ as it is a closer approximation of the Hessian matrix. One special case worth noticing is that when $a_1^t = 0$, our framework reduces to proximal gradient.

In solving (8) with our choice (15) and $a_1^t \neq 0$, each machine needs the information of $X\alpha^t$ to calculate both $(B_t)_{\bar{J}_k, \bar{J}_k}$ and

$$ \left( \nabla G^*(\alpha^t) \right)_{\bar{J}_k} = X_{\bar{J}_k}^T \nabla g^*(X \alpha^t). \qquad (16) $$

Therefore, after each iteration of updating $\alpha^t$, we need to synchronize the information

$$ v^t := X \alpha^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \alpha^t_j $$

through inter-machine communication. Communicating this $O(n)$ vector is more suitable than communicating either the Hessian or the whole $X$ together with $\alpha^t$. However, because we need the update direction to conduct line search, the better approach is to synchronize the following vector instead:

$$ \Delta v^t := X \Delta\alpha^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \Delta\alpha^t_j, \qquad (17) $$

and then update $v^{t+1} = v^t + \eta_t \Delta v^t$ locally after the step size $\eta_t$ is determined. More explanation of communicating this vector will be provided in the next section when we discuss the details of conducting line search.
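The synchronization of (17) is a single allreduce of an $n$-dimensional vector. A minimal sketch (assuming mpi4py and NumPy; names are illustrative, not from the paper's released code):

```python
import numpy as np
from mpi4py import MPI

def sync_delta_v(X_local, d_alpha_local, comm=MPI.COMM_WORLD):
    """Form the local part of Delta v = X Delta alpha from the columns
    stored on this machine, then allreduce so every machine holds the
    full O(n) vector of Eq. (17)."""
    dv = X_local @ d_alpha_local                   # local partial sum
    comm.Allreduce(MPI.IN_PLACE, dv, op=MPI.SUM)   # one O(n) communication
    return dv

# After the step size eta_t is chosen, v is updated locally:
# v += eta_t * dv, with no further communication.
```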

3.2 How to Conduct Line Search Efficiently

After the update direction $\Delta\alpha$ is decided by solving (8), we need to conduct a line search to ensure sufficient decrease of the function value. For the backtracking variant, on the right-hand side of (9), the first term is already available from the previous iteration, and we thus only need to calculate (10). From (16), this value can be calculated by

$$ \Delta_t = \nabla g^*(v^t)^T \Delta v^t + \left( \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \right). \qquad (18) $$

To obtain the terms related to $\xi^*$, we need an $O(1)$ communication, and no additional computation is needed because the information of $\xi^*(-\alpha^t - \Delta\alpha^t)$ can easily be maintained in the process of solving (8). For the first term in (18), because each machine has full information of $v^t$, $\Delta v^t$, and hence $\nabla g^*(v^t)$, the calculation can be conducted distributedly. Thus we can combine the local partial sums of all terms in (18) into one scalar to communicate. One can also see from this calculation that synchronizing $\Delta v^t$ is inevitable for computing the required value efficiently.

For the left-hand side of (9), we use the form (2) to illustrate. The $\xi_i^*$ terms are calculated distributedly in nature, where normally each term only costs $O(1)$. For the $g^*(v)$ term, if it is separable, then the computation is also parallelizable. Further, in some special cases we may even have a closed-form solution to calculate it in $O(1)$ time for different $\eta$ if some values are precomputed. For example, when $g^*(v) = \frac{1}{2}\|v\|^2$, we have

$$ g^*(v + \eta \Delta v) = \frac{1}{2} \left( \|v\|^2 + \eta^2 \|\Delta v\|^2 + 2 \eta v^T \Delta v \right). \qquad (19) $$

In this case we can compute $\|\Delta v\|^2$ and $v^T \Delta v$ distributedly in advance; then the calculation of (19) with different $\eta$ requires only $O(1)$ computation and does not need any communication. For the general case, though the computation might not be this low, by maintaining both $v$ and $\Delta v$, the calculation of $g^*(v + \eta \Delta v)$ requires no more communication but only at most $O(n)$ computation locally, and this cost is negligible as other parts of the algorithm already involve more expensive computations. The line search procedure is summarized in Algorithm 1.

The exact line search variant is possible only when $\partial f(\alpha + \eta \Delta\alpha) / \partial \eta = 0$ has an analytic solution, which requires both $g^*$ and $\xi^*$ to be differentiable at least in some open set. For example, when the objective $f$ is quadratic, the optimal step size can be obtained by

$$ \frac{\partial f(\alpha + \eta \Delta\alpha)}{\partial \eta} = 0 \quad \Rightarrow \quad \eta = -\frac{\nabla f(\alpha)^T \Delta\alpha}{\Delta\alpha^T \nabla^2 f(\alpha) \Delta\alpha}, \qquad (20) $$

and we then project back to the largest $\eta \in [0, 1]$ such that $\alpha + \eta \Delta\alpha \in \Omega$. Another way to approximate the exact line search is to consider a bisection method if the above assumption for an analytic solution does not hold. In this case, we can utilize the trick of maintaining both $v$ and $\Delta v$ mentioned above to reduce the communication cost of re-evaluating the objective value.

Algorithm 1: Distributed backtracking line search
  Input: $\alpha, \Delta\alpha \in \mathbb{R}^N$, $\beta \in (0,1)$, $\tau \in (0,1)$, $f(\alpha) \in \mathbb{R}$, $v = X\alpha$, $\Delta v = X\Delta\alpha$
  Form a partition $\{\hat{J}_k\}_{k=1}^{K}$ of $\{1, \dots, n\}$
  Distributedly calculate  // $O(1)$ communication
    $\Delta_t = \sum_{k=1}^{K} \left( \nabla g^*(v^t)_{\hat{J}_k}^T (\Delta v^t)_{\hat{J}_k} + \sum_{j \in \bar{J}_k} \left( \xi_j^*(-(\alpha_j + \Delta\alpha_j)) - \xi_j^*(-\alpha_j) \right) \right)$
  $\eta \leftarrow 1$
  Calculate $f(\alpha + \eta \Delta\alpha)$ using $v$ and $\eta \Delta v$  // $O(1)$ communication
  while $f(\alpha + \eta \Delta\alpha) > f(\alpha) + \eta \tau \Delta_t$ do
    $\eta \leftarrow \eta \beta$
    Calculate $f(\alpha + \eta \Delta\alpha)$ using $v$ and $\eta \Delta v$  // $O(1)$ communication
  end
  Output: $\eta$, $f(\alpha + \eta \Delta\alpha)$
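For the special case $g^*(v) = \frac{1}{2}\|v\|^2$, the backtracking loop of Algorithm 1 can be written with $O(1)$ work per trial step size, as the following minimal sketch shows ($v$ and $dv$ are NumPy vectors; the helper `xi_term` stands in for the distributed $\xi^*$ sums and is a hypothetical name, not an excerpt of the paper's code):

```python
def backtracking_step(f_alpha, Delta_t, xi_term, v, dv, beta=0.5, tau=0.1):
    """Find eta = beta^i satisfying the Armijo rule (9), using the
    precomputed scalars of Eq. (19) so each trial costs O(1).

    Delta_t : the quantity (10)/(18); negative for a descent direction
    xi_term : eta -> sum_i xi_i*(-(alpha + eta * d_alpha)_i) (hypothetical)
    """
    vv, dvdv, vdv = v @ v, dv @ dv, v @ dv      # computed once, Eq. (19)
    eta = 1.0
    while True:
        g_star = 0.5 * (vv + eta * eta * dvdv + 2.0 * eta * vdv)  # g*(v + eta dv)
        if g_star + xi_term(eta) <= f_alpha + eta * tau * Delta_t:
            return eta
        eta *= beta
```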

3.3 Local Solver at Each Machine

When we pick $B_t$ satisfying (12), problem (8) can be decomposed into $K$ independent sub-problems of the following form:

$$ \min_{\Delta\alpha_{\bar{J}_k} \in \mathbb{R}^{|\bar{J}_k|}} \; \nabla G^*(\alpha^t)_{\bar{J}_k}^T \Delta\alpha_{\bar{J}_k} + \frac{1}{2} \Delta\alpha_{\bar{J}_k}^T (B_t)_{\bar{J}_k, \bar{J}_k} \Delta\alpha_{\bar{J}_k} + \sum_{i \in J_k} \xi_i^*\left( -\alpha^t_i - \Delta\alpha_i \right). \qquad (21) $$

Note that since all the information needed in (21) is available on machine $k$, the sub-problems can be solved without any inter-machine communication. Our framework does not limit the solver for (8). Instead, one can take an arbitrary local solver that is suitable for the specific problem. For example, we can consider (block) coordinate descent, (accelerated) proximal methods, and so on. Because each sub-problem is very close to the original dual problem of regularized ERM shown in (2), the cost is also similar, and one can consider methods that are efficient in the single-machine setting. In our experiments, we will use random-permuted cyclic coordinate descent for the dual ERM problem (Hsieh et al., 2008; Yu et al., 2011; Chang and Yih, 2013) as our local solver, since this approach has proven empirically efficient in the single-core setting. Note that although the theoretical convergence rate of cyclic coordinate descent can be up to $O(l^2)$ times worse than that of its randomized, non-cyclic counterpart (Sun and Ye, 2016), its random-permuted version is known to behave similarly to the randomized, non-cyclic version, while preserving the deterministic convergence guarantee of cyclic coordinate descent. Other options can be adopted at one's discretion for specific problems or data sets, but such discussion is beyond the scope of this work.
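As an illustration of such a local solver, the following is a minimal random-permuted cyclic coordinate descent sketch for (21), specialized (as an assumption, e.g., for the SVM dual) to a box-constrained quadratic $\min_d \; p^T d + \frac{1}{2} d^T Q d$ subject to $\mathrm{lo} \le \alpha + d \le \mathrm{hi}$, where $Q$ is the local diagonal block of $B_t$ and $p$ the local gradient block:

```python
import numpy as np

def local_cd(Q, p, alpha, lo, hi, epochs=10, seed=0):
    """Random-permuted cyclic CD for a box-constrained quadratic
    sub-problem; a sketch of the local-solver style, not the paper's code."""
    rng = np.random.default_rng(seed)
    d = np.zeros_like(alpha)
    grad = p.copy()                         # gradient of the model: p + Q d
    for _ in range(epochs):
        for i in rng.permutation(len(d)):   # a fresh permutation per pass
            if Q[i, i] <= 0.0:
                continue
            # single-variable minimization, projected onto the box
            new_di = np.clip(d[i] - grad[i] / Q[i, i],
                             lo[i] - alpha[i], hi[i] - alpha[i])
            grad += Q[:, i] * (new_di - d[i])   # keep grad = p + Q d
            d[i] = new_di
    return d
```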

3.4 Output the Best Primal Solution

The proposed algorithm is a descent method for the dual problem (2). In other words, it is guaranteed that $f(\alpha^{t_1}) < f(\alpha^{t_2})$ provided $t_1 > t_2$. However, when we obtain the primal iterates $w^t$ from (6), there is no guarantee that the corresponding primal objective is strictly decreasing. Although we are able to prove that the primal objective converges R-linearly in Section 4, we still cannot guarantee that the latest primal iterate is the one with the lowest primal objective. This situation happens for all methods that solve the dual problem. To deal with this problem, we keep track of the primal objective of each iterate, and when the algorithm is terminated, we take the iterate with the lowest primal objective as the final model. This is known as the pocket approach in the literature of the perceptron algorithm (Gallant, 1990).

3.5 Stopping Criterion

If we consider only the sub-optimality of $f(\alpha^t)$, then we can use some simple indicators of sub-optimality, such as the norm of the update direction or the size of $\Delta_t$, whose values are zero at the optima, as a practical stopping condition. However, since we have already computed the primal objective at each iteration as discussed above, we can directly use it as the stopping criterion with no additional cost, and this criterion should be a more direct indicator of the model quality than others. Therefore, we consider the following stopping criterion:

$$ f(\alpha^t) + f^P(w(\alpha^t)) \le \epsilon \left( f(\alpha^0) + f^P(w(\alpha^0)) \right), $$

with some user-specified $\epsilon \ge 0$. The overall procedure for distributedly optimizing (2) discussed in this section is described in Algorithm 2.

Algorithm 2: Distributed block-diagonal approximation method for the dual ERM problem (2)
  Input: $a_1, a_2 \ge 0$ but not both $0$, a feasible $\alpha^0$ for (2), $\epsilon \ge 0$
  $\bar{f} \leftarrow \infty$, $\bar{w} \leftarrow 0$
  Compute $v^0 = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \alpha^0_j$ and $\xi^*(-\alpha^0)$  // $O(n)$ communication
  Compute $f(\alpha^0)$ from $v^0$ and $\xi^*(-\alpha^0)$
  for $t = 0, 1, 2, \dots$ do
    Compute $f^P(w(\alpha^t))$ by (6)  // $O(1)$ communication
    if $f^P(w(\alpha^t)) < \bar{f}$ then
      $\bar{f} \leftarrow f^P(w(\alpha^t))$, $\bar{w} \leftarrow w(\alpha^t)$
    end
    if $f(\alpha^t) + f^P(w(\alpha^t)) \le \epsilon \left( f(\alpha^0) + f^P(w(\alpha^0)) \right)$ then
      Output $\bar{w}$ and terminate
    end
    Each machine obtains $\Delta\alpha^t_{\bar{J}_k}$ by solving the corresponding block of (8) independently and in parallel using the local data, with $B_t$ decided by (15)
    Communicate $\Delta v^t = \sum_{k=1}^{K} \sum_{j \in \bar{J}_k} x_j \Delta\alpha^t_j$  // $O(n)$ communication
    Variant I: conduct line search through Algorithm 1 to obtain $\eta_t$
    Variant II: solve $\eta_t \leftarrow \arg\min_{\eta} f(\alpha^t + \eta \Delta\alpha^t)$
    Each machine conducts in parallel: $\alpha^{t+1}_{\bar{J}_k} \leftarrow \alpha^t_{\bar{J}_k} + \eta_t \Delta\alpha^t_{\bar{J}_k}$, $v^{t+1} \leftarrow v^t + \eta_t \Delta v^t$
  end
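The pocket strategy of Section 3.4 and the stopping rule of Section 3.5 amount to the following bookkeeping around the main loop (a schematic sketch with hypothetical helpers, not the released implementation):

```python
import math

def track_and_stop(iterates, primal_obj, dual_obj, eps):
    """iterates yields (alpha^t, w(alpha^t)); primal_obj/dual_obj return
    f^P(w) and f(alpha). Keeps the best primal iterate (the 'pocket') and
    stops when the duality gap falls below eps times the initial gap."""
    f_best, w_best, gap0 = math.inf, None, None
    for alpha, w in iterates:
        fP, fD = primal_obj(w), dual_obj(alpha)
        if fP < f_best:                 # pocket: remember the best model
            f_best, w_best = fP, w
        gap = fD + fP                   # duality gap, since f^P(w*) = -f(alpha*)
        gap0 = gap if gap0 is None else gap0
        if gap <= eps * gap0:           # criterion of Section 3.5
            break
    return w_best
```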

3.6 Cost per Iteration

Here we summarize the cost per iteration of our algorithm. We will individually discuss the time spent on forming the quadratic problem using (15), solving (8), conducting the line search, and maintaining the best primal solution, and then summarize them together. For the ease of description and analysis, we assume that the data entries are distributed evenly across machines. Namely, we assume that the number of columns of $X$ on each machine is $O(N/K)$, and each machine has $O(\#\mathrm{nnz}/K)$ nonzero entries, where $\#\mathrm{nnz}$ is the number of non-zero elements in $X$. We will separate the costs for computation and for communication in our discussion.

In our cost analysis below, we do not make the assumptions of low computational cost for $f$ and $g^*$ that result from special cases such as $g(w) = \|w\|^2/2$ discussed above. Instead, we only assume that the computation of $g^*(v)$ and $\nabla g^*(v)$ costs $O(n)$, as further acceleration is a case-by-case possibility depending on the specific function structure. Therefore, as we do not assume any further structure of the problem, we only discuss the backtracking line search variant. The cost of the $\nabla^2 g^*(v)$ part is assumed to be negligible, for otherwise we simply replace it with a multiple of the identity matrix, as discussed in Section 3.1. We also assume that the cost of evaluating one $\xi_i^*$ is proportional to the dimension of its domain, namely $O(c_i)$, so the evaluation of $\xi^*$ costs $O(N/K)$ on each machine.

We first check the cost of forming the problem (8). Note that we do not explicitly compute the values of $B_t$ and $\nabla G^*(\alpha^t)$. Instead, we only compute $\nabla G^*(\alpha^t)^T \Delta\alpha^t$ through $\nabla g^*(v^t)^T \Delta v^t$, and the part $(\Delta\alpha)^T B_t \Delta\alpha$ under the choice (15) is obtained through $\|\Delta\alpha\|^2$ and $(\Delta v^t)^T \nabla^2 g^*(v^t) \Delta v^t$. Therefore, we only need to compute $\nabla g^*(v^t)$ in this step, which costs $O(n)$ under our assumption, given that $v^t$ is already available on all machines. Calculating $\nabla g^*(v^t)$ from $v^t$ costs $O(n)$, and by our assumption it takes the same effort to get $\nabla^2 g^*(v^t)$. Thus forming the problem (8) costs $O(n)$ in computation, and no communication is involved.

Next, the cost of solving (8) is $O(\#\mathrm{nnz}/K)$, as noted in most state-of-the-art single-core optimization methods for the dual ERM problem, for example, Hsieh et al. (2008); Yu et al. (2011), which is the cost of passing through the local data a constant number of times. This part also involves no communication.

For the line search part, as discussed in Section 3.2, we first need to make $\Delta v^t$ available on all machines. The calculation of $\Delta v^t$ through (17) costs $O(\#\mathrm{nnz}/K)$ computation time, and since the vector is of size $n$, it also takes $O(n)$ communication cost to gather information from all machines. Note that usually the value $\Delta v^t$ is also maintained throughout the process of solving the sub-problem (8), so it might not require additional cost to recompute it. Since here the asymptotic complexity is the same as that of solving the sub-problem, whether we count it or not does not affect the result. Now since $\Delta v^t$ is available on all machines, and we have $\nabla g^*(v^t)$ from the step of forming the sub-problem, we can calculate the first term of (18) in $O(n/K)$ computational cost and $O(1)$ communication cost. The term related to $\xi^*$ is a sum over $l$ individual functions and therefore costs $O(l/K)$, but it usually requires no additional computation as it can easily be maintained in the process of solving the sub-problems. Then summing them up requires an $O(1)$ communication that can be combined with the communication for obtaining $\nabla g^*(v^t)^T \Delta v^t$. Given $v^t$ and $\Delta v^t$, for each evaluation of $f$ under a different $\eta$, it takes only $O(n)$ to compute $v^t + \eta \Delta v^t$ and to evaluate the corresponding $g^*$, while the $\xi^*$ part costs only $O(N/K)$ in computation as it is a sum over $l$ individual functions, and the required communication is $O(1)$ to sum the function values up. Therefore, each backtracking line search iteration costs $O(n + N/K)$ computation and $O(1)$ communication.

Finally, from (6), the vector $w(\alpha^t)$ is the same as the gradient vector we need in (8), so there is no additional cost to obtain the primal iterate, and evaluating the primal objective costs $O(n)$ for $g(w(\alpha^t))$ and $O(\#\mathrm{nnz}/K)$ for $X^T w(\alpha^t)$ in computation. Thus the cost of the primal objective computation is $O(\#\mathrm{nnz}/K + n)$. It also takes $O(1)$ communication to gather the summation over the $\xi_i$.

By assuming that each row and each column of $X$ has at least one non-zero entry (for otherwise we can simply remove that row or column), we have $n + N = O(\#\mathrm{nnz})$. Thus in summary, each iteration of Algorithm 2 costs

$$ O\left( \frac{\#\mathrm{nnz}}{K} + n + \left( \frac{N}{K} + n \right) \cdot \#(\text{line search steps}) \right) $$

in computation and

$$ O\left( n + \#(\text{line search steps}) \right) $$

in communication. We will show in the next section that the number of line search steps is upper-bounded by a constant, so the overall cost per iteration is $O(\#\mathrm{nnz}/K + n)$ in computation and $O(n)$ in communication.
4. Analysis

In this section, we provide theoretical worst-case guarantees for our algorithm. Note that these worst cases barely happen empirically, and the analysis only serves as a certificate that even in these cases our algorithm converges with a certain rate, not as a suggestion that our algorithm always converges as slowly.

The focus of this work is not to tune the choice of $B_t$ to obtain the best theoretical results, which would reduce the algorithm to a simple proximal gradient method, but to propose a practical algorithm that works well empirically and has a worst-case guarantee. Therefore, the theoretical results may seem no better than those of the proximal gradient method, but we will show in Section 7 that the practical behavior of our algorithm is better than the state of the art. The reason is that existing iteration complexity analyses are all determined by the one iteration that has the worst possible progress in theory over the whole optimization procedure, and do not depict the overall behavior of an optimization method, which barely encounters the worst case in reality.

We start by showing that the update direction is indeed a descent one, and that the backtracking line search procedure terminates within a bounded number of steps.

Lemma 4.1 Given the current iterate $\alpha^t$, if $B_t$ is chosen so that $Q^{\alpha^t}_{B_t}$ is $c_2$-strongly convex for some $c_2 > 0$ and $c_1 + c_2 > 0$, where $c_1$ is the smallest eigenvalue of $B_t$, then the solution $\Delta\alpha^t$ obtained by solving (8) is a descent direction for $f$ at $\alpha^t$, namely $\Delta_t < 0$. In particular,

$$ \Delta_t \le -\frac{c_1 + c_2}{2} \|\Delta\alpha^t\|^2 \le 0. \qquad (22) $$

Moreover, the backtracking line search procedure in Algorithm 1 generates a step size satisfying (9) within $\max\left(0, \log_{\beta}\left( (1-\tau)\sigma(c_1 + c_2) / \|X^T X\| \right)\right)$ iterations, and the step size is lower bounded by

$$ \eta_t \ge \min\left( 1, \frac{\beta (1-\tau) \sigma (c_1 + c_2)}{\|X^T X\|} \right). \qquad (23) $$

Proof For any $\lambda \in [0, 1]$, we have

$$ Q^{\alpha^t}_{B_t}(\Delta\alpha^t) \le Q^{\alpha^t}_{B_t}(\lambda \Delta\alpha^t) \le \lambda Q^{\alpha^t}_{B_t}(\Delta\alpha^t) + (1 - \lambda) Q^{\alpha^t}_{B_t}(0) - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2 $$
$$ = \lambda \left( \nabla G^*(\alpha^t)^T \Delta\alpha^t + \frac{1}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) \right) + (1 - \lambda) \xi^*(-\alpha^t) - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2, $$

where in the first inequality we used that $\Delta\alpha^t$ is the minimizer of (8), and in the second one we used the strong convexity of $Q^{\alpha^t}_{B_t}$ (see (4)). Rearranging this inequality gives us

$$ (1 - \lambda) \left( \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \right) \le -\frac{1 - \lambda}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t - \frac{c_2 \lambda (1 - \lambda)}{2} \|\Delta\alpha^t\|^2. $$

Dividing both sides by $(1 - \lambda)$, we get

$$ \nabla G^*(\alpha^t)^T \Delta\alpha^t + \xi^*(-\alpha^t - \Delta\alpha^t) - \xi^*(-\alpha^t) \le -\frac{1}{2} (\Delta\alpha^t)^T B_t \Delta\alpha^t - \frac{c_2 \lambda}{2} \|\Delta\alpha^t\|^2 \le -\frac{c_1 + \lambda c_2}{2} \|\Delta\alpha^t\|^2. $$

Letting $\lambda \to 1$, we obtain (22). Note that in the last inequality of (22), equality holds if and only if $\Delta\alpha^t = 0$. This implies $\nabla G^*(\alpha^t)^T d + \xi^*(-\alpha^t - d) - \xi^*(-\alpha^t) > 0$ for all $d \neq 0$, meaning that $\alpha^t$ is the optimum of (2). Therefore, (22) indicates that $\Delta\alpha^t$ is a descent direction when the current iterate is not optimal, because from the Taylor expansion of $G^*$ we have

$$ f(\alpha^t + \eta \Delta\alpha^t) - f(\alpha^t) = \eta \nabla G^*(\alpha^t)^T \Delta\alpha^t + O(\eta^2) + \xi^*(-(\alpha^t + \eta \Delta\alpha^t)) - \xi^*(-\alpha^t) $$
$$ \le \eta \nabla G^*(\alpha^t)^T \Delta\alpha^t + O(\eta^2) + \eta \left( \xi^*(-(\alpha^t + \Delta\alpha^t)) - \xi^*(-\alpha^t) \right) = \eta \Delta_t + O(\eta^2), $$

which implies that when $\eta$ is small enough and positive, $f(\alpha^t + \eta \Delta\alpha^t)$ will be smaller than $f(\alpha^t)$.

Regarding the backtracking termination, from the $(1/\sigma)$-Lipschitz continuity of $\nabla g^*$ mentioned in Section 2 and convexity, we have that for any $\eta \in [0, 1]$ and any vector $d$,

$$ f(\alpha^t + \eta d) - f(\alpha^t) = G^*(\alpha^t + \eta d) - G^*(\alpha^t) + \xi^*(-\alpha^t - \eta d) - \xi^*(-\alpha^t) $$
$$ \le \eta \left( \nabla G^*(\alpha^t)^T d + \xi^*(-\alpha^t - d) - \xi^*(-\alpha^t) \right) + \frac{\eta^2}{2\sigma} \|X d\|^2. \qquad (24) $$

Taking $d = \Delta\alpha^t$ in (24), we see from (22) that it suffices to have $\eta \in [0, 1]$ such that

$$ \eta \Delta_t + \frac{\eta^2 \|X^T X\|}{2\sigma} \|\Delta\alpha^t\|^2 \le \eta \Delta_t - \frac{\eta^2 \|X^T X\|}{\sigma (c_1 + c_2)} \Delta_t \le \eta \tau \Delta_t \qquad (25) $$

to satisfy the Armijo rule (9). We directly see that (25) holds for any $\eta \in [0, 1]$ satisfying

$$ \eta \le \frac{(1 - \tau) \sigma (c_1 + c_2)}{\|X^T X\|}. $$

Therefore, under Algorithm 1, the generated step size is guaranteed to satisfy

$$ \eta_t \ge \min\left( 1, \frac{\beta (1 - \tau) \sigma (c_1 + c_2)}{\|X^T X\|} \right). $$

The lower bound on the step size stated in Lemma 4.1 is proportional to the inverse of the Lipschitz constant for $f$ and can be very small in theory. However, the reader should note that this is a worst-case guarantee that may be much smaller than the real step sizes obtained by Algorithm 1 on real-world problems. Later, in Section 7.3, we show empirically that the obtained step sizes are indeed mostly close to 1. In addition, as discussed in Section 3.2, the cost of backtracking line search is negligible. Therefore, even if the worst case happened, the time spent on backtracking line search would not affect the overall efficiency of our algorithm. This is verified empirically in Section 7.3 as well.

We now establish convergence for the dual problem (2), and will use this result to deduce the convergence of the primal iterates obtained by (6) for (1), which is the kind of convergence we really care about when solving ERM problems. We consider the case that, in addition to Assumption 2.1, either of the following assumptions holds.

Assumption 4.2 The function $\xi$ is differentiable and its gradient is $\rho$-Lipschitz continuous for some $\rho > 0$. That is,

$$ \|\nabla \xi(z_1) - \nabla \xi(z_2)\| \le \rho \|z_1 - z_2\|, \quad \forall z_1, z_2. $$

Assumption 4.3 The loss function $\xi$ is $L$-Lipschitz continuous for some $L > 0$:

$$ |\xi(z_1) - \xi(z_2)| \le L \|z_1 - z_2\|, \quad \forall z_1, z_2. $$

Moreover, the dual problem satisfies the Kurdyka-Łojasiewicz inequality with exponent $1/2$ (Łojasiewicz, 1963, 1993; Kurdyka, 1998) for some $\mu > 0$. That is,

$$ f(\alpha) - f^* \le \min_{\hat{s} \in \partial f(\alpha)} \frac{\|\hat{s}\|^2}{2\mu} = \min_{s \in \partial \xi^*(-\alpha)} \frac{\|\nabla G^*(\alpha) + s\|^2}{2\mu}, \quad \forall \alpha \in \Omega, \qquad (26) $$

where $f^*$ is the optimal objective value of the dual problem (2), and $\partial \xi^*(-\alpha)$ is the subdifferential of $\xi^*$ at $-\alpha$.

These assumptions, made on the sum $\xi$, are less strict than demanding that each $\xi_i$ satisfy certain properties. Note that from (Hiriart-Urruty and Lemaréchal, 2001, Part E, Theorem 4.2.1), when Assumption 4.2 holds, $\xi^*$ is $(1/\rho)$-strongly convex. We note that the conditions in Lemma 4.1 are weaker than those of most proximal Newton-type methods such as Tseng and Yun (2009); Lee et al. (2014), as we do not need $B_t$ to be positive definite, in which case $Q^{\alpha^t}_{B_t}$ is still strongly convex when Assumption 4.2 holds. In this situation we can have a broader choice of $B_t$. In Lemma 4.1, consider the choice of $B_t$ in (15): since $\tilde{H}_{\alpha^t}$ is positive semidefinite, we have $c_1 = a_2^t$. For $c_2$, if Assumption 4.2 holds, then since $\xi^*$ is $(1/\rho)$-strongly convex, we have $c_2 = c_1 + 1/\rho$, and otherwise $c_2 = c_1$.

We first show that Assumption 4.2 implies that (26) holds for the dual problem, by noting that, as discussed above, this assumption implies that the dual objective is strongly convex.

Lemma 4.4 If $f$ is $\mu$-strongly convex as defined in (4), then it satisfies (26) with the same $\mu$.

Proof For any $\lambda \in (0, 1]$ and any $\alpha_1, \alpha_2$, from the definition of strong convexity we have

$$ f(\alpha_1) - f(\alpha_2) - \frac{\mu (1 - \lambda)}{2} \|\alpha_1 - \alpha_2\|^2 \ge \frac{f(\lambda \alpha_1 + (1 - \lambda) \alpha_2) - f(\alpha_2)}{\lambda} \ge s^T (\alpha_1 - \alpha_2) $$

for all $s \in \partial f(\alpha_2)$, where the last inequality is from the convexity of $f$. Letting $\lambda \to 0^+$, we have

$$ f(\alpha_1) - f(\alpha_2) \ge s^T (\alpha_1 - \alpha_2) + \frac{\mu}{2} \|\alpha_1 - \alpha_2\|^2, \quad \forall s \in \partial f(\alpha_2). \qquad (27) $$

Now we fix $\alpha_2 = \alpha \in \Omega$ and minimize both sides of (27) with respect to $\alpha_1$ to get

$$ f^* - f(\alpha) \ge -\frac{\|s\|^2}{2\mu}, \quad \forall s \in \partial f(\alpha). $$

Since this holds for all $s \in \partial f(\alpha)$, we have proven (26) with parameter $\mu$, because $\partial f(\alpha) = \nabla G^*(\alpha) + \partial \xi^*(-\alpha)$.

We need the following definition in our convergence discussion.

Definition 4.5 Given any optimization problem

$$ \min_{x \in \mathcal{X}} f(x) \qquad (28) $$

whose minimum is attainable and denoted by $f^*$, we say that $x \in \mathcal{X}$ is an $\epsilon$-accurate solution for (28) if $f(x) - f^* \le \epsilon$.

Now we are ready to show the global linear convergence of our algorithm for solving (2).

Theorem 4.6 If (26) holds with $\mu > 0$ for the objective of (2), there exists $c_3 > 0$ such that $\|B_t\| \le c_3$ for all $t$, and the conditions in Lemma 4.1 are also satisfied for all iterations with some $c_1 \ge -c_3$ and $c_2$, then the iterates generated by Algorithm 2 converge Q-linearly, and to obtain an $\epsilon$-accurate solution for (2), it takes at most

$$ \max\left\{ O\left( \frac{ \left( \|X^T X\| / \sigma \right)^2 + c_3^2 }{ \mu (c_1 + c_2) \tau } \log\frac{1}{\epsilon} \right), \; O\left( \frac{ \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) \|X^T X\| }{ \sigma \mu (c_1 + c_2)^2 \beta \tau (1 - \tau) } \log\frac{1}{\epsilon} \right) \right\} $$

iterations.

Proof We first show the result for the backtracking line search variant. From (7), (22), and (9), we have

$$ f(\alpha^{t+1}) - f(\alpha^t) \le -\eta_t \tau \frac{c_1 + c_2}{2} \|\Delta\alpha^t\|^2. \qquad (29) $$

From the optimality of $\Delta\alpha^t$ in (8), we get

$$ \nabla G^*(\alpha^t) + B_t \Delta\alpha^t + s^{t+1} = 0 \qquad (30) $$

for some $s^{t+1} \in \partial \xi^*(-\alpha^t - \Delta\alpha^t)$. By convexity, the fact that the step size is in $[0, 1]$, and the condition (26), we have

$$ f(\alpha^{t+1}) - f^* \le \eta_t \left( f(\alpha^t + \Delta\alpha^t) - f^* \right) + (1 - \eta_t) \left( f(\alpha^t) - f^* \right) $$
$$ \le \eta_t \frac{\|\nabla G^*(\alpha^t + \Delta\alpha^t) + s^{t+1}\|^2}{2\mu} + (1 - \eta_t) \left( f(\alpha^t) - f^* \right). \qquad (31) $$

Now, to relate the first term to the decrease, we use (30) to get

$$ \|\nabla G^*(\alpha^t + \Delta\alpha^t) + s^{t+1}\|^2 = \|\nabla G^*(\alpha^t + \Delta\alpha^t) - \nabla G^*(\alpha^t) + \nabla G^*(\alpha^t) + s^{t+1}\|^2 $$
$$ \le 2 \|\nabla G^*(\alpha^t + \Delta\alpha^t) - \nabla G^*(\alpha^t)\|^2 + 2 \|B_t \Delta\alpha^t\|^2 \le 2 \left( \left( \frac{\|X^T X\|}{\sigma} \right)^2 + \|B_t\|^2 \right) \|\Delta\alpha^t\|^2, \qquad (32) $$

where in the second inequality we used $(a + b)^2 \le 2(a^2 + b^2)$ for all $a, b$, and in the last inequality we used the Lipschitz continuity of $\nabla G^*$. We therefore get the following by combining (31), (32), and (29):

$$ f(\alpha^{t+1}) - f^* \le \frac{\eta_t}{\mu} \left( \left( \frac{\|X^T X\|}{\sigma} \right)^2 + c_3^2 \right) \|\Delta\alpha^t\|^2 + (1 - \eta_t) \left( f(\alpha^t) - f^* \right) $$
$$ \le \frac{ 2 \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) }{ \mu (c_1 + c_2) \tau } \left( f(\alpha^t) - f(\alpha^{t+1}) \right) + (1 - \eta_t) \left( f(\alpha^t) - f^* \right). \qquad (33) $$

Let us define

$$ c_4 := \frac{ 2 \left( \left( \|X^T X\| / \sigma \right)^2 + c_3^2 \right) }{ \mu (c_1 + c_2) \tau }; $$

then rearranging (33) gives

$$ f(\alpha^{t+1}) - f^* \le \frac{1 - \eta_t + c_4}{1 + c_4} \left( f(\alpha^t) - f^* \right). \qquad (34) $$

Combining the above result with the lower bound on $\eta_t$ from (23) proves the global Q-linear convergence, and the number of iterations needed to get an $\epsilon$-accurate solution can be directly deduced from (34). Now consider the exact line search variant. This variant results in an objective no larger than the left-hand side of (34), so all the results hold for the exact line search variant as well, finishing the proof.

From Lemma 4.4, both Assumption 4.2 and Assumption 4.3 satisfy the conditions required in Theorem 4.6. We note that in the final expression (34), a larger step size suggests faster convergence for a fixed $c_4$. Therefore, it makes sense to conduct a line search to try to find a larger $\eta_t$ instead of using a fixed and conservative one that guarantees (9) can be satisfied.
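To see how (34) yields the iteration bound in Theorem 4.6, one can unroll the recursion with the constant step-size lower bound $\bar{\eta} := \min\{1, \beta(1-\tau)\sigma(c_1 + c_2)/\|X^T X\|\}$ from (23) (a standard calculation, spelled out here for clarity):

$$ f(\alpha^T) - f^* \le \left( 1 - \frac{\bar{\eta}}{1 + c_4} \right)^T \left( f(\alpha^0) - f^* \right) \le e^{-T \bar{\eta} / (1 + c_4)} \left( f(\alpha^0) - f^* \right), $$

so $T \ge \frac{1 + c_4}{\bar{\eta}} \log \frac{f(\alpha^0) - f^*}{\epsilon}$ iterations suffice; substituting the two cases of $\bar{\eta}$ and the definition of $c_4$ gives the two terms in the max of Theorem 4.6.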

On the other hand, one can tune the value of $c_4$ by selecting different $B_t$, but we should note that a $B_t$ that makes $c_1$ and $c_2$ larger may also make $c_3$ larger, and the contribution of $c_3$ to $c_4$ is quadratic, so the choice should be made cautiously.

However, the reader should note that the results in Theorem 4.6 are just worst-case iteration complexities, and empirically we often observe much faster convergence. In particular, methods of this type, which consider a quadratic approximation of the smooth term, normally have global iteration complexities worse even than that of the non-accelerated gradient method, as the proof techniques (implicitly) utilize the relation between the obtained update direction and the direction used in the gradient method. Moreover, this sort of iteration complexity result is controlled by the worst-case function value decrease in a single iteration, so it is not a good predictor of the overall behavior. For example, the global iteration complexities of the Newton method and of quasi-Newton methods are all worse than those of first-order methods in theory, as they only have better local convergence rates, but in practice they are much faster because the worst case is extremely rare in real-world problems. The same holds true for distributed optimization methods for (2). For example, the iteration complexity shown in Ma et al. (2017) is no better than that of a plain distributed implementation of the proximal gradient method, but the empirical performance is much better. Similarly, if we want to optimize $c_4$, then the best choice for (15) is $a_1^t = 0$, $a_2^t = O(\|X^T X\| / \sigma)$, which reduces our algorithm to the proximal gradient method. However, as one can expect, if the choice of $B_t$ is closer to the real Hessian of $G^*$, then the empirical convergence should be faster, as the approximation better models the real function-value decrease.

The next results link the above linear convergence result for the dual problem to the convergence rate of the primal problem (1).

Theorem 4.7 If Assumption 4.2 holds, then for any $\epsilon > 0$ and any $\epsilon$-accurate solution $\alpha$ for (2), the corresponding $w$ obtained through (6) is $\left( \epsilon (1 + \rho \|X^T X\| / \sigma) \right)$-accurate. If Assumption 4.3 holds, then for any $\epsilon > 0$ and any $\epsilon$-accurate solution $\alpha$ for (2), the corresponding $w$ obtained through (6) is $\left( \min\left\{ 4 \|X^T X\| L^2 / \sigma, \sqrt{8 \epsilon \|X^T X\| L^2 / \sigma} \right\} \right)$-accurate.

Proof Our proof consists of using $\alpha$ as the initial point, applying one step of some primal-dual algorithm, and then utilizing the algorithm-specific relation between the decrease in one iteration and the duality gap to obtain the bound. Finally, we notice that the decrease of the dual objective in one iteration of any algorithm is upper-bounded by the distance from the current objective to the optimum, and that the primal sub-optimality is upper-bounded by the duality gap. Therefore we obtain an algorithm-independent result from algorithm-specific results.

When Assumption 4.2 holds, the primal problem is of the type considered in Shalev-Shwartz and Zhang (2012), and we have that $\xi^*$ is $(1/\rho)$-strongly convex. If we take $\alpha$ as the initial point and apply one step of their method to obtain the next iterate $\alpha^+$, from (Shalev-Shwartz and Zhang, 2012, Lemma 1) we get

$$ \epsilon \ge f(\alpha) - f(\alpha^*) \ge f(\alpha) - f(\alpha^+) \ge s \left( f^P(w(\alpha)) + f(\alpha) \right) - \frac{s^2 G_s}{2\sigma} \ge s \left( f^P(w(\alpha)) - f^P(w^*) \right) - \frac{s^2 G_s}{2\sigma}, \quad \forall s \in [0, 1], \qquad (35) $$

where $w^*$ is the optimal solution of (1),

$$ G_s := \left( \|X^T X\| - \frac{\sigma (1 - s)}{s \rho} \right) \|u - \alpha\|^2, $$

and $u_i := \nabla \xi_i(X_i^T w(\alpha))$.

To remove the second term in (35), we set

$$ \|X^T X\| - \frac{\sigma (1 - s)}{s \rho} = 0 \quad \Rightarrow \quad s = \frac{\sigma}{\sigma + \rho \|X^T X\|} \in [0, 1]. $$

This then gives

$$ \left( 1 + \frac{\rho \|X^T X\|}{\sigma} \right) \epsilon \ge f^P(w(\alpha)) - f^P(w^*). $$

Note that in (Shalev-Shwartz and Zhang, 2012, Lemma 1), the result is for the expected value of the dual objective decrease at the current iteration and the expected duality gap of the previous iteration. However, for the initial point, the expected duality gap is a constant, and the expected function decrease cannot exceed the distance from the current objective to the optimum.

When Assumption 4.3 holds, the primal problem falls into the type of problems discussed in Bach (2015). If we take $\alpha$ as the initial point and apply one step of their method to obtain the next iterate $\alpha^+$, from the last inequality in the proof of Proposition 4.2 in Bach (2015) and weak duality,

$$ \epsilon \ge s \left( f^P(w(\alpha)) - f^P(w^*) \right) - \frac{(s R)^2}{2\sigma}, \quad \forall s \in [0, 1], \qquad (36) $$

where

$$ R^2 = \max_{\alpha, \hat{\alpha} \in \Omega} \|X (\alpha - \hat{\alpha})\|^2 \le \|X^T X\| \max_{\alpha, \hat{\alpha} \in \Omega} \|\alpha - \hat{\alpha}\|^2 = 4 \|X^T X\| L^2. \qquad (37) $$

In the last equality we used a corollary of Rockafellar (1970) stating that if $\phi(\cdot)$ is $L$-Lipschitz continuous, then the radius of $\operatorname{dom}(\phi^*)$ is no larger than $L$. The right-hand side of (36) is concave with respect to $s$; hence we can obtain its maximum by setting the partial derivative with respect to $s$ to zero. Denoting the maximizer by $\hat{s}$, this gives

$$ \hat{s} = \arg\max_{s \in [0, 1]} \; s \left( f^P(w) - f^P(w^*) \right) - \frac{(s R)^2}{2\sigma} = \min\left\{ 1, \frac{\sigma \left( f^P(w) - f^P(w^*) \right)}{R^2} \right\}. $$

If $\hat{s} = 1$, we have

$$ R^2 \le \sigma \left( f^P(w) - f^P(w^*) \right), $$

and thus

$$ f^P(w) - f^P(w^*) \le \frac{R^2}{\sigma}. $$

On the other hand, if $\hat{s} \neq 1$, we get

$$ 2 R^2 \epsilon \ge \sigma \left( f^P(w) - f^P(w^*) \right)^2. $$

These conditions and (37) indicate that

$$ f^P(w) - f^P(w^*) \le \min\left\{ \frac{R^2}{\sigma}, \sqrt{\frac{2 \epsilon R^2}{\sigma}} \right\} \le \min\left\{ \frac{4 \|X^T X\| L^2}{\sigma}, \sqrt{\frac{8 \epsilon \|X^T X\| L^2}{\sigma}} \right\}. $$

Corollary 4.8 If we apply Algorithm 2 to solve a regularized ERM problem that satisfies either Assumption 4.2 or Assumption 4.3, then the primal iterates $w^t$ obtained from the dual iterates $\alpha^t$ via (6) converge R-linearly.

Proof If Assumption 4.2 holds, Theorem 4.6 shows that the dual objective converges Q-linearly. Therefore, given any $\epsilon > 0$, from Theorem 4.7 it suffices to have an $\left( \epsilon / (1 + \rho \|X^T X\| / \sigma) \right)$-accurate dual solution to obtain an $\epsilon$-accurate primal solution, which takes

$$ O\left( \log \frac{\sigma + \rho \|X\|^2}{\sigma \epsilon} \right) = O\left( \log \frac{1}{\epsilon} \right) $$

iterations, showing the R-linear convergence of the primal iterates. Note that here we omit all factors independent of $\epsilon$ to show the linear rate with respect to $\epsilon$, while the exact number of iterations needed can be calculated by combining Theorems 4.6 and 4.7. On the other hand, if Assumption 4.3 holds, we assume $\epsilon < 4 \|X^T X\| L^2 / \sigma$, for otherwise it does not take any iteration to get an $\epsilon$-accurate solution for the primal problem, since $\alpha = 0$ already gives the required accuracy. Therefore, from Theorem 4.7 we need an $O(\epsilon^2)$-accurate dual solution to make the corresponding primal solution $\epsilon$-accurate. Thus by Theorem 4.6, we need

$$ O\left( \log \frac{1}{\epsilon^2} \right) = O\left( 2 \log \frac{1}{\epsilon} \right) = O\left( \log \frac{1}{\epsilon} \right) $$

iterations.

The results in Theorem 4.7 and Corollary 4.8 are implied by existing works (Bach, 2015; Shalev-Shwartz and Zhang, 2012), and the calculations take very little effort. It is also not too difficult to obtain sublinear rates following similar techniques for problems not satisfying (26), but we omit them to simplify and focus our description.

Our analysis above discusses the situation of exactly minimizing (8) at each round, while finding the exact solution when $B_t$ is not diagonal may be impractical. However, most algorithms that one would use to solve (8) attain global linear convergence provided that the objective function is strongly convex. Therefore, by the convergence rate of the algorithm applied to solve (8), we can ensure that the problem is solved at least to the extent that the objective is smaller than a certain negative value depending on the optimal objective value of (8). One can then ensure that in this case $\Delta_t$ is still bounded by some quantity related to the norm of the obtained update direction, and thus the approximate solution is still a descent direction. Hence we can find another matrix $B_t$ such that the update direction is the solution of (8) with this matrix, because the degree of freedom of the matrix is large enough; therefore the conditions in Lemma 4.1 and Theorem 4.6 are still satisfied. The key requirement here is that we need to control how exactly or how loosely (8) is solved at different iterations, in order to have $B_t$ lower- and upper-bounded over iterations so that we can find $c_1, c_2, c_3$. We omit the proof details of these cases, as the convergence theorems are just to provide a linear convergence guarantee regardless of the parameters in the worst case, and we also observe from the experiments that the empirical convergence speed is much faster than the theoretical worst-case one.

5. Related Works

Our algorithm can be viewed from two different aspects. If we simply consider solving (8), then it is similar to proximal (quasi-)Newton methods, with some specific choice of the second-order approximation. A generalization of it appears as the block coordinate descent method (Tseng and Yun, 2009), of which the proximal quasi-Newton method is the special case with only one block of variables. One thing worth noticing is that Tseng and Yun (2009) require the matrix in (8) to be positive definite, with a positive lower bound on the smallest eigenvalue over all iterations.
We relax this condition to allow $B_t$ to be indefinite or positive semidefinite when the $\xi^*$ term is strongly convex. This relaxation is used when Assumption 4.2 holds, and in this case we do not need to add a damping term to our second-order approximation; namely, we can set $a_2^t \equiv 0$ in (15).

On the other hand, our focus is on how to devise a good approximation of the Hessian matrix of the smooth term that works efficiently for distributed optimization. Works focusing on this direction for dual ERM problems include Pechyony et al. (2011); Yang (2013); Ma et al. (2017).

The work of Pechyony et al. (2011) discusses how to solve the SVM dual problem distributedly. This problem is a special case of (2); see Section 6.1 for more details. They proposed a method called DSVM-AVE that iteratively solves (8), with $B_t$ defined by (15) with $a_1^t \equiv 1$, $a_2^t \equiv 0$, to obtain the update direction, while the step size $\eta_t$ is fixed to $1/K$. Though they did not provide a theoretical convergence guarantee in Pechyony et al. (2011), we can see the reasoning behind this choice from (24). First, in the case of SVM, the objective is quadratic, with $\nabla^2 g^*(v) \equiv I$. Thus one can easily see that

$$ \|X d\|^2 \le K d^T \tilde{H}_{\alpha} d, \quad \forall d, \qquad (38) $$

with equality holding when $x_i$, $i = 1, \dots, N$, are identical and $d$ is a multiple of the vector of ones. Therefore, taking $\eta = 1/K$ in (24) and plugging in the bound (38), we can see that, since $\sigma = 1$ in the SVM case, minimizing (8) with a step size of $1/K$ leads to a decrease of the objective value.

In Yang (2013), the algorithm DisDCA is proposed to solve (2) under the same assumption that $g$ is strongly convex. They consider the case $c_i \equiv 1$ for all $i$, but the algorithm can be directly generalized to $c_i > 1$ easily. This method uses a specific algorithm, stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz and Zhang, 2013), to solve the local sub-problems, while the choice of $B_t$ is determined by the algorithm parameters. To solve the sub-problem on machine $k$, each time SDCA samples one entry $i_k$ from $J_k$ with replacement and minimizes the local objective with respect to $\alpha_{i_k}$. If each time machine $k$ samples $m_k$ entries, and we denote $m := \sum_{k=1}^{K} m_k$, then the first variant of DisDCA, called the basic variant in Yang (2013), picks $B_t$ in (8) as

$$ (B_t)_{i,j} = \begin{cases} \frac{m}{\sigma} x_i^T x_j & \text{if } x_i, x_j \text{ are from the same } X_k \text{ for some } k \text{ that is sampled}, \\ 0 & \text{else}, \end{cases} $$

and the step size is fixed to $1$. In this case, it is equivalent to splitting the data into $l$ blocks, where the minimization is conducted only with respect to the blocks sampled. If we let $I$ be the indices not sampled, then following the same reasoning as for (38), we have

$$ \|X d\|^2 \le d^T B_t d, \quad \forall d \text{ such that } d_I = 0, \qquad (39) $$

where equality holds when all $x_i$ are identical and $|I| = l - m$. Therefore, by combining (39) and (24), it is not hard to see that in this case minimizing $Q^{\alpha}_{B_t}$ directly results in a certain amount of function value decrease. The analysis in Yang (2013) then shows that the primal iterates $\{w^t\}$ obtained by substituting the dual iterates $\{\alpha^t\}$ into (6) converge linearly to the optimum when all $\xi_i$ have Lipschitz continuous gradients, and converge with an $O(1/\epsilon)$ iteration complexity when all $\xi_i$ are Lipschitz continuous, using proof techniques similar to those in Shalev-Shwartz and Zhang (2012). As we noted in Section 4, this is actually the same as showing the convergence rate of the dual objective and then relating it to the primal objective. The key difference from our analysis is that they do not assume the additional structure (26) of the dual problem when $\xi^*$ is not strongly convex, hence the sublinear rate.

The second approach in Yang (2013), called the practical variant, considers

$$ (B_t)_{i,j} = \begin{cases} \frac{K}{\sigma} x_i^T x_j & \text{if } \pi(i) = \pi(j), \\ 0 & \text{else}. \end{cases} $$


More information

Higher-Order Methods

Higher-Order Methods Higher-Order Methods Stephen J. Wright 1 2 Computer Sciences Department, University of Wisconsin-Madison. PCMI, July 2016 Stephen Wright (UW-Madison) Higher-Order Methods PCMI, July 2016 1 / 25 Smooth

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent

Lecture 5: Gradient Descent. 5.1 Unconstrained minimization problems and Gradient descent 10-725/36-725: Convex Optimization Spring 2015 Lecturer: Ryan Tibshirani Lecture 5: Gradient Descent Scribes: Loc Do,2,3 Disclaimer: These notes have not been subjected to the usual scrutiny reserved for

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Cutting Plane Training of Structural SVM

Cutting Plane Training of Structural SVM Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs

More information

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč

FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Accelerating SVRG via second-order information

Accelerating SVRG via second-order information Accelerating via second-order information Ritesh Kolte Department of Electrical Engineering rkolte@stanford.edu Murat Erdogdu Department of Statistics erdogdu@stanford.edu Ayfer Özgür Department of Electrical

More information

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J

r=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J 7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured

More information

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725

Gradient Descent. Ryan Tibshirani Convex Optimization /36-725 Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Descent methods. min x. f(x)

Descent methods. min x. f(x) Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

Unconstrained minimization of smooth functions

Unconstrained minimization of smooth functions Unconstrained minimization of smooth functions We want to solve min x R N f(x), where f is convex. In this section, we will assume that f is differentiable (so its gradient exists at every point), and

More information

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Newton s Method. Ryan Tibshirani Convex Optimization /36-725 Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

A Magiv CV Theory for Large-Margin Classifiers

A Magiv CV Theory for Large-Margin Classifiers A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Homework 3. Convex Optimization /36-725

Homework 3. Convex Optimization /36-725 Homework 3 Convex Optimization 10-725/36-725 Due Friday October 14 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines

Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Journal of Machine Learning Research 9 (2008) 1369-1398 Submitted 1/08; Revised 4/08; Published 7/08 Coordinate Descent Method for Large-scale L2-loss Linear Support Vector Machines Kai-Wei Chang Cho-Jui

More information

Optimization Methods for Machine Learning

Optimization Methods for Machine Learning Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725

Gradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725 Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Distributed Optimization with Arbitrary Local Solvers

Distributed Optimization with Arbitrary Local Solvers Industrial and Systems Engineering Distributed Optimization with Arbitrary Local Solvers Chenxin Ma, Jakub Konečný 2, Martin Jaggi 3, Virginia Smith 4, Michael I. Jordan 4, Peter Richtárik 2, and Martin

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

You should be able to...

You should be able to... Lecture Outline Gradient Projection Algorithm Constant Step Length, Varying Step Length, Diminishing Step Length Complexity Issues Gradient Projection With Exploration Projection Solving QPs: active set

More information

Lecture 25: November 27

Lecture 25: November 27 10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Accelerated Block-Coordinate Relaxation for Regularized Optimization

Accelerated Block-Coordinate Relaxation for Regularized Optimization Accelerated Block-Coordinate Relaxation for Regularized Optimization Stephen J. Wright Computer Sciences University of Wisconsin, Madison October 09, 2012 Problem descriptions Consider where f is smooth

More information

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization

Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä. New Proximal Bundle Method for Nonsmooth DC Optimization Kaisa Joki Adil M. Bagirov Napsu Karmitsa Marko M. Mäkelä New Proximal Bundle Method for Nonsmooth DC Optimization TUCS Technical Report No 1130, February 2015 New Proximal Bundle Method for Nonsmooth

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

1 Kernel methods & optimization

1 Kernel methods & optimization Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions

More information

Iterative Methods for Solving A x = b

Iterative Methods for Solving A x = b Iterative Methods for Solving A x = b A good (free) online source for iterative methods for solving A x = b is given in the description of a set of iterative solvers called templates found at netlib: http

More information

Training Support Vector Machines: Status and Challenges

Training Support Vector Machines: Status and Challenges ICML Workshop on Large Scale Learning Challenge July 9, 2008 Chih-Jen Lin (National Taiwan Univ.) 1 / 34 Training Support Vector Machines: Status and Challenges Chih-Jen Lin Department of Computer Science

More information

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09 Numerical Optimization 1 Working Horse in Computer Vision Variational Methods Shape Analysis Machine Learning Markov Random Fields Geometry Common denominator: optimization problems 2 Overview of Methods

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Nonlinear Optimization Methods for Machine Learning

Nonlinear Optimization Methods for Machine Learning Nonlinear Optimization Methods for Machine Learning Jorge Nocedal Northwestern University University of California, Davis, Sept 2018 1 Introduction We don t really know, do we? a) Deep neural networks

More information

Optimization methods

Optimization methods Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Newton s Method. Javier Peña Convex Optimization /36-725

Newton s Method. Javier Peña Convex Optimization /36-725 Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1

SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a

More information

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines

SMO vs PDCO for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines vs for SVM: Sequential Minimal Optimization vs Primal-Dual interior method for Convex Objectives for Support Vector Machines Ding Ma Michael Saunders Working paper, January 5 Introduction In machine learning,

More information

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations

Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Newton Method with Adaptive Step-Size for Under-Determined Systems of Equations Boris T. Polyak Andrey A. Tremba V.A. Trapeznikov Institute of Control Sciences RAS, Moscow, Russia Profsoyuznaya, 65, 117997

More information

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,

On the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1, Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,

More information

Iteration Complexity of Feasible Descent Methods for Convex Optimization

Iteration Complexity of Feasible Descent Methods for Convex Optimization Journal of Machine Learning Research 15 (2014) 1523-1548 Submitted 5/13; Revised 2/14; Published 4/14 Iteration Complexity of Feasible Descent Methods for Convex Optimization Po-Wei Wang Chih-Jen Lin Department

More information

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4 Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:

More information

Line Search Methods for Unconstrained Optimisation

Line Search Methods for Unconstrained Optimisation Line Search Methods for Unconstrained Optimisation Lecture 8, Numerical Linear Algebra and Optimisation Oxford University Computing Laboratory, MT 2007 Dr Raphael Hauser (hauser@comlab.ox.ac.uk) The Generic

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

The Common-directions Method for Regularized Empirical Risk Minimization

The Common-directions Method for Regularized Empirical Risk Minimization The Common-directions Method for Regularized Empirical Risk Minimization Po-Wei Wang Department of Computer Science National Taiwan University Taipei 106, Taiwan Ching-pei Lee Department of Computer Sciences

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

A Second-Order Method for Strongly Convex l 1 -Regularization Problems

A Second-Order Method for Strongly Convex l 1 -Regularization Problems Noname manuscript No. (will be inserted by the editor) A Second-Order Method for Strongly Convex l 1 -Regularization Problems Kimon Fountoulakis and Jacek Gondzio Technical Report ERGO-13-11 June, 13 Abstract

More information

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal

An Evolving Gradient Resampling Method for Machine Learning. Jorge Nocedal An Evolving Gradient Resampling Method for Machine Learning Jorge Nocedal Northwestern University NIPS, Montreal 2015 1 Collaborators Figen Oztoprak Stefan Solntsev Richard Byrd 2 Outline 1. How to improve

More information

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression

A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent

More information

Solving the SVM Optimization Problem

Solving the SVM Optimization Problem Solving the SVM Optimization Problem Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian

More information

Dan Roth 461C, 3401 Walnut

Dan Roth   461C, 3401 Walnut CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn

More information

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN

No EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS. By E. den Boef, D. den Hertog. May 2004 ISSN No. 4 5 EFFICIENT LINE SEARCHING FOR CONVEX FUNCTIONS y E. den oef, D. den Hertog May 4 ISSN 94-785 Efficient Line Searching for Convex Functions Edgar den oef Dick den Hertog 3 Philips Research Laboratories,

More information

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

Written Examination

Written Examination Division of Scientific Computing Department of Information Technology Uppsala University Optimization Written Examination 202-2-20 Time: 4:00-9:00 Allowed Tools: Pocket Calculator, one A4 paper with notes

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction

More information

ARock: an algorithmic framework for asynchronous parallel coordinate updates

ARock: an algorithmic framework for asynchronous parallel coordinate updates ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,

More information