Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms

Jialei Wang, Lin Xiao

(Department of Computer Science, The University of Chicago, Chicago, Illinois 60637, USA. Microsoft Research, Redmond, Washington 98052, USA. Correspondence to: Jialei Wang <jialei@uchicago.edu>, Lin Xiao <lin.xiao@microsoft.com>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the authors.)

Abstract

We consider empirical risk minimization of linear predictors with convex loss functions. Such problems can be reformulated as convex-concave saddle-point problems and solved by primal-dual first-order algorithms. However, primal-dual algorithms often require explicit strongly convex regularization in order to obtain fast linear convergence, and the required dual proximal mapping may not admit a closed-form or efficient solution. In this paper, we develop both batch and randomized primal-dual algorithms that can exploit strong convexity from data adaptively and are capable of achieving linear convergence even without regularization. We also present dual-free variants of the adaptive primal-dual algorithms that do not need the dual proximal mapping, which are especially suitable for logistic regression.

1. Introduction

We consider the problem of regularized empirical risk minimization (ERM) of linear predictors. Let a_1, ..., a_n in R^d be the feature vectors of n data samples, phi_i : R -> R be a convex loss function associated with the linear prediction a_i^T x, for i = 1, ..., n, and g : R^d -> R be a convex regularization function for the predictor x in R^d. ERM amounts to solving the following convex optimization problem:

    min_{x in R^d}  P(x) := (1/n) sum_{i=1}^n phi_i(a_i^T x) + g(x).    (1)

This formulation covers many well-known classification and regression problems. For example, logistic regression is obtained by setting phi_i(z) = log(1 + exp(-b_i z)), where b_i in {+1, -1}. For linear regression problems, the loss function is phi_i(z) = (1/2)(z - b_i)^2, and we get ridge regression with g(x) = (lambda/2)||x||^2 and the elastic net with g(x) = lambda_1 ||x||_1 + (lambda_2/2)||x||^2.

Let A = [a_1, ..., a_n]^T be the n-by-d data matrix. Throughout this paper, we make the following assumptions:

Assumption 1. The functions phi_i, g and the matrix A satisfy:
- each phi_i is delta-strongly convex and (1/gamma)-smooth, where gamma > 0, delta >= 0, and gamma*delta <= 1;
- g is lambda-strongly convex, where lambda >= 0;
- lambda + delta*mu^2 > 0, where mu^2 = lambda_min(A^T A).

The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted ||x|| = sqrt(x^T x); see Nesterov (2004) for the exact definitions. We allow delta = 0, which simply means the phi_i are convex. Let R = max_i ||a_i||. Assuming lambda > 0, the quantity R^2/(gamma*lambda) is a popular definition of the condition number used in analyzing the complexity of different algorithms. The last condition above means that the primal objective function P(x) is strongly convex, even if lambda = 0.

There has been extensive research activity in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite-sum structure of the ERM problem have emerged as very competitive in terms of both theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.

Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2011; Nedic & Bertsekas, 2001) equipped with variance-reduction techniques. Each iteration of such algorithms processes only one data point a_i, with complexity O(d). They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity O((n + R^2/(gamma*lambda)) log(1/eps)) to find an eps-optimal solution. In fact, they are capable of exploiting strong convexity from data, meaning that the condition number R^2/(gamma*lambda) in the complexity can be replaced by the more favorable one R^2/(gamma*lambda + delta*mu^2/n). This improvement can be achieved without explicit knowledge of mu.
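To make the setup concrete, the ERM objective (1) can be written in a few lines of NumPy. The sketch below instantiates it for ridge regression (squared loss with l2 regularization); the data and function names are our own illustration, not code from the paper.

```python
import numpy as np

def ridge_erm_objective(x, A, b, lam):
    """P(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + (lam/2) * ||x||^2."""
    residual = A @ x - b                   # linear predictions minus targets
    loss = 0.5 * np.mean(residual ** 2)    # (1/n) sum_i phi_i(a_i^T x)
    reg = 0.5 * lam * np.dot(x, x)         # g(x) = (lam/2) * ||x||^2
    return loss + reg

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))          # n = 100 samples, d = 5 features
b = rng.standard_normal(100)
x = np.zeros(5)
print(ridge_erm_objective(x, A, b, lam=0.1))   # at x = 0 this is 0.5 * mean(b^2)
```

Swapping the squared loss for `log(1 + exp(-b_i * z))` gives the logistic-regression instance of (1).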

Dual algorithms solve the Fenchel dual of (1) by maximizing

    D(y) := -(1/n) sum_{i=1}^n phi_i*(y_i) - g*( -(1/n) sum_{i=1}^n y_i a_i )    (2)

using randomized coordinate ascent algorithms. Here phi_i* and g* denote the conjugate functions of phi_i and g. These methods include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtarik & Takac (2014). They have the same complexity O((n + R^2/(gamma*lambda)) log(1/eps)), but cannot exploit strong convexity from data, if any (i.e., when delta*mu^2 > 0).

Primal-dual algorithms solve the convex-concave saddle-point problem min_x max_y L(x, y), where

    L(x, y) := (1/n) sum_{i=1}^n ( y_i <a_i, x> - phi_i*(y_i) ) + g(x).    (3)

In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity O((n + sqrt(n R^2/(gamma*lambda))) log(1/eps)), which is better than the aforementioned non-accelerated complexity when R^2/(gamma*lambda) > n. Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear-predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle-point problems.

Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtarik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same O((n + sqrt(n R^2/(gamma*lambda))) log(1/eps)) complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data, because the minimum singular value mu of the data matrix A is very hard to estimate in general.

In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While these optimal settings depend on knowledge of the convexity parameter mu of the data, we develop adaptive variants of primal-dual algorithms that can tune the parameter automatically.
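The adaptive variants rely on evaluating both objectives in (1) and (2). For ridge regression, both the primal and dual objectives have closed forms, so the duality gap P(x) - D(y) can be checked numerically. The sketch below is our own illustration (hypothetical helper names), assuming the squared loss and g(x) = (lam/2)||x||^2:

```python
import numpy as np

def primal(x, A, b, lam):
    """P(x) for ridge regression, as in (1)."""
    n = A.shape[0]
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * (x @ x)

def dual(y, A, b, lam):
    """D(y) from (2): phi_i*(beta) = 0.5*beta^2 + b_i*beta, g*(u) = ||u||^2/(2*lam)."""
    n = A.shape[0]
    u = A.T @ y / n
    return -np.mean(0.5 * y ** 2 + b * y) - (u @ u) / (2.0 * lam)

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)
lam, n = 0.5, 50

# Optimal primal point, and the matching dual point y*_i = a_i^T x* - b_i
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(4), A.T @ b / n)
y_star = A @ x_star - b

gap = primal(x_star, A, b, lam) - dual(y_star, A, b, lam)  # ~0 at the saddle point
```

Weak duality guarantees P(x) >= D(y) for any pair (x, y), and the gap vanishes only at the saddle point, which is what makes it a usable progress monitor.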
Such adaptive schemes rely critically on the capability of primal-dual algorithms to evaluate the primal-dual optimality gap.

A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear-predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or by using an adaptation scheme.

Algorithm 1: Batch Primal-Dual (BPD)
    input: parameters sigma, tau, theta, initial point x_0 = xbar_0, y_0
    for t = 0, 1, 2, ... do
        y_{t+1} = prox_{sigma f*}( y_t + sigma A xbar_t )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)

2. Batch primal-dual algorithms

We first study batch primal-dual algorithms, by considering a batch version of the ERM problem (1),

    min_{x in R^d}  P(x) := f(Ax) + g(x),    (4)

where A in R^{n x d}. We make the following assumptions:

Assumption 2. The functions f, g and the matrix A satisfy:
- f is delta-strongly convex and (1/gamma)-smooth, where gamma > 0, delta >= 0, and gamma*delta <= 1;
- g is lambda-strongly convex, where lambda >= 0;
- lambda + delta*mu^2 > 0, where mu^2 = lambda_min(A^T A).

Using conjugate functions, we can derive the dual of (4) as

    max_{y in R^n}  D(y) := -f*(y) - g*(-A^T y),    (5)

and the convex-concave saddle-point formulation is

    min_{x in R^d} max_{y in R^n}  L(x, y) := g(x) + y^T A x - f*(y).    (6)

We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011; 2016) for solving the saddle-point problem (6), given in Algorithm 1, where prox_psi, for any convex function psi : R^n -> R union {+infinity}, is defined as

    prox_psi(beta) = arg min_{alpha in R^n} { psi(alpha) + (1/2) ||alpha - beta||^2 }.

Assuming that f is smooth and g is strongly convex, Chambolle & Pock (2011; 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if lambda > 0. However, they did not consider the case where an additional (or the sole) source of strong convexity comes from f(Ax). In the following theorem, we show how to set the parameters tau, sigma and theta to exploit both sources of strong convexity to achieve fast linear convergence.

Theorem 1. Suppose Assumption 2 holds and (x*, y*) is the unique saddle point of L defined in (6). Let L = ||A|| = sqrt(lambda_max(A^T A)). If we set the parameters in Algorithm 1 as

    sigma = (1/L) sqrt((lambda + delta*mu^2)/gamma),    tau = (1/L) sqrt(gamma/(lambda + delta*mu^2)),    (7)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - delta*mu^2 / ((delta + sigma) L^2)) * 1/(1 + tau*lambda),    theta_y = 1/(1 + sigma*gamma/2),    (8)

then we have

    (1/(2*tau) + lambda/2) ||x_t - x*||^2 + (gamma/4) ||y_t - y*||^2 <= theta^t C,
    L(x_t, y*) - L(x*, y_t) <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + gamma/4) ||y_0 - y*||^2.

The proof of Theorem 1 is given in Appendices B and C. Here we give a detailed analysis of the convergence rate. Substituting sigma and tau from (7) into the expressions for theta_x and theta_y in (8), and assuming gamma*(lambda + delta*mu^2) <= L^2, we have

    theta_x ~ 1 - (gamma*delta*mu^2/L^2) / ( sqrt(gamma*(lambda + delta*mu^2))/L + gamma*delta ) - (lambda/L) sqrt(gamma/(lambda + delta*mu^2)),
    theta_y = 1/(1 + sqrt(gamma*(lambda + delta*mu^2))/(2L)) ~ 1 - sqrt(gamma*(lambda + delta*mu^2))/(2L).

Since the overall condition number of the problem is L^2/(gamma*(lambda + delta*mu^2)), it is clear that theta_y is an accelerated convergence rate. Next we examine theta_x in two special cases.

The case of delta*mu^2 = 0 but lambda > 0. In this case we have tau = (1/L) sqrt(gamma/lambda) and sigma = (1/L) sqrt(lambda/gamma), and thus

    theta_x = 1/(1 + sqrt(gamma*lambda)/L) ~ 1 - sqrt(gamma*lambda)/L,    theta_y = 1/(1 + sqrt(gamma*lambda)/(2L)) ~ 1 - sqrt(gamma*lambda)/(2L).

Therefore theta = max{theta_x, theta_y} ~ 1 - sqrt(gamma*lambda)/(2L). This is indeed an accelerated convergence rate, recovering the result of Chambolle & Pock (2011; 2016).

The case of lambda = 0 but delta*mu^2 > 0. In this case we have tau = (1/(L*mu)) sqrt(gamma/delta) and sigma = (mu/L) sqrt(delta/gamma), and

    theta_x = 1 - (gamma*delta*mu^2/L^2) / ( sqrt(gamma*delta)*mu/L + gamma*delta ),    theta_y ~ 1 - sqrt(gamma*delta)*mu/(2L).

Notice that L^2/(gamma*delta*mu^2) is the condition number of f(Ax). Next we assume mu <= L and examine how theta_x varies with gamma*delta.

- If gamma*delta is no larger than mu^2/L^2, meaning f is badly conditioned, then theta_x ~ 1 - sqrt(gamma*delta)*mu/(3L). Because the overall condition number is L^2/(gamma*delta*mu^2), this is an accelerated linear rate, and so is theta = max{theta_x, theta_y}.
- If gamma*delta is on the order of mu/L, meaning f is mildly conditioned, then theta_x ~ 1 - (mu^3/L^3)/((mu/L)^{3/2} + mu/L) ~ 1 - mu^2/L^2. This represents a half-accelerated rate, because the overall condition number is L^2/(gamma*delta*mu^2) = L^3/mu^3.
- If gamma*delta = 1, i.e., f is a simple quadratic function, then theta_x ~ 1 - (mu^2/L^2)/(mu/L + 1) ~ 1 - mu^2/L^2. This rate has no acceleration, because the overall condition number is L^2/(gamma*delta*mu^2) = L^2/mu^2.

In summary, the extent of acceleration in the dominating factor theta_x (which determines theta) depends on the relative size of gamma*delta and mu^2/L^2, i.e., on the relative conditioning between the function f and the matrix A. In general, we have full acceleration when gamma*delta is no larger than mu^2/L^2. The theory predicts that the acceleration degrades as the function f becomes better conditioned. However, in our numerical experiments we often observe acceleration even when gamma*delta gets close to 1.

As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized various conditions for ADMM to obtain linear convergence, but did not derive the convergence rate for the case we consider in this paper.

Algorithm 2: Adaptive Batch Primal-Dual (Ada-BPD)
    input: problem constants lambda, gamma, delta, L and mu_hat > 0, initial point (x_0, y_0), and adaptation period T.
    Compute sigma, tau, and theta as in (7) and (8) using mu = mu_hat
    for t = 0, 1, 2, ... do
        y_{t+1} = prox_{sigma f*}( y_t + sigma A xbar_t )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T) == 0 then
            (sigma, tau, theta) = BPD-Adapt( {P_s, D_s}_{s=t+1-T}^{t+1} )

Algorithm 3: BPD-Adapt (simple heuristic)
    input: previous estimate mu_hat, adaptation period T, primal and dual objective values {P_s, D_s}_{s=t-T}^{t}
    if P_t - D_t < theta^T (P_{t-T} - D_{t-T}) then mu_hat := 2*mu_hat else mu_hat := mu_hat/2
    Compute sigma, tau, and theta as in (7) and (8) using mu = mu_hat
    output: new parameters (sigma, tau, theta)

2.1. Adaptive batch primal-dual algorithms

In practice, it is often very hard to obtain a good estimate of the problem-dependent constants, especially mu^2 = lambda_min(A^T A), in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that enable adaptive tuning of such parameters, which often leads to much improved performance in practice.

A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter lambda + delta*mu^2, regardless of the extent of acceleration. In other words, the larger lambda + delta*mu^2 is, the faster the convergence. Therefore, if we can monitor the progress of

the convergence and compare it with the predicted convergence rate in Theorem 1, then we can adjust the estimated parameters to better exploit strong convexity from data. More specifically, if the observed convergence rate is slower than the predicted rate, then we should reduce the estimate of mu; otherwise, we can increase it for faster convergence.

We formalize the above reasoning in Algorithm 2, called Ada-BPD. This algorithm maintains an estimate mu_hat of the true constant mu, and adjusts it every T iterations. We use P_t and D_t to denote the primal and dual objective values P(x_t) and D(y_t), respectively. We give two implementations of the tuning procedure BPD-Adapt. Algorithm 3 is a simple heuristic for tuning the estimate mu_hat, where the increasing and decreasing factor 2 can be changed to other values larger than 1. Algorithm 4 is a more robust heuristic: it does not rely on the specific convergence rate theta established in Theorem 1; instead, it simply compares the current estimate of the objective-reduction rate rho_hat with the previous estimate rho, and it specifies a non-tuning range of changes in rho, given by the interval [c_1, c_2].

Algorithm 4: BPD-Adapt (robust heuristic)
    input: previous rate estimate rho > 0, Delta = delta*mu_hat^2, period T, constants c_1 < 1 and c_2 > 1, and {P_s, D_s}_{s=t-T}^{t}
    Compute the new rate estimate rho_hat = (P_t - D_t) / (P_{t-T} - D_{t-T})
    if rho_hat <= c_1 * rho then Delta := 2*Delta, rho := rho_hat
    else if rho_hat >= c_2 * rho then Delta := Delta/2, rho := rho_hat
    sigma = (1/L) sqrt((lambda + Delta)/gamma),  tau = (1/L) sqrt(gamma/(lambda + Delta))
    Compute theta using (8), or set theta = 1
    output: new parameters (sigma, tau, theta)

The capability of accessing both the primal and dual objective values allows primal-dual algorithms to obtain good estimates of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.

3. Randomized primal-dual algorithm

In this section, we come back to the ERM problem (1) and consider its saddle-point formulation (3).
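As a concrete sanity check of the underlying iteration (Algorithm 1), the sketch below runs the batch primal-dual updates on ridge regression, where both proximal mappings have closed forms. It uses the plain Chambolle-Pock step sizes tau = sigma = 0.9/||A|| rather than the tuned choices in (7), so it illustrates only the mechanics, not the adaptive scheme; all names are our own.

```python
import numpy as np

def bpd_ridge(A, b, lam, iters=5000):
    """Algorithm 1 applied to min_x 0.5*||Ax - b||^2 + (lam/2)*||x||^2."""
    n, d = A.shape
    Lnorm = np.linalg.norm(A, 2)           # L = ||A|| (spectral norm)
    tau = sigma = 0.9 / Lnorm              # plain step sizes with tau*sigma*L^2 < 1
    theta = 1.0
    x = np.zeros(d); xbar = x.copy(); y = np.zeros(n)
    for _ in range(iters):
        # prox of sigma*f* for f(z) = 0.5*||z - b||^2 is u -> (u - sigma*b)/(1 + sigma)
        y = (y + sigma * (A @ xbar) - sigma * b) / (1.0 + sigma)
        # prox of tau*g for g(x) = (lam/2)*||x||^2 is u -> u/(1 + tau*lam)
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
lam = 0.1
x_pd = bpd_ridge(A, b, lam)
x_ref = np.linalg.solve(A.T @ A + lam * np.eye(6), A.T @ b)  # closed-form ridge solution
```

Since g is strongly convex and f is smooth here, the iterates converge linearly to the ridge solution even with these untuned step sizes; the adaptive schemes above are about making that rate reflect the data's strong convexity.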
Due to the finite-sum structure over the dual variables y_i, we can develop randomized algorithms that exploit strong convexity from data. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm of Zhang & Xiao (2015). SPDC is

Algorithm 5: Adaptive SPDC (Ada-SPDC)
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0), and adaptation period T.
    Set xbar_0 = x_0
    for t = 0, 1, 2, ... do
        pick k in {1, ..., n} uniformly at random
        for i in {1, ..., n} do
            if i == k then y^{(k)}_{t+1} = prox_{sigma phi_k*}( y^{(k)}_t + sigma a_k^T xbar_t )
            else y^{(i)}_{t+1} = y^{(i)}_t
        x_{t+1} = prox_{tau g}( x_t - tau ( u_t + (y^{(k)}_{t+1} - y^{(k)}_t) a_k ) )
        u_{t+1} = u_t + (1/n)(y^{(k)}_{t+1} - y^{(k)}_t) a_k
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T*n) == 0 then
            (tau, sigma, theta) = SPDC-Adapt( {P_{t-sn}, D_{t-sn}}_{s=0}^{T} )

a special case of the Ada-SPDC algorithm given in Algorithm 5, obtained by setting the adaptation period T = infinity (no adaptation). The following theorem is proved in Appendix E.

Theorem 2. Suppose Assumption 1 holds. Let (x*, y*) be the saddle point of the function L defined in (3), and let R = max{||a_1||, ..., ||a_n||}. If we set T = infinity in Algorithm 5 (no adaptation) and let

    tau = (1/(4R)) sqrt(gamma/(n*lambda + delta*mu^2)),    sigma = (1/(4R)) sqrt((n*lambda + delta*mu^2)/gamma),    (9)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(n*sigma + 4*delta)) * 1/(1 + tau*lambda),    theta_y = (1 + ((n-1)/n)(sigma*gamma/2)) / (1 + sigma*gamma/2),    (10)

then we have

    (1/(2*tau) + lambda/2) E[||x_t - x*||^2] + (gamma/4) E[||y_t - y*||^2] <= theta^t C,
    E[ L(x_t, y*) - L(x*, y_t) ] <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + gamma/4) ||y_0 - y*||^2. The expectation E[.] is taken with respect to the history of random indices drawn at each iteration.

Below we give a detailed discussion of the expected convergence rate established in Theorem 2.

The case of delta*mu^2 = 0 but lambda > 0. In this case tau = (1/(4R)) sqrt(gamma/(n*lambda)) and sigma = (1/(4R)) sqrt(n*lambda/gamma), and

    theta_x = 1/(1 + tau*lambda) with 1/(tau*lambda) = 4R sqrt(n/(lambda*gamma)),    theta_y = 1 - 1/(n + 8R sqrt(n/(lambda*gamma))).

Hence theta = theta_y. These recover the parameters and the convergence rate of the standard SPDC (Zhang & Xiao, 2015).

The case of delta*mu^2 > 0 but lambda = 0. In this case we have tau = (1/(4R*mu)) sqrt(gamma/delta) and sigma = (mu/(4R)) sqrt(delta/gamma), and

    theta_x = 1 - (tau*sigma*delta*mu^2)/(n*sigma + 4*delta) = 1 - (gamma*delta*mu^2/(16R^2)) / ( n*sqrt(gamma*delta)*mu/(4R) + 4*gamma*delta ),
    theta_y = 1 - (1/n) * (sqrt(gamma*delta)*mu/(8R)) / (1 + sqrt(gamma*delta)*mu/(8R)).

Since the objective is (R^2/gamma)-smooth and (delta*mu^2/n)-strongly convex, theta_y is an accelerated rate if sqrt(gamma*delta)*mu <= 8R; otherwise theta_y ~ 1 - 1/(2n). For theta_x, we consider different situations:

- If mu is close to R, then theta_x is an accelerated rate, and so is theta = max{theta_x, theta_y}.
- If mu < R and gamma*delta is no larger than mu^2/R^2, then theta_x still represents an accelerated rate. The iteration complexity of SPDC in this case is Otilde(n + n R/(sqrt(gamma*delta)*mu)), which is better than that of SVRG, namely Otilde(n + n R^2/(gamma*delta*mu^2)).
- If mu < R and gamma*delta is on the order of mu/R, then theta_x gives a half-accelerated rate: SVRG requires Otilde(n R^3/mu^3) iterations in this case, versus Otilde(n R^2/mu^2) for SPDC.
- If mu < R and gamma*delta is close to 1, meaning the phi_i are well conditioned, then theta_x gives a non-accelerated rate, and the corresponding iteration complexity is the same as that of SVRG.

3.1. Parameter adaptation for SPDC

The SPDC-Adapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize is that the adaptation period T is measured in epochs, i.e., in numbers of passes over the data. In addition, we only compute the primal and dual objective values after each pass (or every few passes), because computing them exactly usually requires a full pass over the data.

Unlike the batch case, where the duality gap decreases monotonically, the duality gap of randomized algorithms can fluctuate wildly. So instead of using only the two end values P_{t-Tn} - D_{t-Tn} and P_t - D_t, we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual objective values for the last T+1 passes are (P(0), D(0)), (P(1), D(1)), ..., (P(T), D(T)), and we need to estimate a per-pass rate rho such that

    P(t) - D(t) ~ rho^t ( P(0) - D(0) ),    t = 1, ..., T.
We can turn this into a linear regression problem after taking logarithms, and obtain the estimate rho_hat through

    log rho_hat = (1/(1 + 2 + ... + T)) sum_{t=1}^{T} log( (P(t) - D(t)) / (P(0) - D(0)) ).

Algorithm 6: Dual-Free BPD
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0)
    Set xbar_0 = x_0 and v_0 = grad f*(y_0)
    for t = 0, 1, 2, ... do
        v_{t+1} = ( v_t + sigma A xbar_t ) / (1 + sigma),    y_{t+1} = grad f( v_{t+1} )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)

4. Dual-free primal-dual algorithms

Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function f* (or phi_i*), which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan & Zhou (2015) developed dual-free variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself. We show how to apply this approach to solve the structured ERM problems considered in this paper. The resulting algorithms can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.

4.1. Dual-free BPD algorithm

First, we consider the batch setting. We replace the dual proximal mapping (computing y_{t+1}) in Algorithm 1 with

    y_{t+1} = arg min_y { f*(y) - y^T A xbar_t + (1/sigma) D(y, y_t) },    (11)

where D is the Bregman divergence of a strictly convex kernel function h, defined as

    D_h(y, y_t) = h(y) - h(y_t) - <grad h(y_t), y - y_t>.

Algorithm 1 is obtained in the Euclidean setting with h(y) = (1/2)||y||^2 and D(y, y_t) = (1/2)||y - y_t||^2. Here we use f* as the kernel function, and show that this allows us to compute y_{t+1} in (11) very efficiently. The following lemma explains the details (cf. Lan & Zhou, 2015, Lemma 1).

Lemma 1. Let the kernel be h = f* in the Bregman divergence D. If we construct a sequence of vectors {v_t} such that v_0 = grad f*(y_0) and, for all t >= 0,

    v_{t+1} = ( v_t + sigma A xbar_t ) / (1 + sigma),    (12)

then the solution to problem (11) is y_{t+1} = grad f( v_{t+1} ).

Proof.
Suppose v_t = grad f*(y_t), which is true for t = 0. Then D(y, y_t) = f*(y) - f*(y_t) - v_t^T (y - y_t).

The solution to (11) can then be written as

    y_{t+1} = arg min_y { f*(y) - y^T A xbar_t + (1/sigma)( f*(y) - v_t^T y ) }
            = arg min_y { (1 + 1/sigma) f*(y) - y^T ( A xbar_t + (1/sigma) v_t ) }
            = arg max_y { y^T (v_t + sigma A xbar_t)/(1 + sigma) - f*(y) }
            = arg max_y { y^T v_{t+1} - f*(y) }
            = grad f( v_{t+1} ),

where in the last equality we used the property of conjugate functions when f is strongly convex and smooth. Moreover, v_{t+1} = grad f*( grad f(v_{t+1}) ) = grad f*( y_{t+1} ), which completes the proof.

According to Lemma 1, we only need to provide an initial point such that v_0 = grad f*(y_0) is easy to compute. We do not need to compute grad f*(y_t) directly for any t > 0, because it can be updated as v_t in (12). Consequently, we can update y_t in the BPD algorithm using the gradient grad f(v_t), without the need of the dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.

Theorem 3. Suppose Assumption 2 holds and let (x*, y*) be the unique saddle point of L defined in (6). If we set the parameters in Algorithm 6 as

    tau = (1/L) sqrt(gamma/(lambda + delta*mu^2)),    sigma = (1/L) sqrt(gamma*(lambda + delta*mu^2)),    (13)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(4(1 + sigma))) * 1/(1 + tau*lambda),    theta_y = 1/(1 + sigma/2),    (14)

then we have

    (1/(2*tau) + lambda/2) ||x_t - x*||^2 + D(y*, y_t) <= theta^t C,
    L(x_t, y*) - L(x*, y_t) <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + 1) D(y*, y_0).

Theorem 3 is proved in Appendices B and D. Assuming gamma*(lambda + delta*mu^2) <= L^2, we have

    theta_x <= 1 - gamma*delta*mu^2/(16 L^2) - (lambda/(2L)) sqrt(gamma/(lambda + delta*mu^2)),    theta_y <= 1 - sqrt(gamma*(lambda + delta*mu^2))/(4L).

Again, we gain insight by considering the special cases:

- If delta*mu^2 = 0 and lambda > 0, then theta_y ~ 1 - sqrt(gamma*lambda)/(4L) and theta_x ~ 1 - sqrt(gamma*lambda)/L. So theta = max{theta_x, theta_y} is an accelerated rate.
- If delta*mu^2 > 0 and lambda = 0, then theta_y ~ 1 - sqrt(gamma*delta)*mu/(4L) and theta_x ~ 1 - gamma*delta*mu^2/(16 L^2). Thus theta = max{theta_x, theta_y} ~ 1 - gamma*delta*mu^2/(16 L^2), which is not accelerated. This conclusion does not depend on the relative sizes of gamma*delta and mu^2/L^2, and this is the major difference from the Euclidean case in Section 2.
- If both delta*mu^2 > 0 and lambda > 0, then the extent of acceleration depends on their relative size. If lambda is on the same order as delta*mu^2 or larger, then an accelerated rate is obtained. If lambda is much smaller than delta*mu^2, then the theory predicts no acceleration.

Algorithm 7: Adaptive Dual-Free SPDC (ADF-SPDC)
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0), and adaptation period T.
    Set xbar_0 = x_0 and v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ) for i = 1, ..., n
    for t = 0, 1, 2, ... do
        pick k in {1, ..., n} uniformly at random
        for i in {1, ..., n} do
            if i == k then
                v^{(k)}_{t+1} = ( v^{(k)}_t + sigma a_k^T xbar_t ) / (1 + sigma),    y^{(k)}_{t+1} = grad phi_k( v^{(k)}_{t+1} )
            else
                v^{(i)}_{t+1} = v^{(i)}_t,    y^{(i)}_{t+1} = y^{(i)}_t
        x_{t+1} = prox_{tau g}( x_t - tau ( u_t + (y^{(k)}_{t+1} - y^{(k)}_t) a_k ) )
        u_{t+1} = u_t + (1/n)(y^{(k)}_{t+1} - y^{(k)}_t) a_k
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T*n) == 0 then
            (tau, sigma, theta) = SPDC-Adapt( {P_{t-sn}, D_{t-sn}}_{s=0}^{T} )

4.2. Dual-free SPDC algorithm

Following the same approach, we can derive an adaptive dual-free SPDC algorithm, given in Algorithm 7. On related work, Shalev-Shwartz & Zhang (2016) and Shalev-Shwartz (2016) introduced dual-free SDCA. The following theorem characterizes the choice of algorithmic parameters that can exploit strong convexity from data to achieve linear convergence (the proof is given in Appendix F).

Theorem 4. Suppose Assumption 1 holds. Let (x*, y*) be the saddle point of L in (3), and R = max{||a_1||, ..., ||a_n||}. If we set T = infinity in Algorithm 7 (no adaptation) and let

    sigma = (1/(4R)) sqrt(gamma*(n*lambda + delta*mu^2)),    tau = (1/(4R)) sqrt(gamma/(n*lambda + delta*mu^2)),    (15)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(4n(1 + sigma))) * 1/(1 + tau*lambda),    theta_y = (1 + ((n-1)/n)(sigma/2)) / (1 + sigma/2),    (16)

then we have

    (1/(2*tau) + lambda/2) E[||x_t - x*||^2] + (gamma/4) E[ D(y*, y_t) ] <= theta^t C,
    E[ L(x_t, y*) - L(x*, y_t) ] <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + 1) D(y*, y_0).
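To illustrate what "dual-free" buys in practice, the sketch below instantiates the batch variant (Algorithm 6) for regularized logistic regression: the dual update reduces to evaluating the logistic gradient y_{t+1} = grad f(v_{t+1}), and no conjugate proximal mapping is needed. The step sizes and iteration count are hypothetical choices for a tiny problem, not the tuned values of Theorem 3; all names are our own.

```python
import numpy as np

def grad_f(v, b, n):
    # f(z) = (1/n) * sum_i log(1 + exp(-b_i * z_i)); componentwise gradient
    return -b / (n * (1.0 + np.exp(b * v)))

def dual_free_bpd_logistic(A, b, lam, sigma=1.0, tau=0.05, theta=1.0, iters=20000):
    n, d = A.shape
    x = np.zeros(d); xbar = x.copy()
    v = np.zeros(n)   # v_0 = grad f*(y_0) = 0 for the choice y_0 = grad f(0)
    for _ in range(iters):
        v = (v + sigma * (A @ xbar)) / (1.0 + sigma)       # Bregman step (12)
        y = grad_f(v, b, n)                                # y_{t+1} = grad f(v_{t+1})
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)  # prox of tau*g, g = (lam/2)||x||^2
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
b = rng.choice([-1.0, 1.0], size=20)
lam = 0.1
x_hat = dual_free_bpd_logistic(A, b, lam)
# At a stationary point of P(x) = f(Ax) + (lam/2)||x||^2:  A^T grad_f(A x) + lam*x = 0
grad_P = A.T @ grad_f(A @ x_hat, b, 20) + lam * x_hat
```

The fixed point of these updates satisfies v = A x and A^T grad f(Ax) + lam*x = 0, i.e., it is exactly a stationary point of the primal objective, which is what the gradient check above verifies.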

[Figure 1: Comparison of batch primal-dual algorithms (Primal AG, BPD, Opt-BPD, Ada-BPD) on a ridge regression problem with n = 5000 and d = 3000; panels show the primal optimality gap on synthetic data for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

[Figure 2: Comparison of randomized algorithms (SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC) for ridge regression on the cpuact data, for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

Now we discuss the results of Theorem 4 in further detail.

The case of delta*mu^2 = 0 but lambda > 0. In this case tau = (1/(4R)) sqrt(gamma/(n*lambda)) and sigma = (1/(4R)) sqrt(n*gamma*lambda), and

    theta_x = 1/(1 + tau*lambda) with 1/(tau*lambda) = 4R sqrt(n/(lambda*gamma)),    theta_y = 1 - 1/(n + 8R sqrt(n/(lambda*gamma))).

The rate is the same as for SPDC in Zhang & Xiao (2015).

The case of delta*mu^2 > 0 but lambda = 0. In this case tau = (1/(4R*mu)) sqrt(gamma/delta) and sigma = (mu/(4R)) sqrt(delta*gamma), and thus

    theta_x = 1 - (gamma*delta*mu^2/(16R^2)) / ( 4n(1 + sigma) ).

We note that the primal function is now (R^2/gamma)-smooth and (delta*mu^2/n)-strongly convex. We discuss the following cases:

- If sqrt(gamma*delta)*mu is larger than R, then theta_x ~ 1 - sqrt(gamma*delta)*mu/(8Rn) and theta_y ~ 1 - 1/(2n). Therefore theta = max{theta_x, theta_y} = 1 - Theta(1/n).
- Otherwise, theta_x ~ 1 - gamma*delta*mu^2/(64 n R^2) and theta_y is of the same order. This is not an accelerated rate, and we have the same iteration complexity as SVRG.

Finally, we give concrete examples of how to compute initial points y_0 and v_0 such that v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ). For the squared loss, phi_i(alpha) = (1/2)(alpha - b_i)^2 and phi_i*(beta) = (1/2)beta^2 + b_i*beta, so v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ) = y^{(i)}_0 + b_i. For logistic regression, we have b_i in {+1, -1} and phi_i(alpha) = log(1 + exp(-b_i*alpha)). The conjugate function is

    phi_i*(beta) = (-b_i*beta) log(-b_i*beta) + (1 + b_i*beta) log(1 + b_i*beta)  if b_i*beta in [-1, 0], and +infinity otherwise.

We can choose y^{(i)}_0 = -b_i/2 and v^{(i)}_0 = 0, so that v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ). For logistic regression, we have delta = 0 over the full domain of phi_i. However, each phi_i is locally strongly convex on a bounded domain (Bach, 2014): if z in [-B, B], then delta = min_z phi_i''(z) >= exp(-B)/4. Therefore it is well suited for an adaptation scheme, similar to Algorithm 4, that does not require knowledge of either delta or mu.

5. Preliminary experiments

We present preliminary experiments to demonstrate the effectiveness of our proposed algorithms.
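The logistic-regression initialization described above (y_0^{(i)} = -b_i/2 with v_0^{(i)} = 0) can be checked numerically: the derivative of the conjugate phi_i* vanishes at beta = -b_i/2, so grad phi_i*(y_0^{(i)}) = 0 = v_0^{(i)}. A small self-contained verification using a finite-difference derivative (our own illustration):

```python
import numpy as np

def phi_star(beta, b):
    """Conjugate of phi(alpha) = log(1 + exp(-b*alpha)), finite for b*beta in [-1, 0]."""
    t = -b * beta                          # t in [0, 1]
    def xlogx(u):                          # convention: 0 * log(0) = 0
        return 0.0 if u == 0.0 else u * np.log(u)
    return xlogx(t) + xlogx(1.0 - t)       # t*log(t) + (1-t)*log(1-t)

derivs = {}
for b in (-1.0, 1.0):
    beta0 = -b / 2.0                       # proposed initial dual point y_0
    eps = 1e-6
    # centered finite-difference derivative of phi* at beta0; should be ~0
    derivs[b] = (phi_star(beta0 + eps, b) - phi_star(beta0 - eps, b)) / (2 * eps)
```

The minimum of t*log(t) + (1-t)*log(1-t) sits at t = 1/2, i.e., at beta = -b/2, which is exactly why this pair (y_0, v_0) is consistent with v_0 = grad phi*(y_0).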
First, we consider batch primal-dual algorithms for ridge regression on a synthetic dataset. The data matrix A has n = 5000 rows and d = 3000 columns, and its entries are sampled from a multivariate normal distribution with mean zero and a covariance matrix whose (i, j) entry decays with |i - j|. We normalize all datasets

such that a_i := a_i / max_j ||a_j||, to ensure that the maximum norm of the data points is 1. We use l2 regularization g(x) = (lambda/2)||x||^2 with three choices of the parameter lambda: 1/n, 10^{-2}/n and 10^{-4}/n, which represent strong, medium, and weak levels of regularization, respectively. Figure 1 shows the performance of four different algorithms: the primal accelerated gradient (Primal AG) algorithm (Nesterov, 2004) using lambda as the strong convexity parameter; the BPD algorithm (Algorithm 1) using the same lambda and delta*mu^2 = 0; the optimal BPD algorithm (Opt-BPD), which uses mu^2 = lambda_min(A^T A) computed from the data; and the Ada-BPD algorithm (Algorithm 2) with the robust adaptation heuristic (Algorithm 4), using T = 10, c_1 = 0.95 and c_2 = 1.5. As expected, the performance of Primal AG is very similar to that of BPD, and Opt-BPD has the fastest convergence. The Ada-BPD algorithm can partially exploit strong convexity from data without knowledge of mu.

Next we compare DF-SPDC (the dual-free SPDC of Algorithm 7 without adaptation) and ADF-SPDC (Algorithm 7 with adaptation) against several state-of-the-art randomized algorithms for ERM: SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), Katyusha (Allen-Zhu, 2016) and the standard SPDC method (Zhang & Xiao, 2015). For SVRG and Katyusha (an accelerated variant of SVRG), we choose the variance-reduction period as m = 2n. The step sizes of all algorithms are set as suggested in their original papers. For Ada-SPDC and ADF-SPDC, we use the robust adaptation scheme with T = 10, c_1 = 0.95 and c_2 = 1.5.

We first compare these randomized algorithms for ridge regression on the cpuact data from the LibSVM website (cjlin/libsvm/). The results are shown in Figure 2. With relatively strong regularization, lambda = 1/n, all methods perform similarly, as predicted by theory. When lambda becomes smaller, the non-accelerated algorithms SVRG and SAGA automatically exploit strong convexity from data, so they become faster than the non-adaptive accelerated methods (Katyusha, SPDC and DF-SPDC). The adaptive accelerated method ADF-SPDC has the fastest convergence. This indicates that our theoretical results, which predict no acceleration in this case, may be further improved.

Finally, we compare these algorithms for logistic regression on the rcv1 dataset from the LibSVM website and on another synthetic dataset with n = 5000 and d = 500, generated similarly as before but with a more slowly decaying covariance. For the standard SPDC, we compute the coordinate-wise dual proximal mapping using a few steps of a scalar Newton's method to high precision. The dual-free SPDC algorithms only use gradients of the logistic function. The results are presented in Figure 3. For both datasets, the strong convexity from data is very weak, and the accelerated algorithms perform better.

[Figure 3: Comparison of randomized algorithms (SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC) for logistic regression on synthetic and rcv1 data, for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

References

Allen-Zhu, Zeyuan. Katyusha: Accelerated variance reduction for faster SGD. arXiv e-print, 2016.

Bach, Francis. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595-627, 2014.

Balamurugan, Palaniappan and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems (NIPS) 29, 2016.

Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning, chapter 4. MIT Press, 2011.

Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

Chambolle, Antonin and Pock, Thomas. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, Series A, 159:253-287, 2016.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3), 2016.

Fercoq, Olivier and Richtarik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997-2023, 2015.

Goldstein, Tom, Li, Min, Yuan, Xiaoming, Esser, Ernie, and Baraniuk, Richard. Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Lan, Guanghui and Zhou, Yi. An optimal randomized incremental gradient method. arXiv preprint, 2015.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015a.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244-2273, 2015b.

Malitsky, Yura and Pock, Thomas. A first-order primal-dual algorithm with linesearch. arXiv preprint, 2016.

Nedic, Angelia and Bertsekas, Dimitri P. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109-138, 2001.

Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.


More information

A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality A. Derivation of regularized ERM dualit For completeness, in this section we derive the dual 5 to the problem of computing proximal operator for the ERM objective 3. We can rewrite the primal problem as

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms On the Iteration Complexity of Oblivious First-Order Optimization Algorithms Yossi Arjevani Weizmann Institute of Science, Rehovot 7610001, Israel Ohad Shamir Weizmann Institute of Science, Rehovot 7610001,

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Yossi Arjevani Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot 7610001,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Coordinate Descent Faceoff: Primal or Dual?

Coordinate Descent Faceoff: Primal or Dual? JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Stochastic Dual Coordinate Ascent with Adaptive Probabilities Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Adaptive Primal Dual Optimization for Image Processing and Learning

Adaptive Primal Dual Optimization for Image Processing and Learning Adaptive Primal Dual Optimization for Image Processing and Learning Tom Goldstein Rice University tag7@rice.edu Ernie Esser University of British Columbia eesser@eos.ubc.ca Richard Baraniuk Rice University

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

SEGA: Variance Reduction via Gradient Sketching

SEGA: Variance Reduction via Gradient Sketching SEGA: Variance Reduction via Gradient Sketching Filip Hanzely 1 Konstantin Mishchenko 1 Peter Richtárik 1,2,3 1 King Abdullah University of Science and Technology, 2 University of Edinburgh, 3 Moscow Institute

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand,

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand, 1 Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server Arda Aytein Student Member, IEEE, Hamid Reza Feyzmahdavian, Student Member, IEEE, and Miael Johansson Member,

More information

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods Journal of Machine Learning Research 8 8-5 Submitted 8/6; Revised 5/7; Published 6/8 : The First Direct Acceleration of Stochastic Gradient Methods Zeyuan Allen-Zhu Microsoft Research AI Redmond, WA 985,

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem 1 Conve Analsis Main references: Vandenberghe UCLA): EECS236C - Optimiation methods for large scale sstems, http://www.seas.ucla.edu/ vandenbe/ee236c.html Parikh and Bod, Proimal algorithms, slides and

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

The basic problem of interest in this paper is the convex programming (CP) problem given by

The basic problem of interest in this paper is the convex programming (CP) problem given by Noname manuscript No. (will be inserted by the editor) An optimal randomized incremental gradient method Guanghui Lan Yi Zhou the date of receipt and acceptance should be inserted later arxiv:1507.02000v3

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

arxiv: v2 [cs.na] 3 Apr 2018

arxiv: v2 [cs.na] 3 Apr 2018 Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Primal-dual coordinate descent

Primal-dual coordinate descent Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Fast-and-Light Stochastic ADMM

Fast-and-Light Stochastic ADMM Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Fast-and-Light Stochastic ADMM Shuai Zheng James T. Kwok Department of Computer Science and Engineering

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Stochastic gradient methods for machine learning
