Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms

Jialei Wang, Lin Xiao

(Department of Computer Science, The University of Chicago, Chicago, Illinois 60637, USA. Microsoft Research, Redmond, Washington 98052, USA. Correspondence to: Jialei Wang <jialei@uchicago.edu>, Lin Xiao <lin.xiao@microsoft.com>. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the authors.)

Abstract

We consider empirical risk minimization of linear predictors with convex loss functions. Such problems can be reformulated as convex-concave saddle-point problems and solved by primal-dual first-order algorithms. However, primal-dual algorithms often require explicit strongly convex regularization in order to obtain fast linear convergence, and the required dual proximal mapping may not admit a closed-form or efficient solution. In this paper, we develop both batch and randomized primal-dual algorithms that can exploit strong convexity from data adaptively and are capable of achieving linear convergence even without regularization. We also present dual-free variants of the adaptive primal-dual algorithms that do not need the dual proximal mapping, which are especially suitable for logistic regression.

1. Introduction

We consider the problem of regularized empirical risk minimization (ERM) of linear predictors. Let a_1, ..., a_n in R^d be the feature vectors of n data samples, phi_i : R -> R be a convex loss function associated with the linear prediction a_i^T x, for i = 1, ..., n, and g : R^d -> R be a convex regularization function for the predictor x in R^d. ERM amounts to solving the following convex optimization problem:

    min_{x in R^d}  P(x) := (1/n) sum_{i=1}^n phi_i(a_i^T x) + g(x).    (1)

This formulation covers many well-known classification and regression problems. For example, logistic regression is obtained by setting phi_i(z) = log(1 + exp(-b_i z)), where b_i in {+1, -1}. For linear regression problems, the loss function is phi_i(z) = (1/2)(z - b_i)^2, and we get ridge regression with g(x) = (lambda/2)||x||^2 and the elastic net with g(x) = lambda_1 ||x||_1 + (lambda_2/2)||x||^2.

Let A = [a_1, ..., a_n]^T be the n-by-d data matrix. Throughout this paper, we make the following assumptions:

Assumption 1. The functions phi_i, g and the matrix A satisfy:
- each phi_i is delta-strongly convex and (1/gamma)-smooth, where gamma > 0, delta >= 0, and gamma*delta <= 1;
- g is lambda-strongly convex, where lambda >= 0;
- lambda + delta*mu^2 > 0, where mu^2 = lambda_min(A^T A).

The strong convexity and smoothness mentioned above are with respect to the standard Euclidean norm, denoted ||x|| = sqrt(x^T x); see Nesterov (2004) for the exact definitions. We allow delta = 0, which simply means the phi_i are convex. Let R = max_i ||a_i||. Assuming lambda > 0, the quantity R^2/(gamma*lambda) is a popular definition of the condition number used in analyzing the complexity of different algorithms. The last condition above means that the primal objective function P(x) is strongly convex, even if lambda = 0.

There has been extensive research activity in recent years on developing efficient algorithms for solving problem (1). A broad class of randomized algorithms that exploit the finite-sum structure of the ERM problem have emerged as very competitive in terms of both theoretical complexity and practical performance. They can be put into three categories: primal, dual, and primal-dual.

Primal randomized algorithms work with the ERM problem (1) directly. They are modern versions of randomized incremental gradient methods (e.g., Bertsekas, 2011; Nedic & Bertsekas, 2001) equipped with variance-reduction techniques. Each iteration of such algorithms processes only one data point a_i, with complexity O(d). They include SAG (Roux et al., 2012), SAGA (Defazio et al., 2014), and SVRG (Johnson & Zhang, 2013; Xiao & Zhang, 2014), which all achieve the iteration complexity O((n + R^2/(gamma*lambda)) log(1/eps)) to find an eps-optimal solution. In fact, they are capable of exploiting strong convexity from data, meaning that the condition number R^2/(gamma*lambda) in the complexity can be replaced by the more favorable one R^2/(gamma*lambda + delta*mu^2/n). This improvement can be achieved without explicit knowledge of mu.
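To make the setup concrete, the ERM objective (1) can be written in a few lines of NumPy. The sketch below instantiates it for ridge regression (squared loss with l2 regularization); the data and function names are our own illustration, not code from the paper.

```python
import numpy as np

def ridge_erm_objective(x, A, b, lam):
    """P(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + (lam/2) * ||x||^2."""
    residual = A @ x - b                   # linear predictions minus targets
    loss = 0.5 * np.mean(residual ** 2)    # (1/n) sum_i phi_i(a_i^T x)
    reg = 0.5 * lam * np.dot(x, x)         # g(x) = (lam/2) * ||x||^2
    return loss + reg

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))          # n = 100 samples, d = 5 features
b = rng.standard_normal(100)
x = np.zeros(5)
print(ridge_erm_objective(x, A, b, lam=0.1))   # at x = 0 this is 0.5 * mean(b^2)
```

Swapping the squared loss for `log(1 + exp(-b_i * z))` gives the logistic-regression instance of (1).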

Dual algorithms solve the Fenchel dual of (1) by maximizing

    D(y) := -(1/n) sum_{i=1}^n phi_i*(y_i) - g*( -(1/n) sum_{i=1}^n y_i a_i )    (2)

using randomized coordinate ascent algorithms. Here phi_i* and g* denote the conjugate functions of phi_i and g. These methods include SDCA (Shalev-Shwartz & Zhang, 2013), Nesterov (2012) and Richtarik & Takac (2014). They have the same complexity O((n + R^2/(gamma*lambda)) log(1/eps)), but cannot exploit strong convexity from data, if any (i.e., when delta*mu^2 > 0).

Primal-dual algorithms solve the convex-concave saddle-point problem min_x max_y L(x, y), where

    L(x, y) := (1/n) sum_{i=1}^n ( y_i <a_i, x> - phi_i*(y_i) ) + g(x).    (3)

In particular, SPDC (Zhang & Xiao, 2015) achieves an accelerated linear convergence rate with iteration complexity O((n + sqrt(n R^2/(gamma*lambda))) log(1/eps)), which is better than the aforementioned non-accelerated complexity when R^2/(gamma*lambda) > n. Lan & Zhou (2015) developed dual-free variants of accelerated primal-dual algorithms, but without considering the linear-predictor structure in ERM. Balamurugan & Bach (2016) extended SVRG and SAGA to solving saddle-point problems.

Accelerated primal and dual randomized algorithms have also been developed. Nesterov (2012), Fercoq & Richtarik (2015) and Lin et al. (2015b) developed accelerated coordinate gradient algorithms, which can be applied to solve the dual problem (2). Allen-Zhu (2016) developed an accelerated variant of SVRG. Acceleration can also be obtained using the Catalyst framework (Lin et al., 2015a). They all achieve the same O((n + sqrt(n R^2/(gamma*lambda))) log(1/eps)) complexity. A common feature of accelerated algorithms is that they require a good estimate of the strong convexity parameter. This makes it hard for them to exploit strong convexity from data, because the minimum singular value mu of the data matrix A is very hard to estimate in general.

In this paper, we show that primal-dual algorithms are capable of exploiting strong convexity from data if the algorithm parameters (such as step sizes) are set appropriately. While these optimal settings depend on knowledge of the convexity parameter mu of the data, we develop adaptive variants of primal-dual algorithms that can tune the parameter automatically.
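The adaptive variants rely on evaluating both objectives in (1) and (2). For ridge regression, both the primal and dual objectives have closed forms, so the duality gap P(x) - D(y) can be checked numerically. The sketch below is our own illustration (hypothetical helper names), assuming the squared loss and g(x) = (lam/2)||x||^2:

```python
import numpy as np

def primal(x, A, b, lam):
    """P(x) for ridge regression, as in (1)."""
    n = A.shape[0]
    return 0.5 * np.mean((A @ x - b) ** 2) + 0.5 * lam * (x @ x)

def dual(y, A, b, lam):
    """D(y) from (2): phi_i*(beta) = 0.5*beta^2 + b_i*beta, g*(u) = ||u||^2/(2*lam)."""
    n = A.shape[0]
    u = A.T @ y / n
    return -np.mean(0.5 * y ** 2 + b * y) - (u @ u) / (2.0 * lam)

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 4))
b = rng.standard_normal(50)
lam, n = 0.5, 50

# Optimal primal point, and the matching dual point y*_i = a_i^T x* - b_i
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(4), A.T @ b / n)
y_star = A @ x_star - b

gap = primal(x_star, A, b, lam) - dual(y_star, A, b, lam)  # ~0 at the saddle point
```

Weak duality guarantees P(x) >= D(y) for any pair (x, y), and the gap vanishes only at the saddle point, which is what makes it a usable progress monitor.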
Such adaptive schemes rely critically on the capability of primal-dual algorithms to evaluate the primal-dual optimality gap.

A major disadvantage of primal-dual algorithms is that the required dual proximal mapping may not admit a closed-form or efficient solution. We follow the approach of Lan & Zhou (2015) to derive dual-free variants of the primal-dual algorithms customized for ERM problems with the linear-predictor structure, and show that they can also exploit strong convexity from data with correct choices of parameters or by using an adaptation scheme.

Algorithm 1: Batch Primal-Dual (BPD)
    input: parameters sigma, tau, theta, initial point x_0 = xbar_0, y_0
    for t = 0, 1, 2, ... do
        y_{t+1} = prox_{sigma f*}( y_t + sigma A xbar_t )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)

2. Batch primal-dual algorithms

We first study batch primal-dual algorithms, by considering a batch version of the ERM problem (1),

    min_{x in R^d}  P(x) := f(Ax) + g(x),    (4)

where A in R^{n x d}. We make the following assumptions:

Assumption 2. The functions f, g and the matrix A satisfy:
- f is delta-strongly convex and (1/gamma)-smooth, where gamma > 0, delta >= 0, and gamma*delta <= 1;
- g is lambda-strongly convex, where lambda >= 0;
- lambda + delta*mu^2 > 0, where mu^2 = lambda_min(A^T A).

Using conjugate functions, we can derive the dual of (4) as

    max_{y in R^n}  D(y) := -f*(y) - g*(-A^T y),    (5)

and the convex-concave saddle-point formulation is

    min_{x in R^d} max_{y in R^n}  L(x, y) := g(x) + y^T A x - f*(y).    (6)

We consider the primal-dual first-order algorithm proposed by Chambolle & Pock (2011; 2016) for solving the saddle-point problem (6), given in Algorithm 1, where prox_psi, for any convex function psi : R^n -> R union {+infinity}, is defined as

    prox_psi(beta) = arg min_{alpha in R^n} { psi(alpha) + (1/2) ||alpha - beta||^2 }.

Assuming that f is smooth and g is strongly convex, Chambolle & Pock (2011; 2016) showed that Algorithm 1 achieves an accelerated linear convergence rate if lambda > 0. However, they did not consider the case where an additional (or the sole) source of strong convexity comes from f(Ax). In the following theorem, we show how to set the parameters tau, sigma and theta to exploit both sources of strong convexity to achieve fast linear convergence.

Theorem 1. Suppose Assumption 2 holds and (x*, y*) is the unique saddle point of L defined in (6). Let L = ||A|| = sqrt(lambda_max(A^T A)). If we set the parameters in Algorithm 1 as

    sigma = (1/L) sqrt((lambda + delta*mu^2)/gamma),    tau = (1/L) sqrt(gamma/(lambda + delta*mu^2)),    (7)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - delta*mu^2 / ((delta + sigma) L^2)) * 1/(1 + tau*lambda),    theta_y = 1/(1 + sigma*gamma/2),    (8)

then we have

    (1/(2*tau) + lambda/2) ||x_t - x*||^2 + (gamma/4) ||y_t - y*||^2 <= theta^t C,
    L(x_t, y*) - L(x*, y_t) <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + gamma/4) ||y_0 - y*||^2.

The proof of Theorem 1 is given in Appendices B and C. Here we give a detailed analysis of the convergence rate. Substituting sigma and tau from (7) into the expressions for theta_x and theta_y in (8), and assuming gamma*(lambda + delta*mu^2) <= L^2, we have

    theta_x ~ 1 - (gamma*delta*mu^2/L^2) / ( sqrt(gamma*(lambda + delta*mu^2))/L + gamma*delta ) - (lambda/L) sqrt(gamma/(lambda + delta*mu^2)),
    theta_y = 1/(1 + sqrt(gamma*(lambda + delta*mu^2))/(2L)) ~ 1 - sqrt(gamma*(lambda + delta*mu^2))/(2L).

Since the overall condition number of the problem is L^2/(gamma*(lambda + delta*mu^2)), it is clear that theta_y is an accelerated convergence rate. Next we examine theta_x in two special cases.

The case of delta*mu^2 = 0 but lambda > 0. In this case we have tau = (1/L) sqrt(gamma/lambda) and sigma = (1/L) sqrt(lambda/gamma), and thus

    theta_x = 1/(1 + sqrt(gamma*lambda)/L) ~ 1 - sqrt(gamma*lambda)/L,    theta_y = 1/(1 + sqrt(gamma*lambda)/(2L)) ~ 1 - sqrt(gamma*lambda)/(2L).

Therefore theta = max{theta_x, theta_y} ~ 1 - sqrt(gamma*lambda)/(2L). This is indeed an accelerated convergence rate, recovering the result of Chambolle & Pock (2011; 2016).

The case of lambda = 0 but delta*mu^2 > 0. In this case we have tau = (1/(L*mu)) sqrt(gamma/delta) and sigma = (mu/L) sqrt(delta/gamma), and

    theta_x = 1 - (gamma*delta*mu^2/L^2) / ( sqrt(gamma*delta)*mu/L + gamma*delta ),    theta_y ~ 1 - sqrt(gamma*delta)*mu/(2L).

Notice that L^2/(gamma*delta*mu^2) is the condition number of f(Ax). Next we assume mu <= L and examine how theta_x varies with gamma*delta.

- If gamma*delta is no larger than mu^2/L^2, meaning f is badly conditioned, then theta_x ~ 1 - sqrt(gamma*delta)*mu/(3L). Because the overall condition number is L^2/(gamma*delta*mu^2), this is an accelerated linear rate, and so is theta = max{theta_x, theta_y}.
- If gamma*delta is on the order of mu/L, meaning f is mildly conditioned, then theta_x ~ 1 - (mu^3/L^3)/((mu/L)^{3/2} + mu/L) ~ 1 - mu^2/L^2. This represents a half-accelerated rate, because the overall condition number is L^2/(gamma*delta*mu^2) = L^3/mu^3.
- If gamma*delta = 1, i.e., f is a simple quadratic function, then theta_x ~ 1 - (mu^2/L^2)/(mu/L + 1) ~ 1 - mu^2/L^2. This rate has no acceleration, because the overall condition number is L^2/(gamma*delta*mu^2) = L^2/mu^2.

In summary, the extent of acceleration in the dominating factor theta_x (which determines theta) depends on the relative size of gamma*delta and mu^2/L^2, i.e., on the relative conditioning between the function f and the matrix A. In general, we have full acceleration when gamma*delta is no larger than mu^2/L^2. The theory predicts that the acceleration degrades as the function f becomes better conditioned. However, in our numerical experiments we often observe acceleration even when gamma*delta gets close to 1.

As explained in Chambolle & Pock (2011), Algorithm 1 is equivalent to a preconditioned ADMM. Deng & Yin (2016) characterized various conditions for ADMM to obtain linear convergence, but did not derive the convergence rate for the case we consider in this paper.

Algorithm 2: Adaptive Batch Primal-Dual (Ada-BPD)
    input: problem constants lambda, gamma, delta, L and mu_hat > 0, initial point (x_0, y_0), and adaptation period T.
    Compute sigma, tau, and theta as in (7) and (8) using mu = mu_hat
    for t = 0, 1, 2, ... do
        y_{t+1} = prox_{sigma f*}( y_t + sigma A xbar_t )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T) == 0 then
            (sigma, tau, theta) = BPD-Adapt( {P_s, D_s}_{s=t+1-T}^{t+1} )

Algorithm 3: BPD-Adapt (simple heuristic)
    input: previous estimate mu_hat, adaptation period T, primal and dual objective values {P_s, D_s}_{s=t-T}^{t}
    if P_t - D_t < theta^T (P_{t-T} - D_{t-T}) then mu_hat := 2*mu_hat else mu_hat := mu_hat/2
    Compute sigma, tau, and theta as in (7) and (8) using mu = mu_hat
    output: new parameters (sigma, tau, theta)

2.1. Adaptive batch primal-dual algorithms

In practice, it is often very hard to obtain a good estimate of the problem-dependent constants, especially mu^2 = lambda_min(A^T A), in order to apply the algorithmic parameters specified in Theorem 1. Here we explore heuristics that enable adaptive tuning of such parameters, which often leads to much improved performance in practice.

A key observation is that the convergence rate of the BPD algorithm changes monotonically with the overall strong convexity parameter lambda + delta*mu^2, regardless of the extent of acceleration. In other words, the larger lambda + delta*mu^2 is, the faster the convergence. Therefore, if we can monitor the progress of

the convergence and compare it with the predicted convergence rate in Theorem 1, then we can adjust the estimated parameters to better exploit strong convexity from data. More specifically, if the observed convergence rate is slower than the predicted rate, then we should reduce the estimate of mu; otherwise, we can increase it for faster convergence.

We formalize the above reasoning in Algorithm 2, called Ada-BPD. This algorithm maintains an estimate mu_hat of the true constant mu, and adjusts it every T iterations. We use P_t and D_t to denote the primal and dual objective values P(x_t) and D(y_t), respectively. We give two implementations of the tuning procedure BPD-Adapt. Algorithm 3 is a simple heuristic for tuning the estimate mu_hat, where the increasing and decreasing factor 2 can be changed to other values larger than 1. Algorithm 4 is a more robust heuristic: it does not rely on the specific convergence rate theta established in Theorem 1; instead, it simply compares the current estimate of the objective-reduction rate rho_hat with the previous estimate rho, and it specifies a non-tuning range of changes in rho, given by the interval [c_1, c_2].

Algorithm 4: BPD-Adapt (robust heuristic)
    input: previous rate estimate rho > 0, Delta = delta*mu_hat^2, period T, constants c_1 < 1 and c_2 > 1, and {P_s, D_s}_{s=t-T}^{t}
    Compute the new rate estimate rho_hat = (P_t - D_t) / (P_{t-T} - D_{t-T})
    if rho_hat <= c_1 * rho then Delta := 2*Delta, rho := rho_hat
    else if rho_hat >= c_2 * rho then Delta := Delta/2, rho := rho_hat
    sigma = (1/L) sqrt((lambda + Delta)/gamma),  tau = (1/L) sqrt(gamma/(lambda + Delta))
    Compute theta using (8), or set theta = 1
    output: new parameters (sigma, tau, theta)

The capability of accessing both the primal and dual objective values allows primal-dual algorithms to obtain good estimates of the convergence rate, which enables effective tuning heuristics. Automatic tuning of primal-dual algorithms has also been studied by, e.g., Malitsky & Pock (2016) and Goldstein et al. (2013), but with different goals.

3. Randomized primal-dual algorithm

In this section, we come back to the ERM problem (1) and consider its saddle-point formulation (3).
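As a concrete sanity check of the underlying iteration (Algorithm 1), the sketch below runs the batch primal-dual updates on ridge regression, where both proximal mappings have closed forms. It uses the plain Chambolle-Pock step sizes tau = sigma = 0.9/||A|| rather than the tuned choices in (7), so it illustrates only the mechanics, not the adaptive scheme; all names are our own.

```python
import numpy as np

def bpd_ridge(A, b, lam, iters=5000):
    """Algorithm 1 applied to min_x 0.5*||Ax - b||^2 + (lam/2)*||x||^2."""
    n, d = A.shape
    Lnorm = np.linalg.norm(A, 2)           # L = ||A|| (spectral norm)
    tau = sigma = 0.9 / Lnorm              # plain step sizes with tau*sigma*L^2 < 1
    theta = 1.0
    x = np.zeros(d); xbar = x.copy(); y = np.zeros(n)
    for _ in range(iters):
        # prox of sigma*f* for f(z) = 0.5*||z - b||^2 is u -> (u - sigma*b)/(1 + sigma)
        y = (y + sigma * (A @ xbar) - sigma * b) / (1.0 + sigma)
        # prox of tau*g for g(x) = (lam/2)*||x||^2 is u -> u/(1 + tau*lam)
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 6))
b = rng.standard_normal(30)
lam = 0.1
x_pd = bpd_ridge(A, b, lam)
x_ref = np.linalg.solve(A.T @ A + lam * np.eye(6), A.T @ b)  # closed-form ridge solution
```

Since g is strongly convex and f is smooth here, the iterates converge linearly to the ridge solution even with these untuned step sizes; the adaptive schemes above are about making that rate reflect the data's strong convexity.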
Due to the finite-sum structure over the dual variables y_i, we can develop randomized algorithms that exploit strong convexity from data. In particular, we extend the stochastic primal-dual coordinate (SPDC) algorithm of Zhang & Xiao (2015). SPDC is

Algorithm 5: Adaptive SPDC (Ada-SPDC)
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0), and adaptation period T.
    Set xbar_0 = x_0
    for t = 0, 1, 2, ... do
        pick k in {1, ..., n} uniformly at random
        for i in {1, ..., n} do
            if i == k then y^{(k)}_{t+1} = prox_{sigma phi_k*}( y^{(k)}_t + sigma a_k^T xbar_t )
            else y^{(i)}_{t+1} = y^{(i)}_t
        x_{t+1} = prox_{tau g}( x_t - tau ( u_t + (y^{(k)}_{t+1} - y^{(k)}_t) a_k ) )
        u_{t+1} = u_t + (1/n)(y^{(k)}_{t+1} - y^{(k)}_t) a_k
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T*n) == 0 then
            (tau, sigma, theta) = SPDC-Adapt( {P_{t-sn}, D_{t-sn}}_{s=0}^{T} )

a special case of the Ada-SPDC algorithm given in Algorithm 5, obtained by setting the adaptation period T = infinity (no adaptation). The following theorem is proved in Appendix E.

Theorem 2. Suppose Assumption 1 holds. Let (x*, y*) be the saddle point of the function L defined in (3), and let R = max{||a_1||, ..., ||a_n||}. If we set T = infinity in Algorithm 5 (no adaptation) and let

    tau = (1/(4R)) sqrt(gamma/(n*lambda + delta*mu^2)),    sigma = (1/(4R)) sqrt((n*lambda + delta*mu^2)/gamma),    (9)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(n*sigma + 4*delta)) * 1/(1 + tau*lambda),    theta_y = (1 + ((n-1)/n)(sigma*gamma/2)) / (1 + sigma*gamma/2),    (10)

then we have

    (1/(2*tau) + lambda/2) E[||x_t - x*||^2] + (gamma/4) E[||y_t - y*||^2] <= theta^t C,
    E[ L(x_t, y*) - L(x*, y_t) ] <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + gamma/4) ||y_0 - y*||^2. The expectation E[.] is taken with respect to the history of random indices drawn at each iteration.

Below we give a detailed discussion of the expected convergence rate established in Theorem 2.

The case of delta*mu^2 = 0 but lambda > 0. In this case tau = (1/(4R)) sqrt(gamma/(n*lambda)) and sigma = (1/(4R)) sqrt(n*lambda/gamma), and

    theta_x = 1/(1 + tau*lambda) with 1/(tau*lambda) = 4R sqrt(n/(lambda*gamma)),    theta_y = 1 - 1/(n + 8R sqrt(n/(lambda*gamma))).

Hence theta = theta_y. These recover the parameters and the convergence rate of the standard SPDC (Zhang & Xiao, 2015).

The case of delta*mu^2 > 0 but lambda = 0. In this case we have tau = (1/(4R*mu)) sqrt(gamma/delta) and sigma = (mu/(4R)) sqrt(delta/gamma), and

    theta_x = 1 - (tau*sigma*delta*mu^2)/(n*sigma + 4*delta) = 1 - (gamma*delta*mu^2/(16R^2)) / ( n*sqrt(gamma*delta)*mu/(4R) + 4*gamma*delta ),
    theta_y = 1 - (1/n) * (sqrt(gamma*delta)*mu/(8R)) / (1 + sqrt(gamma*delta)*mu/(8R)).

Since the objective is (R^2/gamma)-smooth and (delta*mu^2/n)-strongly convex, theta_y is an accelerated rate if sqrt(gamma*delta)*mu <= 8R; otherwise theta_y ~ 1 - 1/(2n). For theta_x, we consider different situations:

- If mu is close to R, then theta_x is an accelerated rate, and so is theta = max{theta_x, theta_y}.
- If mu < R and gamma*delta is no larger than mu^2/R^2, then theta_x still represents an accelerated rate. The iteration complexity of SPDC in this case is Otilde(n + n R/(sqrt(gamma*delta)*mu)), which is better than that of SVRG, namely Otilde(n + n R^2/(gamma*delta*mu^2)).
- If mu < R and gamma*delta is on the order of mu/R, then theta_x gives a half-accelerated rate: SVRG requires Otilde(n R^3/mu^3) iterations in this case, versus Otilde(n R^2/mu^2) for SPDC.
- If mu < R and gamma*delta is close to 1, meaning the phi_i are well conditioned, then theta_x gives a non-accelerated rate, and the corresponding iteration complexity is the same as that of SVRG.

3.1. Parameter adaptation for SPDC

The SPDC-Adapt procedure called in Algorithm 5 follows the same logic as the batch adaptation schemes in Algorithms 3 and 4, and we omit the details here. One thing we emphasize is that the adaptation period T is measured in epochs, i.e., in numbers of passes over the data. In addition, we only compute the primal and dual objective values after each pass (or every few passes), because computing them exactly usually requires a full pass over the data.

Unlike the batch case, where the duality gap decreases monotonically, the duality gap of randomized algorithms can fluctuate wildly. So instead of using only the two end values P_{t-Tn} - D_{t-Tn} and P_t - D_t, we can use more points to estimate the convergence rate through a linear regression. Suppose the primal-dual objective values for the last T+1 passes are (P(0), D(0)), (P(1), D(1)), ..., (P(T), D(T)), and we need to estimate a per-pass rate rho such that

    P(t) - D(t) ~ rho^t ( P(0) - D(0) ),    t = 1, ..., T.
We can turn this into a linear regression problem after taking logarithms, and obtain the estimate rho_hat through

    log rho_hat = (1/(1 + 2 + ... + T)) sum_{t=1}^{T} log( (P(t) - D(t)) / (P(0) - D(0)) ).

Algorithm 6: Dual-Free BPD
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0)
    Set xbar_0 = x_0 and v_0 = grad f*(y_0)
    for t = 0, 1, 2, ... do
        v_{t+1} = ( v_t + sigma A xbar_t ) / (1 + sigma),    y_{t+1} = grad f( v_{t+1} )
        x_{t+1} = prox_{tau g}( x_t - tau A^T y_{t+1} )
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)

4. Dual-free primal-dual algorithms

Compared with primal algorithms, one major disadvantage of primal-dual algorithms is the requirement of computing the proximal mapping of the dual function f* (or phi_i*), which may not admit a closed-form solution or efficient computation. This is especially the case for logistic regression, one of the most popular loss functions used in classification.

Lan & Zhou (2015) developed dual-free variants of primal-dual algorithms that avoid computing the dual proximal mapping. Their main technique is to replace the Euclidean distance in the dual proximal mapping with a Bregman divergence defined over the dual loss function itself. We show how to apply this approach to solve the structured ERM problems considered in this paper. The resulting algorithms can also exploit strong convexity from data if the algorithmic parameters are set appropriately or adapted automatically.

4.1. Dual-free BPD algorithm

First, we consider the batch setting. We replace the dual proximal mapping (computing y_{t+1}) in Algorithm 1 with

    y_{t+1} = arg min_y { f*(y) - y^T A xbar_t + (1/sigma) D(y, y_t) },    (11)

where D is the Bregman divergence of a strictly convex kernel function h, defined as

    D_h(y, y_t) = h(y) - h(y_t) - <grad h(y_t), y - y_t>.

Algorithm 1 is obtained in the Euclidean setting with h(y) = (1/2)||y||^2 and D(y, y_t) = (1/2)||y - y_t||^2. Here we use f* as the kernel function, and show that this allows us to compute y_{t+1} in (11) very efficiently. The following lemma explains the details (cf. Lan & Zhou, 2015, Lemma 1).

Lemma 1. Let the kernel be h = f* in the Bregman divergence D. If we construct a sequence of vectors {v_t} such that v_0 = grad f*(y_0) and, for all t >= 0,

    v_{t+1} = ( v_t + sigma A xbar_t ) / (1 + sigma),    (12)

then the solution to problem (11) is y_{t+1} = grad f( v_{t+1} ).

Proof.
Suppose v_t = grad f*(y_t), which is true for t = 0. Then D(y, y_t) = f*(y) - f*(y_t) - v_t^T (y - y_t).

The solution to (11) can then be written as

    y_{t+1} = arg min_y { f*(y) - y^T A xbar_t + (1/sigma)( f*(y) - v_t^T y ) }
            = arg min_y { (1 + 1/sigma) f*(y) - y^T ( A xbar_t + (1/sigma) v_t ) }
            = arg max_y { y^T (v_t + sigma A xbar_t)/(1 + sigma) - f*(y) }
            = arg max_y { y^T v_{t+1} - f*(y) }
            = grad f( v_{t+1} ),

where in the last equality we used the property of conjugate functions when f is strongly convex and smooth. Moreover, v_{t+1} = grad f*( grad f(v_{t+1}) ) = grad f*( y_{t+1} ), which completes the proof.

According to Lemma 1, we only need to provide an initial point such that v_0 = grad f*(y_0) is easy to compute. We do not need to compute grad f*(y_t) directly for any t > 0, because it can be updated as v_t in (12). Consequently, we can update y_t in the BPD algorithm using the gradient grad f(v_t), without the need of the dual proximal mapping. The resulting dual-free algorithm is given in Algorithm 6.

Theorem 3. Suppose Assumption 2 holds and let (x*, y*) be the unique saddle point of L defined in (6). If we set the parameters in Algorithm 6 as

    tau = (1/L) sqrt(gamma/(lambda + delta*mu^2)),    sigma = (1/L) sqrt(gamma*(lambda + delta*mu^2)),    (13)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(4(1 + sigma))) * 1/(1 + tau*lambda),    theta_y = 1/(1 + sigma/2),    (14)

then we have

    (1/(2*tau) + lambda/2) ||x_t - x*||^2 + D(y*, y_t) <= theta^t C,
    L(x_t, y*) - L(x*, y_t) <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + 1) D(y*, y_0).

Theorem 3 is proved in Appendices B and D. Assuming gamma*(lambda + delta*mu^2) <= L^2, we have

    theta_x <= 1 - gamma*delta*mu^2/(16 L^2) - (lambda/(2L)) sqrt(gamma/(lambda + delta*mu^2)),    theta_y <= 1 - sqrt(gamma*(lambda + delta*mu^2))/(4L).

Again, we gain insight by considering the special cases:

- If delta*mu^2 = 0 and lambda > 0, then theta_y ~ 1 - sqrt(gamma*lambda)/(4L) and theta_x ~ 1 - sqrt(gamma*lambda)/L. So theta = max{theta_x, theta_y} is an accelerated rate.
- If delta*mu^2 > 0 and lambda = 0, then theta_y ~ 1 - sqrt(gamma*delta)*mu/(4L) and theta_x ~ 1 - gamma*delta*mu^2/(16 L^2). Thus theta = max{theta_x, theta_y} ~ 1 - gamma*delta*mu^2/(16 L^2), which is not accelerated. This conclusion does not depend on the relative sizes of gamma*delta and mu^2/L^2, and this is the major difference from the Euclidean case in Section 2.
- If both delta*mu^2 > 0 and lambda > 0, then the extent of acceleration depends on their relative size. If lambda is on the same order as delta*mu^2 or larger, then an accelerated rate is obtained. If lambda is much smaller than delta*mu^2, then the theory predicts no acceleration.

Algorithm 7: Adaptive Dual-Free SPDC (ADF-SPDC)
    input: parameters sigma, tau, theta > 0, initial point (x_0, y_0), and adaptation period T.
    Set xbar_0 = x_0 and v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ) for i = 1, ..., n
    for t = 0, 1, 2, ... do
        pick k in {1, ..., n} uniformly at random
        for i in {1, ..., n} do
            if i == k then
                v^{(k)}_{t+1} = ( v^{(k)}_t + sigma a_k^T xbar_t ) / (1 + sigma),    y^{(k)}_{t+1} = grad phi_k( v^{(k)}_{t+1} )
            else
                v^{(i)}_{t+1} = v^{(i)}_t,    y^{(i)}_{t+1} = y^{(i)}_t
        x_{t+1} = prox_{tau g}( x_t - tau ( u_t + (y^{(k)}_{t+1} - y^{(k)}_t) a_k ) )
        u_{t+1} = u_t + (1/n)(y^{(k)}_{t+1} - y^{(k)}_t) a_k
        xbar_{t+1} = x_{t+1} + theta (x_{t+1} - x_t)
        if mod(t+1, T*n) == 0 then
            (tau, sigma, theta) = SPDC-Adapt( {P_{t-sn}, D_{t-sn}}_{s=0}^{T} )

4.2. Dual-free SPDC algorithm

Following the same approach, we can derive an adaptive dual-free SPDC algorithm, given in Algorithm 7. On related work, Shalev-Shwartz & Zhang (2016) and Shalev-Shwartz (2016) introduced dual-free SDCA. The following theorem characterizes the choice of algorithmic parameters that can exploit strong convexity from data to achieve linear convergence (the proof is given in Appendix F).

Theorem 4. Suppose Assumption 1 holds. Let (x*, y*) be the saddle point of L in (3), and R = max{||a_1||, ..., ||a_n||}. If we set T = infinity in Algorithm 7 (no adaptation) and let

    sigma = (1/(4R)) sqrt(gamma*(n*lambda + delta*mu^2)),    tau = (1/(4R)) sqrt(gamma/(n*lambda + delta*mu^2)),    (15)

and theta = max{theta_x, theta_y}, where

    theta_x = (1 - tau*sigma*delta*mu^2/(4n(1 + sigma))) * 1/(1 + tau*lambda),    theta_y = (1 + ((n-1)/n)(sigma/2)) / (1 + sigma/2),    (16)

then we have

    (1/(2*tau) + lambda/2) E[||x_t - x*||^2] + (gamma/4) E[ D(y*, y_t) ] <= theta^t C,
    E[ L(x_t, y*) - L(x*, y_t) ] <= theta^t C,

where C = (1/(2*tau) + lambda/2) ||x_0 - x*||^2 + (1/(2*sigma) + 1) D(y*, y_0).
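To illustrate what "dual-free" buys in practice, the sketch below instantiates the batch variant (Algorithm 6) for regularized logistic regression: the dual update reduces to evaluating the logistic gradient y_{t+1} = grad f(v_{t+1}), and no conjugate proximal mapping is needed. The step sizes and iteration count are hypothetical choices for a tiny problem, not the tuned values of Theorem 3; all names are our own.

```python
import numpy as np

def grad_f(v, b, n):
    # f(z) = (1/n) * sum_i log(1 + exp(-b_i * z_i)); componentwise gradient
    return -b / (n * (1.0 + np.exp(b * v)))

def dual_free_bpd_logistic(A, b, lam, sigma=1.0, tau=0.05, theta=1.0, iters=20000):
    n, d = A.shape
    x = np.zeros(d); xbar = x.copy()
    v = np.zeros(n)   # v_0 = grad f*(y_0) = 0 for the choice y_0 = grad f(0)
    for _ in range(iters):
        v = (v + sigma * (A @ xbar)) / (1.0 + sigma)       # Bregman step (12)
        y = grad_f(v, b, n)                                # y_{t+1} = grad f(v_{t+1})
        x_new = (x - tau * (A.T @ y)) / (1.0 + tau * lam)  # prox of tau*g, g = (lam/2)||x||^2
        xbar = x_new + theta * (x_new - x)
        x = x_new
    return x

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
b = rng.choice([-1.0, 1.0], size=20)
lam = 0.1
x_hat = dual_free_bpd_logistic(A, b, lam)
# At a stationary point of P(x) = f(Ax) + (lam/2)||x||^2:  A^T grad_f(A x) + lam*x = 0
grad_P = A.T @ grad_f(A @ x_hat, b, 20) + lam * x_hat
```

The fixed point of these updates satisfies v = A x and A^T grad f(Ax) + lam*x = 0, i.e., it is exactly a stationary point of the primal objective, which is what the gradient check above verifies.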

[Figure 1: Comparison of batch primal-dual algorithms (Primal AG, BPD, Opt-BPD, Ada-BPD) on a ridge regression problem with n = 5000 and d = 3000; panels show the primal optimality gap on synthetic data for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

[Figure 2: Comparison of randomized algorithms (SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC) for ridge regression on the cpuact data, for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

Now we discuss the results of Theorem 4 in further detail.

The case of delta*mu^2 = 0 but lambda > 0. In this case tau = (1/(4R)) sqrt(gamma/(n*lambda)) and sigma = (1/(4R)) sqrt(n*gamma*lambda), and

    theta_x = 1/(1 + tau*lambda) with 1/(tau*lambda) = 4R sqrt(n/(lambda*gamma)),    theta_y = 1 - 1/(n + 8R sqrt(n/(lambda*gamma))).

The rate is the same as for SPDC in Zhang & Xiao (2015).

The case of delta*mu^2 > 0 but lambda = 0. In this case tau = (1/(4R*mu)) sqrt(gamma/delta) and sigma = (mu/(4R)) sqrt(delta*gamma), and thus

    theta_x = 1 - (gamma*delta*mu^2/(16R^2)) / ( 4n(1 + sigma) ).

We note that the primal function is now (R^2/gamma)-smooth and (delta*mu^2/n)-strongly convex. We discuss the following cases:

- If sqrt(gamma*delta)*mu is larger than R, then theta_x ~ 1 - sqrt(gamma*delta)*mu/(8Rn) and theta_y ~ 1 - 1/(2n). Therefore theta = max{theta_x, theta_y} = 1 - Theta(1/n).
- Otherwise, theta_x ~ 1 - gamma*delta*mu^2/(64 n R^2) and theta_y is of the same order. This is not an accelerated rate, and we have the same iteration complexity as SVRG.

Finally, we give concrete examples of how to compute initial points y_0 and v_0 such that v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ). For the squared loss, phi_i(alpha) = (1/2)(alpha - b_i)^2 and phi_i*(beta) = (1/2)beta^2 + b_i*beta, so v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ) = y^{(i)}_0 + b_i. For logistic regression, we have b_i in {+1, -1} and phi_i(alpha) = log(1 + exp(-b_i*alpha)). The conjugate function is

    phi_i*(beta) = (-b_i*beta) log(-b_i*beta) + (1 + b_i*beta) log(1 + b_i*beta)  if b_i*beta in [-1, 0], and +infinity otherwise.

We can choose y^{(i)}_0 = -b_i/2 and v^{(i)}_0 = 0, so that v^{(i)}_0 = grad phi_i*( y^{(i)}_0 ). For logistic regression, we have delta = 0 over the full domain of phi_i. However, each phi_i is locally strongly convex on a bounded domain (Bach, 2014): if z in [-B, B], then delta = min_z phi_i''(z) >= exp(-B)/4. Therefore it is well suited for an adaptation scheme, similar to Algorithm 4, that does not require knowledge of either delta or mu.

5. Preliminary experiments

We present preliminary experiments to demonstrate the effectiveness of our proposed algorithms.
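The logistic-regression initialization described above (y_0^{(i)} = -b_i/2 with v_0^{(i)} = 0) can be checked numerically: the derivative of the conjugate phi_i* vanishes at beta = -b_i/2, so grad phi_i*(y_0^{(i)}) = 0 = v_0^{(i)}. A small self-contained verification using a finite-difference derivative (our own illustration):

```python
import numpy as np

def phi_star(beta, b):
    """Conjugate of phi(alpha) = log(1 + exp(-b*alpha)), finite for b*beta in [-1, 0]."""
    t = -b * beta                          # t in [0, 1]
    def xlogx(u):                          # convention: 0 * log(0) = 0
        return 0.0 if u == 0.0 else u * np.log(u)
    return xlogx(t) + xlogx(1.0 - t)       # t*log(t) + (1-t)*log(1-t)

derivs = {}
for b in (-1.0, 1.0):
    beta0 = -b / 2.0                       # proposed initial dual point y_0
    eps = 1e-6
    # centered finite-difference derivative of phi* at beta0; should be ~0
    derivs[b] = (phi_star(beta0 + eps, b) - phi_star(beta0 - eps, b)) / (2 * eps)
```

The minimum of t*log(t) + (1-t)*log(1-t) sits at t = 1/2, i.e., at beta = -b/2, which is exactly why this pair (y_0, v_0) is consistent with v_0 = grad phi*(y_0).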
First, we consider batch primal-dual algorithms for ridge regression on a synthetic dataset. The data matrix A has n = 5000 rows and d = 3000 columns, and its entries are sampled from a multivariate normal distribution with mean zero and a covariance matrix whose (i, j) entry decays with |i - j|. We normalize all datasets

such that a_i := a_i / max_j ||a_j||, to ensure that the maximum norm of the data points is 1. We use l2 regularization g(x) = (lambda/2)||x||^2 with three choices of the parameter lambda: 1/n, 10^{-2}/n and 10^{-4}/n, which represent strong, medium, and weak levels of regularization, respectively. Figure 1 shows the performance of four different algorithms: the primal accelerated gradient (Primal AG) algorithm (Nesterov, 2004) using lambda as the strong convexity parameter; the BPD algorithm (Algorithm 1) using the same lambda and delta*mu^2 = 0; the optimal BPD algorithm (Opt-BPD), which uses mu^2 = lambda_min(A^T A) computed from the data; and the Ada-BPD algorithm (Algorithm 2) with the robust adaptation heuristic (Algorithm 4), using T = 10, c_1 = 0.95 and c_2 = 1.5. As expected, the performance of Primal AG is very similar to that of BPD, and Opt-BPD has the fastest convergence. The Ada-BPD algorithm can partially exploit strong convexity from data without knowledge of mu.

Next we compare DF-SPDC (the dual-free SPDC of Algorithm 7 without adaptation) and ADF-SPDC (Algorithm 7 with adaptation) against several state-of-the-art randomized algorithms for ERM: SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), Katyusha (Allen-Zhu, 2016) and the standard SPDC method (Zhang & Xiao, 2015). For SVRG and Katyusha (an accelerated variant of SVRG), we choose the variance-reduction period as m = 2n. The step sizes of all algorithms are set as suggested in their original papers. For Ada-SPDC and ADF-SPDC, we use the robust adaptation scheme with T = 10, c_1 = 0.95 and c_2 = 1.5.

We first compare these randomized algorithms for ridge regression on the cpuact data from the LibSVM website (cjlin/libsvm/). The results are shown in Figure 2. With relatively strong regularization, lambda = 1/n, all methods perform similarly, as predicted by theory. When lambda becomes smaller, the non-accelerated algorithms SVRG and SAGA automatically exploit strong convexity from data, so they become faster than the non-adaptive accelerated methods (Katyusha, SPDC and DF-SPDC). The adaptive accelerated method ADF-SPDC has the fastest convergence. This indicates that our theoretical results, which predict no acceleration in this case, may be further improved.

Finally, we compare these algorithms for logistic regression on the rcv1 dataset from the LibSVM website and on another synthetic dataset with n = 5000 and d = 500, generated similarly as before but with a more slowly decaying covariance. For the standard SPDC, we compute the coordinate-wise dual proximal mapping using a few steps of a scalar Newton's method to high precision. The dual-free SPDC algorithms only use gradients of the logistic function. The results are presented in Figure 3. For both datasets, the strong convexity from data is very weak, and the accelerated algorithms perform better.

[Figure 3: Comparison of randomized algorithms (SVRG, SAGA, Katyusha, SPDC, DF-SPDC, ADF-SPDC) for logistic regression on synthetic and rcv1 data, for lambda = 1/n, 10^{-2}/n and 10^{-4}/n.]

References

Allen-Zhu, Zeyuan. Katyusha: Accelerated variance reduction for faster SGD. arXiv e-print, 2016.

Bach, Francis. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15:595-627, 2014.

Balamurugan, Palaniappan and Bach, Francis. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems (NIPS) 29, 2016.

Bertsekas, Dimitri P. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. In Sra, Suvrit, Nowozin, Sebastian, and Wright, Stephen J. (eds.), Optimization for Machine Learning, chapter 4. MIT Press, 2011.

Chambolle, Antonin and Pock, Thomas. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

Chambolle, Antonin and Pock, Thomas. On the ergodic convergence rates of a first-order primal-dual algorithm. Mathematical Programming, Series A, 159:253-287, 2016.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Deng, Wei and Yin, Wotao. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3), 2016.

Fercoq, Olivier and Richtarik, Peter. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997-2023, 2015.

Goldstein, Tom, Li, Min, Yuan, Xiaoming, Esser, Ernie, and Baraniuk, Richard. Adaptive primal-dual hybrid gradient methods for saddle-point problems. arXiv preprint, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Lan, Guanghui and Zhou, Yi. An optimal randomized incremental gradient method. arXiv preprint, 2015.

Lin, Hongzhou, Mairal, Julien, and Harchaoui, Zaid. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, 2015a.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated randomized proximal coordinate gradient method and its application to regularized empirical risk minimization. SIAM Journal on Optimization, 25(4):2244-2273, 2015b.

Malitsky, Yura and Pock, Thomas. A first-order primal-dual algorithm with linesearch. arXiv preprint, 2016.

Nedic, Angelia and Bertsekas, Dimitri P. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109-138, 2001.

Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston, 2004.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012.

Richtarik, Peter and Takac, Martin. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Shalev-Shwartz, Shai. SDCA without duality, regularization, and individual convexity. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567-599, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1-2):105-145, 2016.

Xiao, Lin and Zhang, Tong. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057-2075, 2014.

Zhang, Yuchen and Xiao, Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.


More information

A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality A. Derivation of regularized ERM dualit For completeness, in this section we derive the dual 5 to the problem of computing proximal operator for the ERM objective 3. We can rewrite the primal problem as

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Lecture 3: Minimizing Large Sums. Peter Richtárik

Lecture 3: Minimizing Large Sums. Peter Richtárik Lecture 3: Minimizing Large Sums Peter Richtárik Graduate School in Systems, Op@miza@on, Control and Networks Belgium 2015 Mo@va@on: Machine Learning & Empirical Risk Minimiza@on Training Linear Predictors

More information

Nesterov s Acceleration

Nesterov s Acceleration Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity

Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

ONLINE VARIANCE-REDUCING OPTIMIZATION

ONLINE VARIANCE-REDUCING OPTIMIZATION ONLINE VARIANCE-REDUCING OPTIMIZATION Nicolas Le Roux Google Brain nlr@google.com Reza Babanezhad University of British Columbia rezababa@cs.ubc.ca Pierre-Antoine Manzagol Google Brain manzagop@google.com

More information

arxiv: v1 [math.oc] 5 Feb 2018

arxiv: v1 [math.oc] 5 Feb 2018 for Convex-Concave Saddle Point Problems without Strong Convexity Simon S. Du * 1 Wei Hu * arxiv:180.01504v1 math.oc] 5 Feb 018 Abstract We consider the convex-concave saddle point problem x max y fx +

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Coordinate descent methods

Coordinate descent methods Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5

More information

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization

Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Doubly Accelerated Stochastic Variance Reduced Dual Averaging Method for Regularized Empirical Risk Minimization Tomoya Murata Taiji Suzuki More recently, several authors have proposed accelerarxiv:70300439v4

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

A Universal Catalyst for Gradient-Based Optimization

A Universal Catalyst for Gradient-Based Optimization A Universal Catalyst for Gradient-Based Optimization Julien Mairal Inria, Grenoble CIMI workshop, Toulouse, 2015 Julien Mairal, Inria Catalyst 1/58 Collaborators Hongzhou Lin Zaid Harchaoui Publication

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization

On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization On Stochastic Primal-Dual Hybrid Gradient Approach for Compositely Regularized Minimization Linbo Qiao, and Tianyi Lin 3 and Yu-Gang Jiang and Fan Yang 5 and Wei Liu 6 and Xicheng Lu, Abstract We consider

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19

Journal Club. A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) March 8th, CMAP, Ecole Polytechnique 1/19 Journal Club A Universal Catalyst for First-Order Optimization (H. Lin, J. Mairal and Z. Harchaoui) CMAP, Ecole Polytechnique March 8th, 2018 1/19 Plan 1 Motivations 2 Existing Acceleration Methods 3 Universal

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms

On the Iteration Complexity of Oblivious First-Order Optimization Algorithms On the Iteration Complexity of Oblivious First-Order Optimization Algorithms Yossi Arjevani Weizmann Institute of Science, Rehovot 7610001, Israel Ohad Shamir Weizmann Institute of Science, Rehovot 7610001,

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization

Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Limitations on Variance-Reduction and Acceleration Schemes for Finite Sum Optimization Yossi Arjevani Department of Computer Science and Applied Mathematics Weizmann Institute of Science Rehovot 7610001,

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Coordinate Descent Faceoff: Primal or Dual?

Coordinate Descent Faceoff: Primal or Dual? JMLR: Workshop and Conference Proceedings 83:1 22, 2018 Algorithmic Learning Theory 2018 Coordinate Descent Faceoff: Primal or Dual? Dominik Csiba peter.richtarik@ed.ac.uk School of Mathematics University

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions

Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization

SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Stochastic Dual Coordinate Ascent with Adaptive Probabilities

Stochastic Dual Coordinate Ascent with Adaptive Probabilities Dominik Csiba Zheng Qu Peter Richtárik University of Edinburgh CDOMINIK@GMAIL.COM ZHENG.QU@ED.AC.UK PETER.RICHTARIK@ED.AC.UK Abstract This paper introduces AdaSDCA: an adaptive variant of stochastic dual

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Adaptive Primal Dual Optimization for Image Processing and Learning

Adaptive Primal Dual Optimization for Image Processing and Learning Adaptive Primal Dual Optimization for Image Processing and Learning Tom Goldstein Rice University tag7@rice.edu Ernie Esser University of British Columbia eesser@eos.ubc.ca Richard Baraniuk Rice University

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

SEGA: Variance Reduction via Gradient Sketching

SEGA: Variance Reduction via Gradient Sketching SEGA: Variance Reduction via Gradient Sketching Filip Hanzely 1 Konstantin Mishchenko 1 Peter Richtárik 1,2,3 1 King Abdullah University of Science and Technology, 2 University of Edinburgh, 3 Moscow Institute

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning

Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Incremental and Stochastic Majorization-Minimization Algorithms for Large-Scale Machine Learning Julien Mairal Inria, LEAR Team, Grenoble Journées MAS, Toulouse Julien Mairal Incremental and Stochastic

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand,

MACHINE learning and optimization theory have enjoyed a fruitful symbiosis over the last decade. On the one hand, 1 Analysis and Implementation of an Asynchronous Optimization Algorithm for the Parameter Server Arda Aytein Student Member, IEEE, Hamid Reza Feyzmahdavian, Student Member, IEEE, and Miael Johansson Member,

More information

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods

Katyusha: The First Direct Acceleration of Stochastic Gradient Methods Journal of Machine Learning Research 8 8-5 Submitted 8/6; Revised 5/7; Published 6/8 : The First Direct Acceleration of Stochastic Gradient Methods Zeyuan Allen-Zhu Microsoft Research AI Redmond, WA 985,

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade

Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for

More information

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem

In applications, we encounter many constrained optimization problems. Examples Basis pursuit: exact sparse recovery problem 1 Conve Analsis Main references: Vandenberghe UCLA): EECS236C - Optimiation methods for large scale sstems, http://www.seas.ucla.edu/ vandenbe/ee236c.html Parikh and Bod, Proimal algorithms, slides and

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

The basic problem of interest in this paper is the convex programming (CP) problem given by

The basic problem of interest in this paper is the convex programming (CP) problem given by Noname manuscript No. (will be inserted by the editor) An optimal randomized incremental gradient method Guanghui Lan Yi Zhou the date of receipt and acceptance should be inserted later arxiv:1507.02000v3

More information

Adaptive Probabilities in Stochastic Optimization Algorithms

Adaptive Probabilities in Stochastic Optimization Algorithms Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights

More information

arxiv: v2 [cs.na] 3 Apr 2018

arxiv: v2 [cs.na] 3 Apr 2018 Stochastic variance reduced multiplicative update for nonnegative matrix factorization Hiroyui Kasai April 5, 208 arxiv:70.078v2 [cs.a] 3 Apr 208 Abstract onnegative matrix factorization (MF), a dimensionality

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Primal-dual coordinate descent

Primal-dual coordinate descent Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Fast-and-Light Stochastic ADMM

Fast-and-Light Stochastic ADMM Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-6) Fast-and-Light Stochastic ADMM Shuai Zheng James T. Kwok Department of Computer Science and Engineering

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,

More information

Simple Optimization, Bigger Models, and Faster Learning. Niao He

Simple Optimization, Bigger Models, and Faster Learning. Niao He Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization

Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Stochastic gradient methods for machine learning
