A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality

For completeness, in this section we derive the dual (5) to the problem of computing the proximal operator for the ERM objective (3). We can rewrite the primal problem as
\[
\min_{x \in \mathbb{R}^d,\, z \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2
\quad \text{subject to } z_i = a_i^T x, \ \text{for } i = 1, \dots, n.
\]
By convex duality, this is equivalent to the Lagrangian saddle-point problem
\[
\min_{x,\{z_i\}} \max_{y \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z),
\]
where the Lagrangian is defined as
\[
L(x, y, z) = \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z).
\]
Exchanging the min and the max,
\[
\begin{aligned}
&\max_{y \in \mathbb{R}^n} \min_{x \in \mathbb{R}^d,\, z \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z) \\
&\quad= \max_y \left\{ \sum_{i=1}^n \min_{z_i} \left\{ \phi_i(z_i) - y_i z_i \right\} + \min_x \left\{ \frac{\lambda}{2}\|x - s\|^2 + y^T A x \right\} \right\} \\
&\quad= \max_y \left\{ -\sum_{i=1}^n \max_{z_i} \left\{ y_i z_i - \phi_i(z_i) \right\} - \max_x \left\{ -y^T A x - \frac{\lambda}{2}\|x - s\|^2 \right\} \right\} \\
&\quad= \max_y \left\{ -\sum_{i=1}^n \phi_i^*(y_i) - \frac{1}{2\lambda}\|A^T y\|^2 + s^T A^T y \right\} \\
&\quad= -\min_y \left\{ \sum_{i=1}^n \phi_i^*(y_i) + \frac{1}{2\lambda}\|A^T y\|^2 - s^T A^T y \right\}.
\end{aligned}
\]
The final negated problem is precisely the dual formulation (5).

The dual-to-primal mapping (6) and primal-to-dual mapping (7) are implied by the KKT conditions under $L$, and can be derived by solving for $x$, $y$, and $z$ in the system $\nabla L(x, y, z) = 0$. The duality gap in this context is defined as
\[
\mathrm{gap}_{s,\lambda}(x, y) = f_{s,\lambda}(x) + g_{s,\lambda}(y). \tag{9}
\]
Strong duality dictates that $\mathrm{gap}_{s,\lambda}(x, y) \ge 0$ for all $x \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, with equality attained when $x$ is primal-optimal and $y$ is dual-optimal.

B. Additional lemmas

This section establishes technical lemmas, useful in statements and proofs throughout the paper.

Lemma B.1 (Standard bounds for smooth, strongly convex functions). Let $f : \mathbb{R}^k \to \mathbb{R}$ be differentiable and let $x \in \mathbb{R}^k$. Furthermore, let $x^{\mathrm{opt}}$ be a minimizer of $f$. If $f$ is $L$-smooth then
\[
\frac{1}{2L}\|\nabla f(x)\|^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{L}{2}\|x - x^{\mathrm{opt}}\|^2.
\]

If $f$ is $\mu$-strongly convex we have
\[
\frac{\mu}{2}\|x - x^{\mathrm{opt}}\|^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{1}{2\mu}\|\nabla f(x)\|^2.
\]
Proof. Apply the definition of smoothness and strong convexity at the points $x$ and $x^{\mathrm{opt}}$ and minimize the resulting quadratic form.

Lemma B.2 (Errors for primal-dual pairs). Consider $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (6), and (7), respectively. Then for all $y \in \mathbb{R}^n$,
\[
f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}} \le \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2}\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
Furthermore let $x^{\mathrm{opt}}_{s,\lambda} = \operatorname{argmin}_x f_{s,\lambda}(x)$ and $y^{\mathrm{opt}}_{s,\lambda} = \operatorname{argmin}_y g_{s,\lambda}(y)$. Then for all $y \in \mathbb{R}^n$,
\[
\|\hat{x}_{s,\lambda}(y) - x^{\mathrm{opt}}_{s,\lambda}\| \le \frac{\sqrt{n}\,R}{\lambda}\|y - y^{\mathrm{opt}}_{s,\lambda}\|.
\]
Proof. Because $F$ is $nR^2 L$-smooth, $f_{s,\lambda}$ is $(nR^2 L + \lambda)$-smooth. Consequently, for all $x \in \mathbb{R}^d$ we have
\[
f_{s,\lambda}(x) - f_{s,\lambda}^{\mathrm{opt}} \le \frac{nR^2 L + \lambda}{2}\|x - x^{\mathrm{opt}}_{s,\lambda}\|^2.
\]
Since we know that $x^{\mathrm{opt}}_{s,\lambda} = s - \frac{1}{\lambda}A^T y^{\mathrm{opt}}_{s,\lambda}$ and $AA^T \preceq nR^2 I$,
\[
f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}}
\le \frac{nR^2 L + \lambda}{2}\left\| \frac{1}{\lambda}A^T\left( y - y^{\mathrm{opt}}_{s,\lambda} \right) \right\|^2
\le \frac{nR^2(nR^2 L + \lambda)}{2\lambda^2}\|y - y^{\mathrm{opt}}_{s,\lambda}\|^2, \tag{10}
\]
which also proves the second claim, as $\hat{x}_{s,\lambda}(y) - x^{\mathrm{opt}}_{s,\lambda} = \frac{1}{\lambda}A^T(y^{\mathrm{opt}}_{s,\lambda} - y)$. Finally, since each $\phi_i^*$ is $1/L$-strongly convex, $G$ is $n/L$-strongly convex and hence so is $g_{s,\lambda}$. Therefore,
\[
\frac{n}{2L}\|y - y^{\mathrm{opt}}_{s,\lambda}\|^2 \le g_{s,\lambda}(y) - g_{s,\lambda}(y^{\mathrm{opt}}_{s,\lambda}).
\]
Substituting in (10) yields the result.

By adding the dual error to both sides of the inequality in Lemma B.2, we obtain the following corollary.

Corollary B.3 (Dual error bounds gap). Consider $g_{s,\lambda}$, $\mathrm{gap}_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (5), (9), (6), and (7), respectively. For all $s \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$,
\[
\mathrm{gap}_{s,\lambda}(\hat{x}_{s,\lambda}(y), y) \le \left( 1 + \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2} \right)\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
Another corollary arises by combining Lemmas B.1 and B.2.

Corollary B.4 (Dual error bounds primal gradient). Consider $f_{s,\lambda}$, $g_{s,\lambda}$, and $\hat{x}_{s,\lambda}(\cdot)$, as defined in (3), (5), and (6), respectively. For all $s \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$,
\[
\|\nabla f_{s,\lambda}(\hat{x}_{s,\lambda}(y))\|^2 \le 2(nR^2 L + \lambda)\left( f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}} \right) \le \frac{2R^2 L (nR^2 L + \lambda)^2}{\lambda^2}\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
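The duality derived in Section A can be sanity-checked numerically on a small instance. The following is a minimal sketch, assuming squared losses $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ (whose conjugate is $\phi_i^*(u) = \frac{1}{2}u^2 + b_i u$); the dimensions, data, and tolerances are illustrative assumptions, not values from the paper. It checks strong duality ($f^{\mathrm{opt}} + g^{\mathrm{opt}} = 0$), the dual-to-primal mapping $\hat{x}(y) = s - \frac{1}{\lambda}A^T y$, and nonnegativity of the gap at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 8, 3, 0.7
A = rng.normal(size=(n, d))          # rows are a_i^T
b = rng.normal(size=n)
s = rng.normal(size=d)

# Primal: min_x sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)||x - s||^2
x_opt = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b + lam * s)

def f_primal(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum((x - s) ** 2)

# Dual: min_y sum_i phi_i*(y_i) + (1/(2 lam))||A^T y||^2 - s^T A^T y
def g_dual(y):
    return (0.5 * np.sum(y ** 2) + b @ y
            + np.sum((A.T @ y) ** 2) / (2 * lam) - s @ (A.T @ y))

# Dual stationarity: y + b + (1/lam) A A^T y - A s = 0
y_opt = np.linalg.solve(np.eye(n) + A @ A.T / lam, A @ s - b)

# Strong duality: primal optimum plus (negated-dual) optimum is zero
assert abs(f_primal(x_opt) + g_dual(y_opt)) < 1e-8
# Dual-to-primal mapping recovers the primal minimizer
assert np.allclose(s - A.T @ y_opt / lam, x_opt)
# Weak duality: the gap f + g is nonnegative at an arbitrary pair (x, y)
x_rand, y_rand = rng.normal(size=d), rng.normal(size=n)
assert f_primal(x_rand) + g_dual(y_rand) >= -1e-12
```

With other smooth losses only the conjugate $\phi_i^*$ and the stationarity system change; the structure of the check is identical.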

Lemma B.5 (Gap for primal-dual pairs). Consider $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\mathrm{gap}_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (9), (6), and (7), respectively. For any $x, s \in \mathbb{R}^d$,
\[
\mathrm{gap}_{s,\lambda}(x, \hat{y}(x)) = \frac{\lambda}{2}\left\| \frac{1}{\lambda}\nabla F(x) + x - s \right\|^2. \tag{12}
\]
Separately, for any $y \in \mathbb{R}^n$ and $s \in \mathbb{R}^d$,
\[
\mathrm{gap}_{s,\lambda}(\hat{x}_{s,\lambda}(y), y) = F(\hat{x}_{s,\lambda}(y)) + g_{s,\lambda}(y) + \frac{1}{2\lambda}\|A^T y\|^2. \tag{13}
\]
Proof. To prove the first identity, let $\hat{y} = \hat{y}(x)$ for brevity. Recall that
\[
\hat{y}_i = \phi_i'(a_i^T x) \in \operatorname{argmax}_{y_i} \left\{ x^T a_i\, y_i - \phi_i^*(y_i) \right\} \tag{14}
\]
by definition, and hence $x^T a_i\, \hat{y}_i - \phi_i^*(\hat{y}_i) = \phi_i(a_i^T x)$. Observe that
\[
\begin{aligned}
\mathrm{gap}_{s,\lambda}(x, \hat{y})
&= \sum_{i=1}^n \phi_i(a_i^T x) + \frac{\lambda}{2}\|x - s\|^2 + \sum_{i=1}^n \phi_i^*(\hat{y}_i) + \frac{1}{2\lambda}\|A^T \hat{y}\|^2 - s^T A^T \hat{y} \\
&= \underbrace{\sum_{i=1}^n \left( \phi_i(a_i^T x) + \phi_i^*(\hat{y}_i) - x^T a_i\, \hat{y}_i \right)}_{=\,0 \text{ by } (14)} + \frac{\lambda}{2}\left\| \frac{1}{\lambda}A^T \hat{y} + x - s \right\|^2 \\
&= \frac{\lambda}{2}\left\| \frac{1}{\lambda}\sum_{i=1}^n a_i\, \phi_i'(a_i^T x) + x - s \right\|^2
= \frac{\lambda}{2}\left\| \frac{1}{\lambda}\nabla F(x) + x - s \right\|^2.
\end{aligned}
\]
For the second identity (13), let $\hat{x} = \hat{x}_{s,\lambda}(y)$ for brevity. Fix $s \in \mathbb{R}^d$ and $\lambda > 0$. Define $r(x) = \frac{\lambda}{2}\|x - s\|^2$, and note that its Fenchel conjugate is $r^*(v) = \frac{1}{2\lambda}\|v\|^2 + s^T v$. With this notation we can write
\[
f_{s,\lambda}(x) = F(x) + r(x), \qquad g_{s,\lambda}(y) = G(y) + r^*(-A^T y).
\]
Observe that
\[
\hat{x} = s - \frac{1}{\lambda}A^T y
= \operatorname{argmin}_x \left\{ \frac{\lambda}{2}\|x - s\|^2 + x^T A^T y \right\}
= \operatorname{argmin}_x \left\{ r(x) + x^T A^T y \right\}
= \operatorname{argmax}_x \left\{ x^T(-A^T y) - r(x) \right\}
= \nabla r^*(-A^T y).
\]
Note also that $g_{s,\lambda}$ may be rewritten, using $s = \hat{x} + \frac{1}{\lambda}A^T y$, as
\[
g_{s,\lambda}(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - s^T A^T y
= G(y) - \frac{1}{2\lambda}\|A^T y\|^2 - \hat{x}^T A^T y.
\]
Combining the above two observations, and noting that the first implies equality in the Fenchel–Young inequality (so that $r(\hat{x}) + r^*(-A^T y) = -\hat{x}^T A^T y$),
\[
\mathrm{gap}_{s,\lambda}(\hat{x}, y)
= f_{s,\lambda}(\hat{x}) + g_{s,\lambda}(y)
= F(\hat{x}) + G(y) + r(\hat{x}) + r^*(-A^T y)
= F(\hat{x}) + G(y) - \hat{x}^T A^T y
= F(\hat{x}) + g_{s,\lambda}(y) + \frac{1}{2\lambda}\|A^T y\|^2,
\]

proving the claim.

C. Convergence analysis of Dual APPA

The goal of this section is to establish the convergence rate of Algorithm 3, Dual APPA. It is structured as follows. Mirroring the analogous lemma of Section 3, we establish Lemma C.1, concerning contraction of the norm of the gradient, $\|\nabla F\|$, rather than of the error in function value $F - F^{\mathrm{opt}}$. Like that lemma, Lemma C.1 is self-contained and assumes only that $F$ is $\mu$-strongly convex, not that it is the ERM objective. We then directly establish a geometric rate upper bound on the gradient norm of $F$ over the course of the algorithm, relying in the process on Lemma C.1. Finally, we invoke the strong convexity of $F$ to establish that the error $F - F^{\mathrm{opt}}$ too is subject to the same rate, up to an extra multiplicative factor.

Lemma C.1 (Gradient-norm reduction implies contraction). Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex and that $\lambda > 0$. Define $f_s : \mathbb{R}^d \to \mathbb{R}$ by $f_s(x) = F(x) + \frac{\lambda}{2}\|x - s\|^2$. For any $x_0, x_1 \in \mathbb{R}^d$,
\[
\|\nabla F(x_1)\| \le \left( 1 + \frac{\lambda}{\lambda + \mu} \right)\|\nabla f_{x_0}(x_1)\| + \frac{\lambda}{\lambda + \mu}\|\nabla F(x_0)\|. \tag{15}
\]
Corollary C.2. Define $\gamma \stackrel{\text{def}}{=} \lambda/(\lambda + \mu)$ and let $\tau$ be any scalar in the interval $[\gamma, 1)$. For any $x_0, x_1 \in \mathbb{R}^d$,
\[
\|\nabla F(x_1)\| \le \tau\,\|\nabla F(x_0)\|, \tag{16}
\]
whenever
\[
\|\nabla f_{x_0}(x_1)\| \le \frac{\tau - \gamma}{1 + \gamma}\|\nabla F(x_0)\|. \tag{17}
\]
Proof of Lemma C.1. Taking gradients of $f_{x_0}$ and $f_{x_1}$ we have that
\[
\nabla F(x_1) = \nabla F(x_1) + \lambda(x_1 - x_0) - \lambda(x_1 - x_0) = \nabla f_{x_0}(x_1) - \lambda(x_1 - x_0).
\]
By the triangle inequality,
\[
\|\nabla F(x_1)\| \le \|\nabla f_{x_0}(x_1)\| + \lambda\|x_1 - x_0\|.
\]
Because $f_{x_0}$ is $(\lambda + \mu)$-strongly convex,
\[
(\lambda + \mu)\|x_1 - x_0\| \le \|\nabla f_{x_0}(x_1) - \nabla f_{x_0}(x_0)\| \le \|\nabla f_{x_0}(x_1)\| + \|\nabla f_{x_0}(x_0)\| = \|\nabla f_{x_0}(x_1)\| + \|\nabla F(x_0)\|.
\]
Combining the previous two observations proves the claim.

Throughout the remainder of this section, we consider the functions defined in the main setup section: namely, $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (6), and (7), respectively, where $F$ is a $\mu$-strongly convex sum whose constituent summands are each $L$-smooth scalar functions operating on the inner product of the variable $x \in \mathbb{R}^d$ with a vector $a_i$ of Euclidean norm at most $R$. For notational brevity, we omit the subscript $\lambda$, and we denote iterates of the algorithm with numerical subscripts (e.g. $x_1, x_2, \dots$, and $y_1, y_2, \dots$) rather than parenthesized superscripts.
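The contraction inequality of Lemma C.1 can be sanity-checked numerically. The quadratic choice of $F$, the value of $\lambda$, and the sampled points below are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 4, 0.5

# mu-strongly convex quadratic F(x) = 0.5 x^T Q x, with Q symmetric PD
M = rng.normal(size=(d, d))
Q = M @ M.T + 0.1 * np.eye(d)
mu = np.linalg.eigvalsh(Q).min()      # strong convexity parameter of F
grad_F = lambda x: Q @ x              # gradient of F
gamma = lam / (lam + mu)

for _ in range(100):
    x0, x1 = rng.normal(size=d), rng.normal(size=d)
    # gradient of f_{x0}(x) = F(x) + (lam/2)||x - x0||^2, evaluated at x1
    grad_f = grad_F(x1) + lam * (x1 - x0)
    lhs = np.linalg.norm(grad_F(x1))
    rhs = ((1 + gamma) * np.linalg.norm(grad_f)
           + gamma * np.linalg.norm(grad_F(x0)))
    assert lhs <= rhs + 1e-9          # inequality (15)
```

The bound is not tight for generic points; it becomes most informative when $x_1$ nearly minimizes $f_{x_0}$, so that the first term on the right vanishes.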
We begin by bounding the error of the dual regularized ERM problem when the center of regularization changes. This characterizes the initial error at the beginning of each Dual APPA iteration.
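To illustrate the re-centering dynamic, here is a minimal sketch of the outer loop with the dual oracle idealized as an exact inner solve; this is an assumption for illustration (with quadratic losses the proximal subproblem is a linear system), not the actual Algorithm 3, whose inner solver is inexact. With exact solves, $\nabla f_{x_{t-1}}(x_t) = 0$, so Lemma C.1 predicts contraction of $\|\nabla F\|$ by the factor $\gamma = \lambda/(\lambda + \mu)$ per iteration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 20, 5, 1.0
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# F(x) = 0.5 ||Ax - b||^2; its strong convexity is mu = lambda_min(A^T A)
Q = A.T @ A
mu = np.linalg.eigvalsh(Q).min()
gamma = lam / (lam + mu)
grad_F = lambda x: Q @ x - A.T @ b

x = np.zeros(d)
norms = [np.linalg.norm(grad_F(x))]
for _ in range(10):
    # exact proximal (re-centering) step: x <- argmin_z F(z) + (lam/2)||z - x||^2
    x = np.linalg.solve(Q + lam * np.eye(d), A.T @ b + lam * x)
    norms.append(np.linalg.norm(grad_F(x)))

# geometric decay of the gradient norm at rate gamma = lam/(lam + mu)
for prev, cur in zip(norms, norms[1:]):
    assert cur <= gamma * prev + 1e-9
```

Dual APPA replaces the exact solve with an approximate dual maximizer whose error contracts by the oracle factor $\sigma$; the analysis below quantifies how much inexactness the contraction tolerates.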

Lemma C.3 (Dual error is bounded across re-centering). Fix $y \in \mathbb{R}^n$ and $x \in \mathbb{R}^d$. Let $x' = \hat{x}_x(y)$. Then
\[
g_{x'}(y) - g^{\mathrm{opt}}_{x'} \le c_1\left( g_x(y) - g^{\mathrm{opt}}_x \right) + c_2\|\nabla F(x)\|^2,
\]
where
\[
c_1 = 1 + \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2}\left( 1 + \frac{2\lambda(nR^2 L + \lambda)}{(\lambda + \mu)^2} \right)
\quad \text{and} \quad
c_2 = \frac{\lambda}{(\lambda + \mu)^2}.
\]
In other words, the dual error $g_s - g_s^{\mathrm{opt}}$ is bounded across a re-centering step by multiples of previous sub-optimality measurements, namely dual error and gradient norm.

Proof. We first show how re-centering, from $x$ to $x'$, increases the duality gap between $y$ and $x'$ (the new center). The dual function value changes from
\[
g_x(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - x^T A^T y
\]
to
\[
g_{x'}(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - (x')^T A^T y
= g_x(y) + (x - x')^T A^T y
= g_x(y) + \lambda\|x - x'\|^2,
\]
where the last equality uses $x - x' = \frac{1}{\lambda}A^T y$; i.e., it increases by $\lambda\|x - x'\|^2$. Meanwhile, the primal function value changes from $f_x(x') = F(x') + \frac{\lambda}{2}\|x' - x\|^2$ to $f_{x'}(x') = F(x')$, i.e. it decreases by $\frac{\lambda}{2}\|x - x'\|^2$. Hence, in total, the new duality gap is
\[
\mathrm{gap}_{x'}(x', y) = \mathrm{gap}_x(x', y) + \frac{\lambda}{2}\|x - x'\|^2.
\]
Now let $x^{\mathrm{opt}}_x = \operatorname{argmin} f_x$. By the $(\lambda + \mu)$-strong convexity of $f_x$ (Lemma B.1), $\|x - x^{\mathrm{opt}}_x\| \le \frac{1}{\lambda + \mu}\|\nabla f_x(x)\| = \frac{1}{\lambda + \mu}\|\nabla F(x)\|$ and $\|x' - x^{\mathrm{opt}}_x\| \le \frac{1}{\lambda + \mu}\|\nabla f_x(x')\|$, so that
\[
\frac{\lambda}{2}\|x - x'\|^2 \le \lambda\left( \|x - x^{\mathrm{opt}}_x\|^2 + \|x' - x^{\mathrm{opt}}_x\|^2 \right) \le \frac{\lambda}{(\lambda + \mu)^2}\left( \|\nabla F(x)\|^2 + \|\nabla f_x(x')\|^2 \right),
\]
and hence
\[
\mathrm{gap}_{x'}(x', y) \le \mathrm{gap}_x(x', y) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla f_x(x')\|^2 + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2,
\]
where the first inequality follows by the $(\lambda + \mu)$-strong convexity of $f_x$.

Combining the last two chains of inequalities, and abbreviating $\bar{L} = nR^2 L + \lambda$, we bound the re-centered dual error of $y$:
\[
\begin{aligned}
g_{x'}(y) - g^{\mathrm{opt}}_{x'}
&\le \mathrm{gap}_{x'}(x', y) \\
&\le \mathrm{gap}_x(x', y) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla f_x(x')\|^2 + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2 \\
&\le \left( 1 + \frac{R^2 L \bar{L}}{\lambda^2} \right)\left( g_x(y) - g^{\mathrm{opt}}_x \right) + \frac{\lambda}{(\lambda + \mu)^2}\cdot\frac{2R^2 L \bar{L}^2}{\lambda^2}\left( g_x(y) - g^{\mathrm{opt}}_x \right) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2 \\
&= c_1\left( g_x(y) - g^{\mathrm{opt}}_x \right) + c_2\|\nabla F(x)\|^2,
\end{aligned}
\]
where the third inequality applies Corollary B.3 to the first term, and to the second term the $\bar{L}$-smoothness of $f_x$ together with Corollary B.4, that is:
\[
\|\nabla f_x(x')\|^2 \le 2\bar{L}\left( f_x(x') - f_x^{\mathrm{opt}} \right) \le \frac{2R^2 L \bar{L}^2}{\lambda^2}\left( g_x(y) - g^{\mathrm{opt}}_x \right). \tag{18}
\]
Recalling the choices of numerical constants $c_1$ and $c_2$, the claim is proven.

Finally, we prove that for an appropriate choice of oracle parameter $\sigma$ (independent of the target error $\epsilon$) the iterates produced by Dual APPA are bounded above by a geometric sequence. In doing so, it will be convenient to abbreviate the same numerical constant as in (18),
\[
c_3 = \frac{2R^2 L (nR^2 L + \lambda)^2}{\lambda^2}.
\]
Proposition C.4 (Convergence rate of Dual APPA in gradient norm). Define $\gamma = \lambda/(\lambda + \mu)$ and let $\tau$ be any scalar in $[\gamma, 1)$. Define
\[
r = \max\left\{ \frac{1}{\sqrt{2}},\ \sqrt{\tau} \right\}, \tag{19}
\]
and
\[
\sigma = \min\left\{ \frac{1}{2},\ \frac{\lambda}{c_3},\ \frac{2\lambda(\tau - \gamma)^2}{c_3(1 + \gamma)^2},\ \frac{r^4}{c_1 + c_2 c_3},\ \frac{(\tau - \gamma)^2}{(1 + \gamma)^2(c_1 + c_2 c_3)} \right\}. \tag{20}
\]
In the execution of Dual APPA (Algorithm 3), for every iteration $t$,
\[
g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}} \le \frac{c^2}{c_3}\, r^{2t}
\qquad \text{and} \qquad
\|\nabla F(x_t)\| \le c\, r^t,
\]
where $c = \|\nabla F(x_0)\|$.

Proof. The argument proceeds by strong induction on $t$. We begin the inductive argument with base case $t = 1$. For the first invariant, the algorithm takes $\sigma \le 1/2$ and we have
\[
g_{x_0}(y_1) - g^{\mathrm{opt}}_{x_0} \le \sigma\left( g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \right) \le \frac{\sigma}{2\lambda}\|\nabla F(x_0)\|^2 \le \frac{c^2}{c_3}\, r^2,
\]
because the algorithm explicitly takes $y_0 = \hat{y}(x_0)$, so that by Lemma B.5 (with $s = x_0$), $g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \le \mathrm{gap}_{x_0}(x_0, y_0) = \frac{1}{2\lambda}\|\nabla F(x_0)\|^2$. So this part of the claim holds, as $r^2 \ge 1/2$ and $\sigma \le \lambda/c_3$.

The second invariant holds immediately at $t = 0$, as $c = \|\nabla F(x_0)\|$. We will also explicitly handle the case $t = 1$ for the second invariant. By Corollary B.4 and our choice of $\sigma$, we have that
\[
\|\nabla f_{x_0}(x_1)\|^2 \le c_3\left( g_{x_0}(y_1) - g^{\mathrm{opt}}_{x_0} \right) \le c_3\,\sigma\left( g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \right) \le \frac{c_3\,\sigma}{2\lambda}\|\nabla F(x_0)\|^2 \le \left( \frac{\tau - \gamma}{1 + \gamma} \right)^2\|\nabla F(x_0)\|^2,
\]
and so by Corollary C.2,
\[
\|\nabla F(x_1)\| \le \tau\,\|\nabla F(x_0)\| = \tau c,
\]
and the claim follows as $r \ge \sqrt{\tau} \ge \tau$.

Now consider $t \ge 2$. Concerning the first invariant, the algorithm takes $\sigma \le r^4/(c_1 + c_2 c_3)$ and so we have
\[
\begin{aligned}
g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}}
&\le \sigma\left( g_{x_{t-1}}(y_{t-1}) - g^{\mathrm{opt}}_{x_{t-1}} \right) \\
&\le \sigma\left[ c_1\left( g_{x_{t-2}}(y_{t-1}) - g^{\mathrm{opt}}_{x_{t-2}} \right) + c_2\|\nabla F(x_{t-2})\|^2 \right] \\
&\le \sigma\left[ c_1\,\frac{c^2}{c_3}\, r^{2(t-1)} + c_2\, c^2\, r^{2(t-2)} \right]
\le \sigma\left( c_1 + c_2 c_3 \right)\frac{c^2}{c_3}\, r^{2t-4}
\le \frac{c^2}{c_3}\, r^{2t},
\end{aligned}
\]
where the second inequality is Lemma C.3 and the last uses $\sigma(c_1 + c_2 c_3) \le r^4$.

For the second invariant, we have already handled the cases $t \le 1$, and so assume $t \ge 2$. By Lemma C.1, Corollary B.4, and the choice of $\sigma$, we have that
\[
\begin{aligned}
\|\nabla F(x_t)\|
&\le (1 + \gamma)\,\|\nabla f_{x_{t-1}}(x_t)\| + \gamma\,\|\nabla F(x_{t-1})\| \\
&\le (1 + \gamma)\sqrt{c_3\left( g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}} \right)} + \gamma\, c\, r^{t-1} \\
&\le (1 + \gamma)\sqrt{\sigma\left( c_1 + c_2 c_3 \right)}\; c\, r^{t-2} + \gamma\, c\, r^{t-1} \\
&\le (\tau - \gamma)\, c\, r^{t-2} + \gamma\, c\, r^{t-2}
= \tau\, c\, r^{t-2}
\le c\, r^t,
\end{aligned}
\]
where the third inequality uses the intermediate bound on $g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}}$ established above, the fourth uses $\sigma \le (\tau - \gamma)^2/\left( (1 + \gamma)^2(c_1 + c_2 c_3) \right)$ together with $r \le 1$, and the final inequality is due to the choice $r \ge \sqrt{\tau}$. This completes the inductive argument.

Equipped with Proposition C.4, we translate it into a statement about the convergence rate of Dual APPA in error, rather than in gradient norm, to prove Theorem .0:

Proof of Theorem .0. Define $\gamma$, $\tau$, $r$, and $\sigma$ as in Proposition C.4. In the execution of Dual APPA (Algorithm 3), at each iteration $t$, we claim that
\[
F(x_t) - F^{\mathrm{opt}} \le \frac{nR^2 L}{\mu}\,\epsilon_0\, r^{2t}, \tag{24}
\]

where $\epsilon_0 = F(x_0) - F^{\mathrm{opt}}$. Consequently, for any $\epsilon > 0$, in order to achieve $\epsilon$ error in Dual APPA, it suffices to take a number of iterations
\[
T \ge \frac{1}{2\log(1/r)}\left( \log\frac{nR^2 L}{\mu} + \log\frac{\epsilon_0}{\epsilon} \right).
\]
The first part of the claim follows directly from Proposition C.4, using the smoothness and strong convexity of $F$. The second part follows from a calculation sufficient to make the right-hand side of (24) smaller than a given $\epsilon$.

D. Convergence analysis of Accelerated APPA

The goal of this section is to establish the convergence rate of Accelerated APPA, and prove Theorem .6. Note that the results in this section use nothing about the structure of $F$ other than strong convexity, and thus they apply to the general ERM problem (1); they are written in greater generality to make this distinction clear. Aspects of the proofs in this section bear resemblance to the analysis in Shalev-Shwartz & Zhang (2014), which achieves similar results in a more specialized setting.

The rest of this section is structured as follows. In Lemma D.1 we show that applying a primal oracle to the inner minimization problem gives us a quadratic lower bound on the objective. In Lemma D.2 we use this lower bound to construct a series of lower bounds for the main objective function $f$ and accelerate the APPA algorithm; this comprises the bulk of the analysis. In Lemma D.3 we show that the requirements of Lemma D.2 can be met by using a primal oracle that decreases the error by a constant factor. In Lemma D.4 we analyze the initial error requirements of Lemma D.2. In Lemma D.5 we provide an auxiliary lemma for combining two quadratic functions, which we use in the earlier proofs. The proof of Theorem .6 follows immediately from these lemmas.

Lemma D.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $\mu$-strongly convex and let $f_{x_0,\lambda}(x) \stackrel{\text{def}}{=} f(x) + \frac{\lambda}{2}\|x - x_0\|^2$. Suppose $x^{+}$ satisfies
\[
f_{x_0,\lambda}(x^{+}) \le \min_{x \in \mathbb{R}^n} f_{x_0,\lambda}(x) + \epsilon.
\]
Then for $\bar{\mu} \stackrel{\text{def}}{=} \mu/2$, $g \stackrel{\text{def}}{=} \lambda(x_0 - x^{+})$, and all $x \in \mathbb{R}^n$ we have
\[
f(x) \ge f(x^{+}) + \langle g, x - x_0 \rangle + \frac{\bar{\mu}}{2}\|x - x_0\|^2 + \frac{1}{2\lambda}\|g\|^2 - 2\epsilon.
\]
Note that as $\bar{\mu} = \mu/2$ we are only losing a factor of $2$ in the strong convexity parameter for our lower bound.
This allows us to account for the error without loss in our ultimate convergence rates.

Proof. Let $x^{\mathrm{opt}} = \operatorname{argmin}_{x \in \mathbb{R}^n} f_{x_0,\lambda}(x)$. Since $f$ is $\mu$-strongly convex, clearly $f_{x_0,\lambda}$ is $(\lambda + \mu)$-strongly convex, and by Lemma B.1
\[
f_{x_0,\lambda}(x) \ge f_{x_0,\lambda}(x^{\mathrm{opt}}) + \frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2. \tag{25}
\]
By the triangle inequality and Cauchy–Schwarz we know
\[
\|x - x^{+}\|^2 \le \left( \|x - x^{\mathrm{opt}}\| + \|x^{\mathrm{opt}} - x^{+}\| \right)^2 \le 2\|x - x^{\mathrm{opt}}\|^2 + 2\|x^{\mathrm{opt}} - x^{+}\|^2,
\]

which implies
\[
\frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2 \ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - \frac{\lambda + \mu}{2}\|x^{\mathrm{opt}} - x^{+}\|^2.
\]
On the other hand, since $f_{x_0,\lambda}(x^{+}) \le f_{x_0,\lambda}(x^{\mathrm{opt}}) + \epsilon$, by (25) we have $\frac{\lambda + \mu}{2}\|x^{+} - x^{\mathrm{opt}}\|^2 \le \epsilon$, and therefore
\[
\begin{aligned}
f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{+})
&\ge f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{\mathrm{opt}}) - \epsilon
\ge \frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2 - \epsilon \\
&\ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - \frac{\lambda + \mu}{2}\|x^{\mathrm{opt}} - x^{+}\|^2 - \epsilon
\ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - 2\epsilon.
\end{aligned}
\]
Now since
\[
\|x - x^{+}\|^2 = \left\| x - x_0 + \tfrac{1}{\lambda}g \right\|^2 = \|x - x_0\|^2 + \tfrac{2}{\lambda}\langle g, x - x_0 \rangle + \tfrac{1}{\lambda^2}\|g\|^2,
\]
and using the fact that $f_{x_0,\lambda}(x) = f(x) + \frac{\lambda}{2}\|x - x_0\|^2$, collecting terms yields the claimed quadratic lower bound. The right-hand side of the claim is a quadratic function of $x$; setting its gradient with respect to $x$ to zero, we see that it obtains its minimum at $x = x_0 - \frac{1}{\bar{\mu}}g$, with minimum value $f(x^{+}) - \left( \frac{1}{2\bar{\mu}} - \frac{1}{2\lambda} \right)\|g\|^2 - 2\epsilon$.

Lemma D.2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $\mu$-strongly convex, and suppose that in each iteration $t$ we have a lower bound of the form $\psi_t(x) = \psi_t^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_t\|^2$ such that $f(x) \ge \psi_t(x)$ for all $x$. Let
\[
\rho \stackrel{\text{def}}{=} \frac{\lambda + \bar{\mu}}{\bar{\mu}}, \qquad
f_{\lambda, y}(x) \stackrel{\text{def}}{=} f(x) + \frac{\lambda}{2}\|x - y\|^2, \qquad
y_t \stackrel{\text{def}}{=} \frac{x_t + \rho^{-1/2} v_t}{1 + \rho^{-1/2}},
\]
and suppose
\[
\mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right), \qquad
g_t \stackrel{\text{def}}{=} \lambda(y_t - x_{t+1}), \qquad
v_{t+1} \stackrel{\text{def}}{=} \left( 1 - \rho^{-1/2} \right) v_t + \rho^{-1/2}\left( y_t - \tfrac{1}{\bar{\mu}} g_t \right).
\]
We have that
\[
\mathbb{E}\left[ f(x_t) - \psi_t^{\mathrm{opt}} \right] \le \left( 1 - \tfrac{1}{2}\rho^{-1/2} \right)^t \left( f(x_0) - \psi_0^{\mathrm{opt}} \right).
\]
Proof. Regardless of how $y_t$ is chosen, we know by Lemma D.1 (applied with center $y_t$ and error $\epsilon_t \stackrel{\text{def}}{=} f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t}$) that for all $x \in \mathbb{R}^n$,
\[
f(x) \ge f(x_{t+1}) + \langle g_t, x - y_t \rangle + \frac{\bar{\mu}}{2}\|x - y_t\|^2 + \frac{1}{2\lambda}\|g_t\|^2 - 2\epsilon_t. \tag{26}
\]

Thus, for $\beta = 1 - \rho^{-1/2}$ we can let
\[
\psi_{t+1}(x) \stackrel{\text{def}}{=} \beta\,\psi_t(x) + (1 - \beta)\left[ f(x_{t+1}) + \langle g_t, x - y_t \rangle + \frac{\bar{\mu}}{2}\|x - y_t\|^2 + \frac{1}{2\lambda}\|g_t\|^2 - 2\epsilon_t \right],
\]
which is a valid lower bound on $f$ by (26) and the hypothesis $f \ge \psi_t$. The bracketed quadratic attains its minimum $f(x_{t+1}) + \left( \frac{1}{2\lambda} - \frac{1}{2\bar{\mu}} \right)\|g_t\|^2 - 2\epsilon_t$ at the point $y_t - \frac{1}{\bar{\mu}}g_t$, so by Lemma D.5, $\psi_{t+1}(x) = \psi_{t+1}^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_{t+1}\|^2$ with $v_{t+1} = \beta v_t + (1 - \beta)\left( y_t - \frac{1}{\bar{\mu}}g_t \right)$, as defined, and
\[
\psi_{t+1}^{\mathrm{opt}} = \beta\,\psi_t^{\mathrm{opt}} + (1 - \beta)\left[ f(x_{t+1}) + \left( \tfrac{1}{2\lambda} - \tfrac{1}{2\bar{\mu}} \right)\|g_t\|^2 - 2\epsilon_t \right] + \frac{\bar{\mu}}{2}\beta(1 - \beta)\left\| v_t - y_t + \tfrac{1}{\bar{\mu}}g_t \right\|^2.
\]
Furthermore, expanding the term $\left\| v_t - y_t + \frac{1}{\bar{\mu}}g_t \right\|^2$ and instantiating $x$ with $x_t$ in (26) yields
\[
f(x_{t+1}) \le f(x_t) - \langle g_t, x_t - y_t \rangle - \frac{\bar{\mu}}{2}\|x_t - y_t\|^2 - \frac{1}{2\lambda}\|g_t\|^2 + 2\epsilon_t.
\]
Consequently we know (dropping the nonpositive squared-norm terms)
\[
f(x_{t+1}) - \psi_{t+1}^{\mathrm{opt}} \le \beta\left( f(x_t) - \psi_t^{\mathrm{opt}} \right) - \beta\left\langle g_t,\ (x_t - y_t) + (1 - \beta)(v_t - y_t) \right\rangle + \left( \frac{(1 - \beta)^2}{2\bar{\mu}} - \frac{1}{2\lambda} \right)\|g_t\|^2 + 2\epsilon_t.
\]
Note that we have chosen $y_t$ precisely so that the inner product term equals $0$: since $1 - \beta = \rho^{-1/2}$ and $y_t = (x_t + \rho^{-1/2} v_t)/(1 + \rho^{-1/2})$, we have $(x_t - y_t) + (1 - \beta)(v_t - y_t) = 0$. Moreover, the choice $\beta = 1 - \rho^{-1/2}$ ensures $\frac{(1 - \beta)^2}{2\bar{\mu}} - \frac{1}{2\lambda} \le 0$, since $(1 - \beta)^2 = \rho^{-1} = \frac{\bar{\mu}}{\lambda + \bar{\mu}}$, so that $\frac{(1 - \beta)^2}{2\bar{\mu}} = \frac{1}{2(\lambda + \bar{\mu})} \le \frac{1}{2\lambda}$. Also, by assumption we know
\[
\mathbb{E}[\epsilon_t] = \mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right),
\]
which implies
\[
\mathbb{E}\left[ f(x_{t+1}) - \psi_{t+1}^{\mathrm{opt}} \right] \le \left( \beta + \frac{\rho^{-3/2}}{2} \right)\left( f(x_t) - \psi_t^{\mathrm{opt}} \right) \le \left( 1 - \tfrac{1}{2}\rho^{-1/2} \right)\left( f(x_t) - \psi_t^{\mathrm{opt}} \right).
\]
In the final step we are using the fact that $\rho^{-3/2} \le \rho^{-1/2}$, as $\rho \ge 1$.

Lemma D.3. Under the setting of Lemma D.2, we have
\[
f_{\lambda, y_t}(x_t) - f^{\mathrm{opt}}_{\lambda, y_t} \le f(x_t) - \psi_t^{\mathrm{opt}}.
\]
In particular, in order to achieve $\mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right)$, we only need an oracle that shrinks the function error by a factor of $\frac{\rho^{-3/2}}{4}$ in expectation.

Proof. We know
\[
f_{\lambda, y_t}(x_t) - f(x_t) = \frac{\lambda}{2}\|x_t - y_t\|^2 = \frac{\lambda\,\rho^{-1}}{2\left( 1 + \rho^{-1/2} \right)^2}\|x_t - v_t\|^2,
\]
since $x_t - y_t = \frac{\rho^{-1/2}}{1 + \rho^{-1/2}}(x_t - v_t)$. We will show that the lower-bound optimum $f^{\mathrm{opt}}_{\lambda, y_t}$ is larger than $\psi_t^{\mathrm{opt}}$ by at least the same amount. This is because for all $x$ we have
\[
f_{\lambda, y_t}(x) = f(x) + \frac{\lambda}{2}\|x - y_t\|^2 \ge \psi_t^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_t\|^2 + \frac{\lambda}{2}\|x - y_t\|^2.
\]
The right-hand side is a quadratic function, whose optimal point is at $x = \frac{\bar{\mu} v_t + \lambda y_t}{\bar{\mu} + \lambda}$ and whose optimal value is equal to
\[
\psi_t^{\mathrm{opt}} + \frac{\bar{\mu}\lambda}{2(\bar{\mu} + \lambda)}\|v_t - y_t\|^2
= \psi_t^{\mathrm{opt}} + \frac{\lambda\,\rho^{-1}}{2\left( 1 + \rho^{-1/2} \right)^2}\|x_t - v_t\|^2,
\]
where we used $v_t - y_t = \frac{1}{1 + \rho^{-1/2}}(v_t - x_t)$ and, by the definition of $\rho$, $\frac{\bar{\mu}\lambda}{\bar{\mu} + \lambda} = \frac{\lambda}{\rho}$. Therefore
\[
f_{\lambda, y_t}(x_t) - f^{\mathrm{opt}}_{\lambda, y_t} \le f(x_t) - \psi_t^{\mathrm{opt}}.
\]
Remark. In the lemma above, moving to the regularized problem has the same effect on the primal function value as on the lower bound. This is a result of the choice of $\beta$ in the proof of Lemma D.2. This does not mean that the choice of $\beta$ is very fragile, however: we can choose any $\beta$ between the current value and $1$; the effect on this lemma would be that the increase in the primal function value becomes smaller than the increase in the lower bound, so the lemma continues to hold.

Lemma D.4. Let $\psi_0^{\mathrm{opt}} = f(x_0) - 2\left( f(x_0) - f^{\mathrm{opt}} \right)$ and $v_0 = x_0$; then $\psi_0(x) \stackrel{\text{def}}{=} \psi_0^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_0\|^2$ is a valid lower bound for $f$. In particular, the initial error satisfies $f(x_0) - \psi_0^{\mathrm{opt}} = 2\left( f(x_0) - f^{\mathrm{opt}} \right)$.

Proof. This lemma is a direct corollary of Lemma D.1 with $x^{+} = x_0$, so that $g = 0$ and $\epsilon \le f(x_0) - f^{\mathrm{opt}}$.

Lemma D.5. Suppose that for all $x$ we have $f_1(x) \stackrel{\text{def}}{=} \psi_1 + \frac{\bar{\mu}}{2}\|x - v_1\|^2$ and $f_2(x) \stackrel{\text{def}}{=} \psi_2 + \frac{\bar{\mu}}{2}\|x - v_2\|^2$. Then
\[
\alpha f_1(x) + (1 - \alpha) f_2(x) = \psi_\alpha + \frac{\bar{\mu}}{2}\|x - v_\alpha\|^2,
\]
where
\[
v_\alpha = \alpha v_1 + (1 - \alpha) v_2
\quad \text{and} \quad
\psi_\alpha = \alpha \psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\alpha(1 - \alpha)\|v_1 - v_2\|^2.
\]
Proof. Setting the gradient of $\alpha f_1(x) + (1 - \alpha) f_2(x)$ to $0$, we see that $v_\alpha$ must satisfy
\[
\alpha\left( v_\alpha - v_1 \right) + (1 - \alpha)\left( v_\alpha - v_2 \right) = 0,
\]
and thus $v_\alpha = \alpha v_1 + (1 - \alpha) v_2$. Finally,
\[
\begin{aligned}
\psi_\alpha
&= \alpha\left[ \psi_1 + \frac{\bar{\mu}}{2}\|v_\alpha - v_1\|^2 \right] + (1 - \alpha)\left[ \psi_2 + \frac{\bar{\mu}}{2}\|v_\alpha - v_2\|^2 \right] \\
&= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\left[ \alpha(1 - \alpha)^2 + \alpha^2(1 - \alpha) \right]\|v_1 - v_2\|^2 \\
&= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\alpha(1 - \alpha)\|v_1 - v_2\|^2.
\end{aligned}
\]
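Lemma D.5 is a purely algebraic identity, and is easy to verify numerically; the dimension, curvature, and mixing weight below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
k, mu_bar, alpha = 6, 0.9, 0.3
psi1, psi2 = rng.normal(), rng.normal()
v1, v2 = rng.normal(size=k), rng.normal(size=k)

f1 = lambda x: psi1 + 0.5 * mu_bar * np.sum((x - v1) ** 2)
f2 = lambda x: psi2 + 0.5 * mu_bar * np.sum((x - v2) ** 2)

# Lemma D.5: a convex combination of two quadratics with the same curvature
# is again such a quadratic, with the stated center and offset
v_a = alpha * v1 + (1 - alpha) * v2
psi_a = (alpha * psi1 + (1 - alpha) * psi2
         + 0.5 * mu_bar * alpha * (1 - alpha) * np.sum((v1 - v2) ** 2))
f_a = lambda x: psi_a + 0.5 * mu_bar * np.sum((x - v_a) ** 2)

for _ in range(20):
    x = rng.normal(size=k)
    lhs = alpha * f1(x) + (1 - alpha) * f2(x)
    assert abs(lhs - f_a(x)) < 1e-9
```

This is exactly the combination step used to maintain the estimate sequence $\psi_t$ in the proof of Lemma D.2.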