A. Derivation of regularized ERM duality

A. Derivation of regularized ERM duality

For completeness, in this section we derive the dual (5) to the problem of computing the proximal operator for the ERM objective (3). We can rewrite the primal problem as
\[
\min_{x \in \mathbb{R}^d,\, z \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2
\quad \text{subject to } z_i = a_i^T x, \ \text{for } i = 1, \dots, n.
\]
By convex duality, this is equivalent to the Lagrangian saddle-point problem
\[
\min_{x,\{z_i\}} \max_{y \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z),
\]
where the Lagrangian is defined as
\[
L(x, y, z) = \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z).
\]
Exchanging the min and the max,
\[
\begin{aligned}
&\max_{y \in \mathbb{R}^n} \min_{x \in \mathbb{R}^d,\, z \in \mathbb{R}^n} \sum_{i=1}^n \phi_i(z_i) + \frac{\lambda}{2}\|x - s\|^2 + y^T(Ax - z) \\
&\quad= \max_y \left\{ \sum_{i=1}^n \min_{z_i} \left\{ \phi_i(z_i) - y_i z_i \right\} + \min_x \left\{ \frac{\lambda}{2}\|x - s\|^2 + y^T A x \right\} \right\} \\
&\quad= \max_y \left\{ -\sum_{i=1}^n \max_{z_i} \left\{ y_i z_i - \phi_i(z_i) \right\} - \max_x \left\{ -y^T A x - \frac{\lambda}{2}\|x - s\|^2 \right\} \right\} \\
&\quad= \max_y \left\{ -\sum_{i=1}^n \phi_i^*(y_i) - \frac{1}{2\lambda}\|A^T y\|^2 + s^T A^T y \right\} \\
&\quad= -\min_y \left\{ \sum_{i=1}^n \phi_i^*(y_i) + \frac{1}{2\lambda}\|A^T y\|^2 - s^T A^T y \right\}.
\end{aligned}
\]
The final negated problem is precisely the dual formulation (5).

The dual-to-primal mapping (6) and primal-to-dual mapping (7) are implied by the KKT conditions under $L$, and can be derived by solving for $x$, $y$, and $z$ in the system $\nabla L(x, y, z) = 0$. The duality gap in this context is defined as
\[
\mathrm{gap}_{s,\lambda}(x, y) = f_{s,\lambda}(x) + g_{s,\lambda}(y). \tag{9}
\]
Strong duality dictates that $\mathrm{gap}_{s,\lambda}(x, y) \ge 0$ for all $x \in \mathbb{R}^d$, $y \in \mathbb{R}^n$, with equality attained when $x$ is primal-optimal and $y$ is dual-optimal.

B. Additional lemmas

This section establishes technical lemmas, useful in statements and proofs throughout the paper.

Lemma B.1 (Standard bounds for smooth, strongly convex functions). Let $f : \mathbb{R}^k \to \mathbb{R}$ be differentiable and let $x \in \mathbb{R}^k$. Furthermore, let $x^{\mathrm{opt}}$ be a minimizer of $f$. If $f$ is $L$-smooth then
\[
\frac{1}{2L}\|\nabla f(x)\|^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{L}{2}\|x - x^{\mathrm{opt}}\|^2.
\]

If $f$ is $\mu$-strongly convex we have
\[
\frac{\mu}{2}\|x - x^{\mathrm{opt}}\|^2 \le f(x) - f(x^{\mathrm{opt}}) \le \frac{1}{2\mu}\|\nabla f(x)\|^2.
\]
Proof. Apply the definition of smoothness and strong convexity at the points $x$ and $x^{\mathrm{opt}}$ and minimize the resulting quadratic form.

Lemma B.2 (Errors for primal-dual pairs). Consider $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (6), and (7), respectively. Then for all $y \in \mathbb{R}^n$,
\[
f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}} \le \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2}\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
Furthermore let $x^{\mathrm{opt}}_{s,\lambda} = \operatorname{argmin}_x f_{s,\lambda}(x)$ and $y^{\mathrm{opt}}_{s,\lambda} = \operatorname{argmin}_y g_{s,\lambda}(y)$. Then for all $y \in \mathbb{R}^n$,
\[
\|\hat{x}_{s,\lambda}(y) - x^{\mathrm{opt}}_{s,\lambda}\| \le \frac{\sqrt{n}\,R}{\lambda}\|y - y^{\mathrm{opt}}_{s,\lambda}\|.
\]
Proof. Because $F$ is $nR^2 L$-smooth, $f_{s,\lambda}$ is $(nR^2 L + \lambda)$-smooth. Consequently, for all $x \in \mathbb{R}^d$ we have
\[
f_{s,\lambda}(x) - f_{s,\lambda}^{\mathrm{opt}} \le \frac{nR^2 L + \lambda}{2}\|x - x^{\mathrm{opt}}_{s,\lambda}\|^2.
\]
Since we know that $x^{\mathrm{opt}}_{s,\lambda} = s - \frac{1}{\lambda}A^T y^{\mathrm{opt}}_{s,\lambda}$ and $AA^T \preceq nR^2 I$,
\[
f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}}
\le \frac{nR^2 L + \lambda}{2}\left\| \frac{1}{\lambda}A^T\left( y - y^{\mathrm{opt}}_{s,\lambda} \right) \right\|^2
\le \frac{nR^2(nR^2 L + \lambda)}{2\lambda^2}\|y - y^{\mathrm{opt}}_{s,\lambda}\|^2, \tag{10}
\]
which also proves the second claim, as $\hat{x}_{s,\lambda}(y) - x^{\mathrm{opt}}_{s,\lambda} = \frac{1}{\lambda}A^T(y^{\mathrm{opt}}_{s,\lambda} - y)$. Finally, since each $\phi_i^*$ is $1/L$-strongly convex, $G$ is $n/L$-strongly convex and hence so is $g_{s,\lambda}$. Therefore,
\[
\frac{n}{2L}\|y - y^{\mathrm{opt}}_{s,\lambda}\|^2 \le g_{s,\lambda}(y) - g_{s,\lambda}(y^{\mathrm{opt}}_{s,\lambda}).
\]
Substituting in (10) yields the result.

By adding the dual error to both sides of the inequality in Lemma B.2, we obtain the following corollary.

Corollary B.3 (Dual error bounds gap). Consider $g_{s,\lambda}$, $\mathrm{gap}_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (5), (9), (6), and (7), respectively. For all $s \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$,
\[
\mathrm{gap}_{s,\lambda}(\hat{x}_{s,\lambda}(y), y) \le \left( 1 + \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2} \right)\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
Another corollary arises by combining Lemmas B.1 and B.2.

Corollary B.4 (Dual error bounds primal gradient). Consider $f_{s,\lambda}$, $g_{s,\lambda}$, and $\hat{x}_{s,\lambda}(\cdot)$, as defined in (3), (5), and (6), respectively. For all $s \in \mathbb{R}^d$ and $y \in \mathbb{R}^n$,
\[
\|\nabla f_{s,\lambda}(\hat{x}_{s,\lambda}(y))\|^2 \le 2(nR^2 L + \lambda)\left( f_{s,\lambda}(\hat{x}_{s,\lambda}(y)) - f_{s,\lambda}^{\mathrm{opt}} \right) \le \frac{2R^2 L (nR^2 L + \lambda)^2}{\lambda^2}\left( g_{s,\lambda}(y) - g_{s,\lambda}^{\mathrm{opt}} \right).
\]
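The duality derived in Section A can be sanity-checked numerically on a small instance. The following is a minimal sketch, assuming squared losses $\phi_i(z) = \frac{1}{2}(z - b_i)^2$ (whose conjugate is $\phi_i^*(u) = \frac{1}{2}u^2 + b_i u$); the dimensions, data, and tolerances are illustrative assumptions, not values from the paper. It checks strong duality ($f^{\mathrm{opt}} + g^{\mathrm{opt}} = 0$), the dual-to-primal mapping $\hat{x}(y) = s - \frac{1}{\lambda}A^T y$, and nonnegativity of the gap at an arbitrary point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 8, 3, 0.7
A = rng.normal(size=(n, d))          # rows are a_i^T
b = rng.normal(size=n)
s = rng.normal(size=d)

# Primal: min_x sum_i 0.5*(a_i^T x - b_i)^2 + (lam/2)||x - s||^2
x_opt = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b + lam * s)

def f_primal(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 0.5 * lam * np.sum((x - s) ** 2)

# Dual: min_y sum_i phi_i*(y_i) + (1/(2 lam))||A^T y||^2 - s^T A^T y
def g_dual(y):
    return (0.5 * np.sum(y ** 2) + b @ y
            + np.sum((A.T @ y) ** 2) / (2 * lam) - s @ (A.T @ y))

# Dual stationarity: y + b + (1/lam) A A^T y - A s = 0
y_opt = np.linalg.solve(np.eye(n) + A @ A.T / lam, A @ s - b)

# Strong duality: primal optimum plus (negated-dual) optimum is zero
assert abs(f_primal(x_opt) + g_dual(y_opt)) < 1e-8
# Dual-to-primal mapping recovers the primal minimizer
assert np.allclose(s - A.T @ y_opt / lam, x_opt)
# Weak duality: the gap f + g is nonnegative at an arbitrary pair (x, y)
x_rand, y_rand = rng.normal(size=d), rng.normal(size=n)
assert f_primal(x_rand) + g_dual(y_rand) >= -1e-12
```

With other smooth losses only the conjugate $\phi_i^*$ and the stationarity system change; the structure of the check is identical.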

Lemma B.5 (Gap for primal-dual pairs). Consider $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\mathrm{gap}_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (9), (6), and (7), respectively. For any $x, s \in \mathbb{R}^d$,
\[
\mathrm{gap}_{s,\lambda}(x, \hat{y}(x)) = \frac{\lambda}{2}\left\| \frac{1}{\lambda}\nabla F(x) + x - s \right\|^2. \tag{12}
\]
Separately, for any $y \in \mathbb{R}^n$ and $s \in \mathbb{R}^d$,
\[
\mathrm{gap}_{s,\lambda}(\hat{x}_{s,\lambda}(y), y) = F(\hat{x}_{s,\lambda}(y)) + g_{s,\lambda}(y) + \frac{1}{2\lambda}\|A^T y\|^2. \tag{13}
\]
Proof. To prove the first identity, let $\hat{y} = \hat{y}(x)$ for brevity. Recall that
\[
\hat{y}_i = \phi_i'(a_i^T x) \in \operatorname{argmax}_{y_i} \left\{ x^T a_i\, y_i - \phi_i^*(y_i) \right\} \tag{14}
\]
by definition, and hence $x^T a_i\, \hat{y}_i - \phi_i^*(\hat{y}_i) = \phi_i(a_i^T x)$. Observe that
\[
\begin{aligned}
\mathrm{gap}_{s,\lambda}(x, \hat{y})
&= \sum_{i=1}^n \phi_i(a_i^T x) + \frac{\lambda}{2}\|x - s\|^2 + \sum_{i=1}^n \phi_i^*(\hat{y}_i) + \frac{1}{2\lambda}\|A^T \hat{y}\|^2 - s^T A^T \hat{y} \\
&= \underbrace{\sum_{i=1}^n \left( \phi_i(a_i^T x) + \phi_i^*(\hat{y}_i) - x^T a_i\, \hat{y}_i \right)}_{=\,0 \text{ by } (14)} + \frac{\lambda}{2}\left\| \frac{1}{\lambda}A^T \hat{y} + x - s \right\|^2 \\
&= \frac{\lambda}{2}\left\| \frac{1}{\lambda}\sum_{i=1}^n a_i\, \phi_i'(a_i^T x) + x - s \right\|^2
= \frac{\lambda}{2}\left\| \frac{1}{\lambda}\nabla F(x) + x - s \right\|^2.
\end{aligned}
\]
For the second identity (13), let $\hat{x} = \hat{x}_{s,\lambda}(y)$ for brevity. Fix $s \in \mathbb{R}^d$ and $\lambda > 0$. Define $r(x) = \frac{\lambda}{2}\|x - s\|^2$, and note that its Fenchel conjugate is $r^*(v) = \frac{1}{2\lambda}\|v\|^2 + s^T v$. With this notation we can write
\[
f_{s,\lambda}(x) = F(x) + r(x), \qquad g_{s,\lambda}(y) = G(y) + r^*(-A^T y).
\]
Observe that
\[
\hat{x} = s - \frac{1}{\lambda}A^T y
= \operatorname{argmin}_x \left\{ \frac{\lambda}{2}\|x - s\|^2 + x^T A^T y \right\}
= \operatorname{argmin}_x \left\{ r(x) + x^T A^T y \right\}
= \operatorname{argmax}_x \left\{ x^T(-A^T y) - r(x) \right\}
= \nabla r^*(-A^T y).
\]
Note also that $g_{s,\lambda}$ may be rewritten, using $s = \hat{x} + \frac{1}{\lambda}A^T y$, as
\[
g_{s,\lambda}(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - s^T A^T y
= G(y) - \frac{1}{2\lambda}\|A^T y\|^2 - \hat{x}^T A^T y.
\]
Combining the above two observations, and noting that the first implies equality in the Fenchel–Young inequality (so that $r(\hat{x}) + r^*(-A^T y) = -\hat{x}^T A^T y$),
\[
\mathrm{gap}_{s,\lambda}(\hat{x}, y)
= f_{s,\lambda}(\hat{x}) + g_{s,\lambda}(y)
= F(\hat{x}) + G(y) + r(\hat{x}) + r^*(-A^T y)
= F(\hat{x}) + G(y) - \hat{x}^T A^T y
= F(\hat{x}) + g_{s,\lambda}(y) + \frac{1}{2\lambda}\|A^T y\|^2,
\]

proving the claim.

C. Convergence analysis of Dual APPA

The goal of this section is to establish the convergence rate of Algorithm 3, Dual APPA. It is structured as follows. Mirroring the analogous lemma of Section 3, we establish Lemma C.1, concerning contraction of the norm of the gradient, $\|\nabla F\|$, rather than of the error in function value $F - F^{\mathrm{opt}}$. Like that lemma, Lemma C.1 is self-contained and assumes only that $F$ is $\mu$-strongly convex, not that it is the ERM objective. We then directly establish a geometric rate upper bound on the gradient norm of $F$ over the course of the algorithm, relying in the process on Lemma C.1. Finally, we invoke the strong convexity of $F$ to establish that the error $F - F^{\mathrm{opt}}$ too is subject to the same rate, up to an extra multiplicative factor.

Lemma C.1 (Gradient-norm reduction implies contraction). Suppose $F : \mathbb{R}^d \to \mathbb{R}$ is $\mu$-strongly convex and that $\lambda > 0$. Define $f_s : \mathbb{R}^d \to \mathbb{R}$ by $f_s(x) = F(x) + \frac{\lambda}{2}\|x - s\|^2$. For any $x_0, x_1 \in \mathbb{R}^d$,
\[
\|\nabla F(x_1)\| \le \left( 1 + \frac{\lambda}{\lambda + \mu} \right)\|\nabla f_{x_0}(x_1)\| + \frac{\lambda}{\lambda + \mu}\|\nabla F(x_0)\|. \tag{15}
\]
Corollary C.2. Define $\gamma \stackrel{\text{def}}{=} \lambda/(\lambda + \mu)$ and let $\tau$ be any scalar in the interval $[\gamma, 1)$. For any $x_0, x_1 \in \mathbb{R}^d$,
\[
\|\nabla F(x_1)\| \le \tau\,\|\nabla F(x_0)\|, \tag{16}
\]
whenever
\[
\|\nabla f_{x_0}(x_1)\| \le \frac{\tau - \gamma}{1 + \gamma}\|\nabla F(x_0)\|. \tag{17}
\]
Proof of Lemma C.1. Taking gradients of $f_{x_0}$ and $f_{x_1}$ we have that
\[
\nabla F(x_1) = \nabla F(x_1) + \lambda(x_1 - x_0) - \lambda(x_1 - x_0) = \nabla f_{x_0}(x_1) - \lambda(x_1 - x_0).
\]
By the triangle inequality,
\[
\|\nabla F(x_1)\| \le \|\nabla f_{x_0}(x_1)\| + \lambda\|x_1 - x_0\|.
\]
Because $f_{x_0}$ is $(\lambda + \mu)$-strongly convex,
\[
(\lambda + \mu)\|x_1 - x_0\| \le \|\nabla f_{x_0}(x_1) - \nabla f_{x_0}(x_0)\| \le \|\nabla f_{x_0}(x_1)\| + \|\nabla f_{x_0}(x_0)\| = \|\nabla f_{x_0}(x_1)\| + \|\nabla F(x_0)\|.
\]
Combining the previous two observations proves the claim.

Throughout the remainder of this section, we consider the functions defined in the main setup section: namely, $F$, $f_{s,\lambda}$, $g_{s,\lambda}$, $\hat{x}_{s,\lambda}(\cdot)$, and $\hat{y}(\cdot)$, as defined in (1), (3), (5), (6), and (7), respectively, where $F$ is a $\mu$-strongly convex sum whose constituent summands are each $L$-smooth scalar functions operating on the inner product of the variable $x \in \mathbb{R}^d$ with a vector $a_i$ of Euclidean norm at most $R$. For notational brevity, we omit the subscript $\lambda$, and we denote iterates of the algorithm with numerical subscripts (e.g. $x_1, x_2, \dots$, and $y_1, y_2, \dots$) rather than parenthesized superscripts.
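The contraction inequality of Lemma C.1 can be sanity-checked numerically. The quadratic choice of $F$, the value of $\lambda$, and the sampled points below are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 4, 0.5

# mu-strongly convex quadratic F(x) = 0.5 x^T Q x, with Q symmetric PD
M = rng.normal(size=(d, d))
Q = M @ M.T + 0.1 * np.eye(d)
mu = np.linalg.eigvalsh(Q).min()      # strong convexity parameter of F
grad_F = lambda x: Q @ x              # gradient of F
gamma = lam / (lam + mu)

for _ in range(100):
    x0, x1 = rng.normal(size=d), rng.normal(size=d)
    # gradient of f_{x0}(x) = F(x) + (lam/2)||x - x0||^2, evaluated at x1
    grad_f = grad_F(x1) + lam * (x1 - x0)
    lhs = np.linalg.norm(grad_F(x1))
    rhs = ((1 + gamma) * np.linalg.norm(grad_f)
           + gamma * np.linalg.norm(grad_F(x0)))
    assert lhs <= rhs + 1e-9          # inequality (15)
```

The bound is not tight for generic points; it becomes most informative when $x_1$ nearly minimizes $f_{x_0}$, so that the first term on the right vanishes.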
We begin by bounding the error of the dual regularized ERM problem when the center of regularization changes. This characterizes the initial error at the beginning of each Dual APPA iteration.
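To illustrate the re-centering dynamic, here is a minimal sketch of the outer loop with the dual oracle idealized as an exact inner solve; this is an assumption for illustration (with quadratic losses the proximal subproblem is a linear system), not the actual Algorithm 3, whose inner solver is inexact. With exact solves, $\nabla f_{x_{t-1}}(x_t) = 0$, so Lemma C.1 predicts contraction of $\|\nabla F\|$ by the factor $\gamma = \lambda/(\lambda + \mu)$ per iteration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 20, 5, 1.0
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# F(x) = 0.5 ||Ax - b||^2; its strong convexity is mu = lambda_min(A^T A)
Q = A.T @ A
mu = np.linalg.eigvalsh(Q).min()
gamma = lam / (lam + mu)
grad_F = lambda x: Q @ x - A.T @ b

x = np.zeros(d)
norms = [np.linalg.norm(grad_F(x))]
for _ in range(10):
    # exact proximal (re-centering) step: x <- argmin_z F(z) + (lam/2)||z - x||^2
    x = np.linalg.solve(Q + lam * np.eye(d), A.T @ b + lam * x)
    norms.append(np.linalg.norm(grad_F(x)))

# geometric decay of the gradient norm at rate gamma = lam/(lam + mu)
for prev, cur in zip(norms, norms[1:]):
    assert cur <= gamma * prev + 1e-9
```

Dual APPA replaces the exact solve with an approximate dual maximizer whose error contracts by the oracle factor $\sigma$; the analysis below quantifies how much inexactness the contraction tolerates.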

Lemma C.3 (Dual error is bounded across re-centering). Fix $y \in \mathbb{R}^n$ and $x \in \mathbb{R}^d$. Let $x' = \hat{x}_x(y)$. Then
\[
g_{x'}(y) - g^{\mathrm{opt}}_{x'} \le c_1\left( g_x(y) - g^{\mathrm{opt}}_x \right) + c_2\|\nabla F(x)\|^2,
\]
where
\[
c_1 = 1 + \frac{R^2 L (nR^2 L + \lambda)}{\lambda^2}\left( 1 + \frac{2\lambda(nR^2 L + \lambda)}{(\lambda + \mu)^2} \right)
\quad \text{and} \quad
c_2 = \frac{\lambda}{(\lambda + \mu)^2}.
\]
In other words, the dual error $g_s - g_s^{\mathrm{opt}}$ is bounded across a re-centering step by multiples of previous sub-optimality measurements, namely dual error and gradient norm.

Proof. We first show how re-centering, from $x$ to $x'$, increases the duality gap between $y$ and $x'$ (the new center). The dual function value changes from
\[
g_x(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - x^T A^T y
\]
to
\[
g_{x'}(y) = G(y) + \frac{1}{2\lambda}\|A^T y\|^2 - (x')^T A^T y
= g_x(y) + (x - x')^T A^T y
= g_x(y) + \lambda\|x - x'\|^2,
\]
where the last equality uses $x - x' = \frac{1}{\lambda}A^T y$; i.e., it increases by $\lambda\|x - x'\|^2$. Meanwhile, the primal function value changes from $f_x(x') = F(x') + \frac{\lambda}{2}\|x' - x\|^2$ to $f_{x'}(x') = F(x')$, i.e. it decreases by $\frac{\lambda}{2}\|x - x'\|^2$. Hence, in total, the new duality gap is
\[
\mathrm{gap}_{x'}(x', y) = \mathrm{gap}_x(x', y) + \frac{\lambda}{2}\|x - x'\|^2.
\]
Now let $x^{\mathrm{opt}}_x = \operatorname{argmin} f_x$. By the $(\lambda + \mu)$-strong convexity of $f_x$ (Lemma B.1), $\|x - x^{\mathrm{opt}}_x\| \le \frac{1}{\lambda + \mu}\|\nabla f_x(x)\| = \frac{1}{\lambda + \mu}\|\nabla F(x)\|$ and $\|x' - x^{\mathrm{opt}}_x\| \le \frac{1}{\lambda + \mu}\|\nabla f_x(x')\|$, so that
\[
\frac{\lambda}{2}\|x - x'\|^2 \le \lambda\left( \|x - x^{\mathrm{opt}}_x\|^2 + \|x' - x^{\mathrm{opt}}_x\|^2 \right) \le \frac{\lambda}{(\lambda + \mu)^2}\left( \|\nabla F(x)\|^2 + \|\nabla f_x(x')\|^2 \right),
\]
and hence
\[
\mathrm{gap}_{x'}(x', y) \le \mathrm{gap}_x(x', y) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla f_x(x')\|^2 + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2,
\]
where the first inequality follows by the $(\lambda + \mu)$-strong convexity of $f_x$.

Combining the last two chains of inequalities, and abbreviating $\bar{L} = nR^2 L + \lambda$, we bound the re-centered dual error of $y$:
\[
\begin{aligned}
g_{x'}(y) - g^{\mathrm{opt}}_{x'}
&\le \mathrm{gap}_{x'}(x', y) \\
&\le \mathrm{gap}_x(x', y) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla f_x(x')\|^2 + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2 \\
&\le \left( 1 + \frac{R^2 L \bar{L}}{\lambda^2} \right)\left( g_x(y) - g^{\mathrm{opt}}_x \right) + \frac{\lambda}{(\lambda + \mu)^2}\cdot\frac{2R^2 L \bar{L}^2}{\lambda^2}\left( g_x(y) - g^{\mathrm{opt}}_x \right) + \frac{\lambda}{(\lambda + \mu)^2}\|\nabla F(x)\|^2 \\
&= c_1\left( g_x(y) - g^{\mathrm{opt}}_x \right) + c_2\|\nabla F(x)\|^2,
\end{aligned}
\]
where the third inequality applies Corollary B.3 to the first term, and to the second term the $\bar{L}$-smoothness of $f_x$ together with Corollary B.4, that is:
\[
\|\nabla f_x(x')\|^2 \le 2\bar{L}\left( f_x(x') - f_x^{\mathrm{opt}} \right) \le \frac{2R^2 L \bar{L}^2}{\lambda^2}\left( g_x(y) - g^{\mathrm{opt}}_x \right). \tag{18}
\]
Recalling the choices of numerical constants $c_1$ and $c_2$, the claim is proven.

Finally, we prove that for an appropriate choice of oracle parameter $\sigma$ (independent of the target error $\epsilon$) the iterates produced by Dual APPA are bounded above by a geometric sequence. In doing so, it will be convenient to abbreviate the same numerical constant as in (18),
\[
c_3 = \frac{2R^2 L (nR^2 L + \lambda)^2}{\lambda^2}.
\]
Proposition C.4 (Convergence rate of Dual APPA in gradient norm). Define $\gamma = \lambda/(\lambda + \mu)$ and let $\tau$ be any scalar in $[\gamma, 1)$. Define
\[
r = \max\left\{ \frac{1}{\sqrt{2}},\ \sqrt{\tau} \right\}, \tag{19}
\]
and
\[
\sigma = \min\left\{ \frac{1}{2},\ \frac{\lambda}{c_3},\ \frac{2\lambda(\tau - \gamma)^2}{c_3(1 + \gamma)^2},\ \frac{r^4}{c_1 + c_2 c_3},\ \frac{(\tau - \gamma)^2}{(1 + \gamma)^2(c_1 + c_2 c_3)} \right\}. \tag{20}
\]
In the execution of Dual APPA (Algorithm 3), for every iteration $t$,
\[
g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}} \le \frac{c^2}{c_3}\, r^{2t}
\qquad \text{and} \qquad
\|\nabla F(x_t)\| \le c\, r^t,
\]
where $c = \|\nabla F(x_0)\|$.

Proof. The argument proceeds by strong induction on $t$. We begin the inductive argument with base case $t = 1$. For the first invariant, the algorithm takes $\sigma \le 1/2$ and we have
\[
g_{x_0}(y_1) - g^{\mathrm{opt}}_{x_0} \le \sigma\left( g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \right) \le \frac{\sigma}{2\lambda}\|\nabla F(x_0)\|^2 \le \frac{c^2}{c_3}\, r^2,
\]
because the algorithm explicitly takes $y_0 = \hat{y}(x_0)$, so that by Lemma B.5 (with $s = x_0$), $g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \le \mathrm{gap}_{x_0}(x_0, y_0) = \frac{1}{2\lambda}\|\nabla F(x_0)\|^2$. So this part of the claim holds, as $r^2 \ge 1/2$ and $\sigma \le \lambda/c_3$.

The second invariant holds immediately at $t = 0$, as $c = \|\nabla F(x_0)\|$. We will also explicitly handle the case $t = 1$ for the second invariant. By Corollary B.4 and our choice of $\sigma$, we have that
\[
\|\nabla f_{x_0}(x_1)\|^2 \le c_3\left( g_{x_0}(y_1) - g^{\mathrm{opt}}_{x_0} \right) \le c_3\,\sigma\left( g_{x_0}(y_0) - g^{\mathrm{opt}}_{x_0} \right) \le \frac{c_3\,\sigma}{2\lambda}\|\nabla F(x_0)\|^2 \le \left( \frac{\tau - \gamma}{1 + \gamma} \right)^2\|\nabla F(x_0)\|^2,
\]
and so by Corollary C.2,
\[
\|\nabla F(x_1)\| \le \tau\,\|\nabla F(x_0)\| = \tau c,
\]
and the claim follows as $r \ge \sqrt{\tau} \ge \tau$.

Now consider $t \ge 2$. Concerning the first invariant, the algorithm takes $\sigma \le r^4/(c_1 + c_2 c_3)$ and so we have
\[
\begin{aligned}
g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}}
&\le \sigma\left( g_{x_{t-1}}(y_{t-1}) - g^{\mathrm{opt}}_{x_{t-1}} \right) \\
&\le \sigma\left[ c_1\left( g_{x_{t-2}}(y_{t-1}) - g^{\mathrm{opt}}_{x_{t-2}} \right) + c_2\|\nabla F(x_{t-2})\|^2 \right] \\
&\le \sigma\left[ c_1\,\frac{c^2}{c_3}\, r^{2(t-1)} + c_2\, c^2\, r^{2(t-2)} \right]
\le \sigma\left( c_1 + c_2 c_3 \right)\frac{c^2}{c_3}\, r^{2t-4}
\le \frac{c^2}{c_3}\, r^{2t},
\end{aligned}
\]
where the second inequality is Lemma C.3 and the last uses $\sigma(c_1 + c_2 c_3) \le r^4$.

For the second invariant, we have already handled the cases $t \le 1$, and so assume $t \ge 2$. By Lemma C.1, Corollary B.4, and the choice of $\sigma$, we have that
\[
\begin{aligned}
\|\nabla F(x_t)\|
&\le (1 + \gamma)\,\|\nabla f_{x_{t-1}}(x_t)\| + \gamma\,\|\nabla F(x_{t-1})\| \\
&\le (1 + \gamma)\sqrt{c_3\left( g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}} \right)} + \gamma\, c\, r^{t-1} \\
&\le (1 + \gamma)\sqrt{\sigma\left( c_1 + c_2 c_3 \right)}\; c\, r^{t-2} + \gamma\, c\, r^{t-1} \\
&\le (\tau - \gamma)\, c\, r^{t-2} + \gamma\, c\, r^{t-2}
= \tau\, c\, r^{t-2}
\le c\, r^t,
\end{aligned}
\]
where the third inequality uses the intermediate bound on $g_{x_{t-1}}(y_t) - g^{\mathrm{opt}}_{x_{t-1}}$ established above, the fourth uses $\sigma \le (\tau - \gamma)^2/\left( (1 + \gamma)^2(c_1 + c_2 c_3) \right)$ together with $r \le 1$, and the final inequality is due to the choice $r \ge \sqrt{\tau}$. This completes the inductive argument.

Equipped with Proposition C.4, we translate it into a statement about the convergence rate of Dual APPA in error, rather than in gradient norm, to prove Theorem .0:

Proof of Theorem .0. Define $\gamma$, $\tau$, $r$, and $\sigma$ as in Proposition C.4. In the execution of Dual APPA (Algorithm 3), at each iteration $t$, we claim that
\[
F(x_t) - F^{\mathrm{opt}} \le \frac{nR^2 L}{\mu}\,\epsilon_0\, r^{2t}, \tag{24}
\]

where $\epsilon_0 = F(x_0) - F^{\mathrm{opt}}$. Consequently, for any $\epsilon > 0$, in order to achieve $\epsilon$ error in Dual APPA, it suffices to take a number of iterations
\[
T \ge \frac{1}{2\log(1/r)}\left( \log\frac{nR^2 L}{\mu} + \log\frac{\epsilon_0}{\epsilon} \right).
\]
The first part of the claim follows directly from Proposition C.4, using the smoothness and strong convexity of $F$. The second part follows from a calculation sufficient to make the right-hand side of (24) smaller than a given $\epsilon$.

D. Convergence analysis of Accelerated APPA

The goal of this section is to establish the convergence rate of Accelerated APPA, and prove Theorem .6. Note that the results in this section use nothing about the structure of $F$ other than strong convexity, and thus they apply to the general ERM problem (1); they are written in greater generality to make this distinction clear. Aspects of the proofs in this section bear resemblance to the analysis in Shalev-Shwartz & Zhang (2014), which achieves similar results in a more specialized setting.

The rest of this section is structured as follows. In Lemma D.1 we show that applying a primal oracle to the inner minimization problem gives us a quadratic lower bound on the objective. In Lemma D.2 we use this lower bound to construct a series of lower bounds for the main objective function $f$ and accelerate the APPA algorithm; this comprises the bulk of the analysis. In Lemma D.3 we show that the requirements of Lemma D.2 can be met by using a primal oracle that decreases the error by a constant factor. In Lemma D.4 we analyze the initial error requirements of Lemma D.2. In Lemma D.5 we provide an auxiliary lemma for combining two quadratic functions, which we use in the earlier proofs. The proof of Theorem .6 follows immediately from these lemmas.

Lemma D.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $\mu$-strongly convex and let $f_{x_0,\lambda}(x) \stackrel{\text{def}}{=} f(x) + \frac{\lambda}{2}\|x - x_0\|^2$. Suppose $x^{+}$ satisfies
\[
f_{x_0,\lambda}(x^{+}) \le \min_{x \in \mathbb{R}^n} f_{x_0,\lambda}(x) + \epsilon.
\]
Then for $\bar{\mu} \stackrel{\text{def}}{=} \mu/2$, $g \stackrel{\text{def}}{=} \lambda(x_0 - x^{+})$, and all $x \in \mathbb{R}^n$ we have
\[
f(x) \ge f(x^{+}) + \langle g, x - x_0 \rangle + \frac{\bar{\mu}}{2}\|x - x_0\|^2 + \frac{1}{2\lambda}\|g\|^2 - 2\epsilon.
\]
Note that as $\bar{\mu} = \mu/2$ we are only losing a factor of $2$ in the strong convexity parameter for our lower bound.
This allows us to account for the error without loss in our ultimate convergence rates.

Proof. Let $x^{\mathrm{opt}} = \operatorname{argmin}_{x \in \mathbb{R}^n} f_{x_0,\lambda}(x)$. Since $f$ is $\mu$-strongly convex, clearly $f_{x_0,\lambda}$ is $(\lambda + \mu)$-strongly convex, and by Lemma B.1
\[
f_{x_0,\lambda}(x) \ge f_{x_0,\lambda}(x^{\mathrm{opt}}) + \frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2. \tag{25}
\]
By the triangle inequality and Cauchy–Schwarz we know
\[
\|x - x^{+}\|^2 \le \left( \|x - x^{\mathrm{opt}}\| + \|x^{\mathrm{opt}} - x^{+}\| \right)^2 \le 2\|x - x^{\mathrm{opt}}\|^2 + 2\|x^{\mathrm{opt}} - x^{+}\|^2,
\]

which implies
\[
\frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2 \ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - \frac{\lambda + \mu}{2}\|x^{\mathrm{opt}} - x^{+}\|^2.
\]
On the other hand, since $f_{x_0,\lambda}(x^{+}) \le f_{x_0,\lambda}(x^{\mathrm{opt}}) + \epsilon$, by (25) we have $\frac{\lambda + \mu}{2}\|x^{+} - x^{\mathrm{opt}}\|^2 \le \epsilon$, and therefore
\[
\begin{aligned}
f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{+})
&\ge f_{x_0,\lambda}(x) - f_{x_0,\lambda}(x^{\mathrm{opt}}) - \epsilon
\ge \frac{\lambda + \mu}{2}\|x - x^{\mathrm{opt}}\|^2 - \epsilon \\
&\ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - \frac{\lambda + \mu}{2}\|x^{\mathrm{opt}} - x^{+}\|^2 - \epsilon
\ge \frac{\lambda + \mu}{4}\|x - x^{+}\|^2 - 2\epsilon.
\end{aligned}
\]
Now since
\[
\|x - x^{+}\|^2 = \left\| x - x_0 + \tfrac{1}{\lambda}g \right\|^2 = \|x - x_0\|^2 + \tfrac{2}{\lambda}\langle g, x - x_0 \rangle + \tfrac{1}{\lambda^2}\|g\|^2,
\]
and using the fact that $f_{x_0,\lambda}(x) = f(x) + \frac{\lambda}{2}\|x - x_0\|^2$, collecting terms yields the claimed quadratic lower bound. The right-hand side of the claim is a quadratic function of $x$; setting its gradient with respect to $x$ to zero, we see that it obtains its minimum at $x = x_0 - \frac{1}{\bar{\mu}}g$, with minimum value $f(x^{+}) - \left( \frac{1}{2\bar{\mu}} - \frac{1}{2\lambda} \right)\|g\|^2 - 2\epsilon$.

Lemma D.2. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $\mu$-strongly convex, and suppose that in each iteration $t$ we have a lower bound of the form $\psi_t(x) = \psi_t^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_t\|^2$ such that $f(x) \ge \psi_t(x)$ for all $x$. Let
\[
\rho \stackrel{\text{def}}{=} \frac{\lambda + \bar{\mu}}{\bar{\mu}}, \qquad
f_{\lambda, y}(x) \stackrel{\text{def}}{=} f(x) + \frac{\lambda}{2}\|x - y\|^2, \qquad
y_t \stackrel{\text{def}}{=} \frac{x_t + \rho^{-1/2} v_t}{1 + \rho^{-1/2}},
\]
and suppose
\[
\mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right), \qquad
g_t \stackrel{\text{def}}{=} \lambda(y_t - x_{t+1}), \qquad
v_{t+1} \stackrel{\text{def}}{=} \left( 1 - \rho^{-1/2} \right) v_t + \rho^{-1/2}\left( y_t - \tfrac{1}{\bar{\mu}} g_t \right).
\]
We have that
\[
\mathbb{E}\left[ f(x_t) - \psi_t^{\mathrm{opt}} \right] \le \left( 1 - \tfrac{1}{2}\rho^{-1/2} \right)^t \left( f(x_0) - \psi_0^{\mathrm{opt}} \right).
\]
Proof. Regardless of how $y_t$ is chosen, we know by Lemma D.1 (applied with center $y_t$ and error $\epsilon_t \stackrel{\text{def}}{=} f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t}$) that for all $x \in \mathbb{R}^n$,
\[
f(x) \ge f(x_{t+1}) + \langle g_t, x - y_t \rangle + \frac{\bar{\mu}}{2}\|x - y_t\|^2 + \frac{1}{2\lambda}\|g_t\|^2 - 2\epsilon_t. \tag{26}
\]

Thus, for $\beta = 1 - \rho^{-1/2}$ we can let
\[
\psi_{t+1}(x) \stackrel{\text{def}}{=} \beta\,\psi_t(x) + (1 - \beta)\left[ f(x_{t+1}) + \langle g_t, x - y_t \rangle + \frac{\bar{\mu}}{2}\|x - y_t\|^2 + \frac{1}{2\lambda}\|g_t\|^2 - 2\epsilon_t \right],
\]
which is a valid lower bound on $f$ by (26) and the hypothesis $f \ge \psi_t$. The bracketed quadratic attains its minimum $f(x_{t+1}) + \left( \frac{1}{2\lambda} - \frac{1}{2\bar{\mu}} \right)\|g_t\|^2 - 2\epsilon_t$ at the point $y_t - \frac{1}{\bar{\mu}}g_t$, so by Lemma D.5, $\psi_{t+1}(x) = \psi_{t+1}^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_{t+1}\|^2$ with $v_{t+1} = \beta v_t + (1 - \beta)\left( y_t - \frac{1}{\bar{\mu}}g_t \right)$, as defined, and
\[
\psi_{t+1}^{\mathrm{opt}} = \beta\,\psi_t^{\mathrm{opt}} + (1 - \beta)\left[ f(x_{t+1}) + \left( \tfrac{1}{2\lambda} - \tfrac{1}{2\bar{\mu}} \right)\|g_t\|^2 - 2\epsilon_t \right] + \frac{\bar{\mu}}{2}\beta(1 - \beta)\left\| v_t - y_t + \tfrac{1}{\bar{\mu}}g_t \right\|^2.
\]
Furthermore, expanding the term $\left\| v_t - y_t + \frac{1}{\bar{\mu}}g_t \right\|^2$ and instantiating $x$ with $x_t$ in (26) yields
\[
f(x_{t+1}) \le f(x_t) - \langle g_t, x_t - y_t \rangle - \frac{\bar{\mu}}{2}\|x_t - y_t\|^2 - \frac{1}{2\lambda}\|g_t\|^2 + 2\epsilon_t.
\]
Consequently we know (dropping the nonpositive squared-norm terms)
\[
f(x_{t+1}) - \psi_{t+1}^{\mathrm{opt}} \le \beta\left( f(x_t) - \psi_t^{\mathrm{opt}} \right) - \beta\left\langle g_t,\ (x_t - y_t) + (1 - \beta)(v_t - y_t) \right\rangle + \left( \frac{(1 - \beta)^2}{2\bar{\mu}} - \frac{1}{2\lambda} \right)\|g_t\|^2 + 2\epsilon_t.
\]
Note that we have chosen $y_t$ precisely so that the inner product term equals $0$: since $1 - \beta = \rho^{-1/2}$ and $y_t = (x_t + \rho^{-1/2} v_t)/(1 + \rho^{-1/2})$, we have $(x_t - y_t) + (1 - \beta)(v_t - y_t) = 0$. Moreover, the choice $\beta = 1 - \rho^{-1/2}$ ensures $\frac{(1 - \beta)^2}{2\bar{\mu}} - \frac{1}{2\lambda} \le 0$, since $(1 - \beta)^2 = \rho^{-1} = \frac{\bar{\mu}}{\lambda + \bar{\mu}}$, so that $\frac{(1 - \beta)^2}{2\bar{\mu}} = \frac{1}{2(\lambda + \bar{\mu})} \le \frac{1}{2\lambda}$. Also, by assumption we know
\[
\mathbb{E}[\epsilon_t] = \mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right),
\]
which implies
\[
\mathbb{E}\left[ f(x_{t+1}) - \psi_{t+1}^{\mathrm{opt}} \right] \le \left( \beta + \frac{\rho^{-3/2}}{2} \right)\left( f(x_t) - \psi_t^{\mathrm{opt}} \right) \le \left( 1 - \tfrac{1}{2}\rho^{-1/2} \right)\left( f(x_t) - \psi_t^{\mathrm{opt}} \right).
\]
In the final step we are using the fact that $\rho^{-3/2} \le \rho^{-1/2}$, as $\rho \ge 1$.

Lemma D.3. Under the setting of Lemma D.2, we have
\[
f_{\lambda, y_t}(x_t) - f^{\mathrm{opt}}_{\lambda, y_t} \le f(x_t) - \psi_t^{\mathrm{opt}}.
\]
In particular, in order to achieve $\mathbb{E}\left[ f_{\lambda, y_t}(x_{t+1}) - f^{\mathrm{opt}}_{\lambda, y_t} \right] \le \frac{\rho^{-3/2}}{4}\left( f(x_t) - \psi_t^{\mathrm{opt}} \right)$, we only need an oracle that shrinks the function error by a factor of $\frac{\rho^{-3/2}}{4}$ in expectation.

Proof. We know
\[
f_{\lambda, y_t}(x_t) - f(x_t) = \frac{\lambda}{2}\|x_t - y_t\|^2 = \frac{\lambda\,\rho^{-1}}{2\left( 1 + \rho^{-1/2} \right)^2}\|x_t - v_t\|^2,
\]
since $x_t - y_t = \frac{\rho^{-1/2}}{1 + \rho^{-1/2}}(x_t - v_t)$. We will show that the lower-bound optimum $f^{\mathrm{opt}}_{\lambda, y_t}$ is larger than $\psi_t^{\mathrm{opt}}$ by at least the same amount. This is because for all $x$ we have
\[
f_{\lambda, y_t}(x) = f(x) + \frac{\lambda}{2}\|x - y_t\|^2 \ge \psi_t^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_t\|^2 + \frac{\lambda}{2}\|x - y_t\|^2.
\]
The right-hand side is a quadratic function, whose optimal point is at $x = \frac{\bar{\mu} v_t + \lambda y_t}{\bar{\mu} + \lambda}$ and whose optimal value is equal to
\[
\psi_t^{\mathrm{opt}} + \frac{\bar{\mu}\lambda}{2(\bar{\mu} + \lambda)}\|v_t - y_t\|^2
= \psi_t^{\mathrm{opt}} + \frac{\lambda\,\rho^{-1}}{2\left( 1 + \rho^{-1/2} \right)^2}\|x_t - v_t\|^2,
\]
where we used $v_t - y_t = \frac{1}{1 + \rho^{-1/2}}(v_t - x_t)$ and, by the definition of $\rho$, $\frac{\bar{\mu}\lambda}{\bar{\mu} + \lambda} = \frac{\lambda}{\rho}$. Therefore
\[
f_{\lambda, y_t}(x_t) - f^{\mathrm{opt}}_{\lambda, y_t} \le f(x_t) - \psi_t^{\mathrm{opt}}.
\]
Remark. In the lemma above, moving to the regularized problem has the same effect on the primal function value as on the lower bound. This is a result of the choice of $\beta$ in the proof of Lemma D.2. This does not mean that the choice of $\beta$ is very fragile, however: we can choose any $\beta$ between the current value and $1$; the effect on this lemma would be that the increase in the primal function value becomes smaller than the increase in the lower bound, so the lemma continues to hold.

Lemma D.4. Let $\psi_0^{\mathrm{opt}} = f(x_0) - 2\left( f(x_0) - f^{\mathrm{opt}} \right)$ and $v_0 = x_0$; then $\psi_0(x) \stackrel{\text{def}}{=} \psi_0^{\mathrm{opt}} + \frac{\bar{\mu}}{2}\|x - v_0\|^2$ is a valid lower bound for $f$. In particular, the initial error satisfies $f(x_0) - \psi_0^{\mathrm{opt}} = 2\left( f(x_0) - f^{\mathrm{opt}} \right)$.

Proof. This lemma is a direct corollary of Lemma D.1 with $x^{+} = x_0$, so that $g = 0$ and $\epsilon \le f(x_0) - f^{\mathrm{opt}}$.

Lemma D.5. Suppose that for all $x$ we have $f_1(x) \stackrel{\text{def}}{=} \psi_1 + \frac{\bar{\mu}}{2}\|x - v_1\|^2$ and $f_2(x) \stackrel{\text{def}}{=} \psi_2 + \frac{\bar{\mu}}{2}\|x - v_2\|^2$. Then
\[
\alpha f_1(x) + (1 - \alpha) f_2(x) = \psi_\alpha + \frac{\bar{\mu}}{2}\|x - v_\alpha\|^2,
\]
where
\[
v_\alpha = \alpha v_1 + (1 - \alpha) v_2
\quad \text{and} \quad
\psi_\alpha = \alpha \psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\alpha(1 - \alpha)\|v_1 - v_2\|^2.
\]
Proof. Setting the gradient of $\alpha f_1(x) + (1 - \alpha) f_2(x)$ to $0$, we see that $v_\alpha$ must satisfy
\[
\alpha\left( v_\alpha - v_1 \right) + (1 - \alpha)\left( v_\alpha - v_2 \right) = 0,
\]
and thus $v_\alpha = \alpha v_1 + (1 - \alpha) v_2$. Finally,
\[
\begin{aligned}
\psi_\alpha
&= \alpha\left[ \psi_1 + \frac{\bar{\mu}}{2}\|v_\alpha - v_1\|^2 \right] + (1 - \alpha)\left[ \psi_2 + \frac{\bar{\mu}}{2}\|v_\alpha - v_2\|^2 \right] \\
&= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\left[ \alpha(1 - \alpha)^2 + \alpha^2(1 - \alpha) \right]\|v_1 - v_2\|^2 \\
&= \alpha\psi_1 + (1 - \alpha)\psi_2 + \frac{\bar{\mu}}{2}\alpha(1 - \alpha)\|v_1 - v_2\|^2.
\end{aligned}
\]
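Lemma D.5 is a purely algebraic identity, and is easy to verify numerically; the dimension, curvature, and mixing weight below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
k, mu_bar, alpha = 6, 0.9, 0.3
psi1, psi2 = rng.normal(), rng.normal()
v1, v2 = rng.normal(size=k), rng.normal(size=k)

f1 = lambda x: psi1 + 0.5 * mu_bar * np.sum((x - v1) ** 2)
f2 = lambda x: psi2 + 0.5 * mu_bar * np.sum((x - v2) ** 2)

# Lemma D.5: a convex combination of two quadratics with the same curvature
# is again such a quadratic, with the stated center and offset
v_a = alpha * v1 + (1 - alpha) * v2
psi_a = (alpha * psi1 + (1 - alpha) * psi2
         + 0.5 * mu_bar * alpha * (1 - alpha) * np.sum((v1 - v2) ** 2))
f_a = lambda x: psi_a + 0.5 * mu_bar * np.sum((x - v_a) ** 2)

for _ in range(20):
    x = rng.normal(size=k)
    lhs = alpha * f1(x) + (1 - alpha) * f2(x)
    assert abs(lhs - f_a(x)) < 1e-9
```

This is exactly the combination step used to maintain the estimate sequence $\psi_t$ in the proof of Lemma D.2.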