arxiv: v2 [math.oc] 25 Mar 2018


arXiv:1711.05812v2 [math.OC] 25 Mar 2018

Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming

Yangyang Xu

Abstract. The augmented Lagrangian method (ALM) has been popularly used for solving constrained optimization problems. In practice, the subproblems for updating the primal variables in the framework of ALM can usually only be solved inexactly. The convergence and local convergence speed of ALM have been extensively studied. However, the global convergence rate of inexact ALM is still open for problems with nonlinear inequality constraints. In this paper, we work on general convex programs with both equality and inequality constraints. For these problems, we establish the global convergence rate of inexact ALM and estimate its iteration complexity in terms of the number of gradient evaluations needed to produce a solution with a specified accuracy. We first establish an ergodic convergence rate result of inexact ALM that uses constant penalty parameters or geometrically increasing penalty parameters. Based on the convergence rate result, we apply Nesterov's optimal first-order method to each primal subproblem and estimate the iteration complexity of the inexact ALM. We show that if the objective is convex, then $O(\varepsilon^{-1})$ gradient evaluations are sufficient to guarantee an $\varepsilon$-optimal solution in terms of both primal objective and feasibility violation. If the objective is strongly convex, the result can be improved to $O(\varepsilon^{-1/2}|\log\varepsilon|)$. Finally, by relating to the inexact proximal point algorithm, we establish a nonergodic convergence rate result of inexact ALM that uses geometrically increasing penalty parameters. We show that the nonergodic iteration complexity result is of the same order as that for the ergodic result. Numerical experiments on quadratically constrained quadratic programming are conducted to compare the performance of the inexact ALM with different settings.
Keywords: augmented Lagrangian method (ALM), nonlinearly constrained problem, first-order method, global convergence rate, iteration complexity

Mathematics Subject Classification: 90C06, 90C25, 68W40, 49M27

Y. Xu, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. E-mail: xuy21@rpi.edu

1 Introduction

In this paper, we consider the constrained convex programming problem

$$\min_{x\in X}\ f_0(x), \quad \text{s.t. } Ax = b,\ f_i(x)\le 0,\ i=1,\ldots,m, \tag{1}$$

where $X\subseteq\mathbb{R}^n$ is a closed convex set, $A$ and $b$ are respectively a given matrix and vector, and $f_i$ is a convex function for every $i=0,1,\ldots,m$. Any convex optimization problem can be written in the standard form of (1). It appears in many areas including statistics, machine learning, data mining, engineering, signal processing, finance, operations research, and so on.

Note that the constraint $x\in X$ can be equivalently represented by an inequality constraint $\iota_X(x)\le 0$ or by adding $\iota_X(x)$ to the objective, where $\iota_X$ denotes the indicator function of $X$. However, we keep it explicit for technical reasons. In addition, every affine constraint $a_j^\top x = b_j$ can be equivalently represented by two inequality constraints: $a_j^\top x - b_j\le 0$ and $-a_j^\top x + b_j\le 0$. Doing so does not change the theoretical results of an algorithm but makes the problem computationally more difficult.

One popular method for solving (1) is the augmented Lagrangian method (ALM), which first appeared in [16, 9]. ALM alternately updates the primal variable and the Lagrangian multipliers. At each update, the primal variable is renewed by minimizing the augmented Lagrangian (AL) function and the multipliers by a dual gradient ascent step. The global convergence and local convergence rate of ALM have been extensively studied; see the books [4, 5]. Several recent works (e.g., [14, 15]) establish the global convergence rate of ALM and/or its variants for affinely constrained problems. In the framework of ALM, the primal subproblem can usually only be solved inexactly, and thus in practice the inexact ALM (iALM) is often used. However, to the best of our knowledge, the global convergence rate of iALM for problems with nonlinear inequality constraints still remains open (see footnote 1). We address this open question in this work and also establish the iteration complexity of iALM in terms of the number of gradient evaluations. The iteration complexity result appears to be optimal.

We will assume composite convex structure on (1).
More specifically, we assume

$$f_0(x) = g(x) + h(x), \tag{2}$$

where $g$ is a Lipschitz differentiable convex function, and $h$ is a simple (see footnote 2), possibly nondifferentiable, convex function. Also, $f_i$ is convex and Lipschitz differentiable for every $i\in[m]$, namely, there are constants $L_0, L_1,\ldots,L_m$ such that

$$\|\nabla g(\hat x)-\nabla g(\tilde x)\|\le L_0\|\hat x-\tilde x\|,\quad \forall\,\hat x,\tilde x\in \mathrm{dom}(h)\cap X, \tag{3a}$$

$$\|\nabla f_i(\hat x)-\nabla f_i(\tilde x)\|\le L_i\|\hat x-\tilde x\|,\quad \forall\,\hat x,\tilde x\in \mathrm{dom}(h)\cap X,\ \forall\, i\in[m]. \tag{3b}$$

In addition, we assume the boundedness of $\mathrm{dom}(h)\cap X$ and denote its diameter as

$$D = \max_{\hat x,\tilde x\in \mathrm{dom}(h)\cap X}\|\hat x-\tilde x\|.$$

Footnote 1: Although the global convergence rate in terms of the augmented dual objective can be easily shown from existing works (e.g., see our discussion in section 5), that does not indicate the convergence speed from the perspective of the primal objective and feasibility.

Footnote 2: By simple, we mean that the proximal mapping of $h$ is easy to evaluate, i.e., it is easy to find a solution to $\min_{x\in X}\ h(x)+\frac{1}{2\gamma}\|x-\hat x\|^2$ for any $\hat x$ and $\gamma>0$.

1.1 Augmented Lagrangian function

In the literature, there are several different penalty terms used in an augmented Lagrangian (AL) function, such as the classic one [30, 31], the quadratic penalty on constraint violation [3], and the exponential penalty [34]. The work [] gives a general class of augmented penalty functions that satisfy certain properties. In

this paper, we use the classic one. As discussed below, it can be derived from a quadratic penalty on an equivalent equality-constrained problem.

Introducing nonnegative slack variables $s_i$, one can write (1) in the equivalent form

$$\min_{x\in X,\ s\ge 0}\ f_0(x),\quad \text{s.t. } Ax=b,\ f_i(x)+s_i=0,\ i=1,\ldots,m. \tag{4}$$

With a quadratic penalty on the equality constraints, the AL function of (4) is

$$\tilde L_\beta(x,s,y,z) = f_0(x)+y^\top(Ax-b)+\sum_{i=1}^m z_i\big(f_i(x)+s_i\big)+\frac{\beta}{2}\|Ax-b\|^2+\frac{\beta}{2}\sum_{i=1}^m\big(f_i(x)+s_i\big)^2,$$

where $y$ and $z$ are multipliers, and $\beta$ is the augmented penalty parameter. Minimizing $\tilde L_\beta$ with respect to $s\ge 0$ while fixing $x$, $y$ and $z$, we have the optimal $s$ given by

$$s_i=\Big[-\frac{z_i}{\beta}-f_i(x)\Big]_+,\quad i=1,\ldots,m.$$

Plugging the above $s$ into $\tilde L_\beta$ gives

$$\tilde L_\beta(x,s,y,z)=f_0(x)+y^\top(Ax-b)+\frac{\beta}{2}\|Ax-b\|^2+\sum_{i=1}^m\psi_\beta\big(f_i(x),z_i\big),$$

where

$$\psi_\beta(u,v)=\begin{cases} uv+\dfrac{\beta}{2}u^2, & \text{if } \beta u+v\ge 0,\\[4pt] -\dfrac{v^2}{2\beta}, & \text{if } \beta u+v<0.\end{cases} \tag{5}$$

Let

$$\Psi_\beta(x,z)=\sum_{i=1}^m\psi_\beta\big(f_i(x),z_i\big),$$

and we obtain the classical AL function of (1):

$$L_\beta(x,y,z)=f_0(x)+y^\top(Ax-b)+\frac{\beta}{2}\|Ax-b\|^2+\Psi_\beta(x,z). \tag{6}$$

1.2 Inexact augmented Lagrangian method

The augmented Lagrangian method (ALM) was proposed in [16, 9]. Within each iteration, ALM first updates the $x$ variable by minimizing the AL function with respect to $x$ while fixing $y$ and $z$, and then it performs a dual gradient ascent update to $y$ and $z$. In general, it is difficult to minimize the AL function exactly with respect to $x$. A more realistic way is to solve the $x$-subproblem within a tolerance error, which leads to the inexact ALM. Its pseudocode is given in Algorithm 1 below. If $\varepsilon_k=0$ for all $k$, it reduces to the ALM.

It is shown in [30] that the augmented dual function (see footnote 3) $d_\beta(y,z)=\min_{x\in X}L_\beta(x,y,z)$ is continuously differentiable, and $\nabla d_\beta$ is Lipschitz continuous with constant $\frac{1}{\beta}$. (Footnote 3: Although [30] only considers the inequality-constrained case, the results derived there apply to the case with both equality and inequality constraints.) In addition, it turns out that the inexact

ALM is an inexact augmented dual gradient ascent [31], and thus the convergence rate of the inexact ALM in terms of $d_\beta$ can be shown from existing results about the inexact gradient method [33]. However, it is unclear from there how to get a convergence rate result from the perspective of the primal problem, especially as $\beta$ varies at different updates. Our analysis will be different from this line, and our results will be based on the primal problem. In addition, our requirement on the tolerance errors is weaker than that in [31].

Algorithm 1: Inexact augmented Lagrangian method for (1)
1  Initialization: choose $x^0, y^0, z^0$ and $\{\beta_k,\rho_k\}$
2  for $k=0,1,\ldots$ do
3      Find $x^{k+1}\in X$ such that
       $$L_{\beta_k}(x^{k+1},y^k,z^k)\le \min_{x\in X}L_{\beta_k}(x,y^k,z^k)+\varepsilon_k. \tag{7}$$
4      Update $y$ and $z$ by
       $$y^{k+1}=y^k+\rho_k(Ax^{k+1}-b), \tag{8}$$
       $$z_i^{k+1}=z_i^k+\rho_k\cdot\max\Big(-\frac{z_i^k}{\beta_k},\ f_i(x^{k+1})\Big),\quad i=1,\ldots,m. \tag{9}$$

The main results we establish in this paper are summarized as follows. Both ergodic and nonergodic convergence rate results are established.

Theorem 1 (Summary of main results) For a given $\varepsilon>0$, choose a positive integer $K$ and numbers $C_\beta>0$, $C_\varepsilon>0$. Let $\{(x^k,y^k,z^k)\}_{k=0}^K$ be the iterates generated from Algorithm 1 with parameters set according to one of the following:

1. $\rho_k=\beta_k=\frac{C_\beta}{K\varepsilon}$, $\varepsilon_k=\frac{C_\varepsilon}{2C_\beta}\varepsilon$, $\forall k$.
2. $\rho_k=\beta_k=\beta_g\sigma^k$, for certain $\beta_g>0$ and $\sigma>1$ such that $\sum_{k=0}^{K-1}\beta_k=\frac{C_\beta}{\varepsilon}$, and $\varepsilon_k=\frac{C_\varepsilon}{2C_\beta}\varepsilon$, $\forall k$.
3. $\rho_k=\beta_k=\beta_g\sigma^k$, for certain $\beta_g>0$ and $\sigma>1$ such that $\sum_{k=0}^{K-1}\beta_k=\frac{C_\beta}{\varepsilon}$. If $f_0$ is convex, let $\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{1}{\beta_k^{1/3}\sum_{t=0}^{K-1}\beta_t^{2/3}}$, $\forall k$, and if $f_0$ is strongly convex, let $\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{1}{\beta_k^{1/2}\sum_{t=0}^{K-1}\beta_t^{1/2}}$, $\forall k$.

Then the averaged point $\bar x^K=\frac{\sum_{k=0}^{K-1}\rho_k x^{k+1}}{\sum_{t=0}^{K-1}\rho_t}$ is an $O(\varepsilon)$-optimal solution (see Definition 1), where the hidden constant depends on $C_\beta$, $C_\varepsilon$ and the dual solution $(y^*,z^*)$, and for the second and third settings, the actual point $x^K$ is also an $O(\varepsilon)$-optimal solution. In addition, the total number of evaluations of $\nabla g$ and $\nabla f_i$, $i=1,\ldots,m$, is $O(\varepsilon^{-1})$ if $f_0$ is convex and $O(\varepsilon^{-1/2}|\log\varepsilon|)$ if $f_0$ is strongly convex.
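The loop in Algorithm 1 can be illustrated on a small example. The following is a minimal sketch, not the paper's implementation: the toy problem (minimize $x_1^2+x_2^2$ subject to $x_1+x_2=2$ and $1.5-x_1\le 0$), the plain gradient-descent inner solver, the iteration counts, and all function names are assumptions made for illustration. It also checks that $\psi_\beta$ in (5) agrees with the slack-variable form it is derived from.

```python
def psi(beta, u, v):
    """psi_beta(u, v) as in (5)."""
    if beta * u + v >= 0:
        return u * v + 0.5 * beta * u * u
    return -v * v / (2.0 * beta)


def psi_via_slack(beta, u, v):
    """Same quantity, obtained by plugging in the optimal slack s = [-v/beta - u]_+."""
    s = max(0.0, -v / beta - u)
    return v * (u + s) + 0.5 * beta * (u + s) ** 2


def al_grad(x, y, z, beta):
    """Gradient in x of the AL function (6) for the toy problem (assumed example)."""
    r = x[0] + x[1] - 2.0              # equality residual Ax - b
    f1 = 1.5 - x[0]                    # inequality constraint value f_1(x)
    w = max(0.0, z + beta * f1)        # [z + beta * f_1(x)]_+
    common = y + beta * r
    # grad f_0 = (2 x1, 2 x2); grad f_1 = (-1, 0)
    return (2.0 * x[0] + common - w, 2.0 * x[1] + common)


def ialm(beta=10.0, outer=30, inner=200):
    """Inexact ALM with rho_k = beta_k = beta; the inner solver is plain gradient descent."""
    x, y, z = (0.0, 0.0), 0.0, 0.0
    step = 1.0 / (2.0 + 3.0 * beta)    # 1 / (Lipschitz bound of the AL gradient here)
    for _ in range(outer):
        for _ in range(inner):         # approximate x-subproblem solve, cf. (7)
            g1, g2 = al_grad(x, y, z, beta)
            x = (x[0] - step * g1, x[1] - step * g2)
        y = y + beta * (x[0] + x[1] - 2.0)             # update (8)
        z = z + beta * max(-z / beta, 1.5 - x[0])      # update (9)
    return x, y, z
```

For this toy problem the KKT point is $x^*=(1.5,0.5)$ with multipliers $y^*=-1$ and $z^*=2$, which gives a simple way to sanity-check the iterates returned by `ialm()`.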
The formal statement and the hidden constants are shown in Theorem 5 for the first setting, in Theorems 6 and 9 for the second setting, and in Theorems 7 and 9 for the third setting.

1.3 Contributions

The contributions of this paper are mainly on establishing the global convergence rate and estimating the iteration complexity of iALM for solving (1), in both the ergodic and the nonergodic sense. They are listed as follows.

- We first establish an ergodic convergence rate result of iALM through a novel analysis. With penalty parameters fixed to a constant or increased geometrically, we choose the tolerance errors accordingly. By applying Nesterov's optimal first-order method to each $x$-subproblem, we show that to reach an $\varepsilon$-optimal solution, $O(\varepsilon^{-1})$ gradient evaluations are sufficient if the objective is convex, and the order is improved to $O(\varepsilon^{-1/2}|\log\varepsilon|)$ if the objective is strongly convex. For the convex case, the result is optimal, and for the strongly convex case, the result also appears to be the best compared to existing ones in the literature; see the discussions in Remark 2.

  We note that if iALM runs for only one iteration, i.e., a single $x$-subproblem (7) is solved, then the algorithm reduces to the penalty method by setting the initial multipliers to zero vectors. Hence, as a byproduct, we establish the iteration complexity result of the penalty method for solving (1). The work [19] analyzes the first-order penalty method for solving affinely constrained problems. To have an $\varepsilon$-optimal solution, $O(\varepsilon^{-2})$ gradient evaluations are required if the penalty method is applied to the original problem, and $O(\varepsilon^{-1}|\log\varepsilon|)$ gradient evaluations are required if the penalty method is applied to a carefully perturbed problem. Hence, our result is better than the former case by an order of $O(\varepsilon^{-1})$ and the latter case by $O(|\log\varepsilon|)$.

- By relating to the inexact proximal point algorithm (iPPA), we then establish a nonergodic convergence rate result of iALM through applying existing results in [3] about iALM. We show that with geometrically increasing penalty parameters, the nonergodic convergence of iALM enjoys the same order as that of the ergodic convergence, and the constant is just a few times larger. Compared to one recent nonergodic convergence result [1] for solving affinely constrained problems, our result is better by an order of $O(\varepsilon^{-2/3})$.
1.4 Notation

For simplicity, throughout the paper, we focus on a finite-dimensional Euclidean space, but our analysis can be directly extended to a general Hilbert space.

We use italic letters $a, c, B, L, \ldots$ for scalars, bold lower-case letters $x, y, z, \ldots$ for vectors, and bold upper-case letters $A, B, \ldots$ for matrices. $z_i$ denotes the $i$-th entry of a vector $z$. We use $0$ to denote a vector of all zeros, and its size is clear from the context. $[m]$ denotes the set $\{1,2,\ldots,m\}$ for any positive integer $m$. Given a real number $a$, we let $[a]_+=\max(0,a)$ and $\lceil a\rceil$ be the smallest integer that is no less than $a$. For a vector $a$, $[a]_+$ takes the positive part of $a$ in a component-wise manner. $\|a\|$ denotes the Euclidean norm of a vector $a$ and $\|A\|$ the spectral norm of a matrix $A$.

We denote $l$ as the vector consisting of $L_i$, $i\in[m]$, where $L_i$ is the Lipschitz constant of $\nabla f_i$ in (3b). Also we let $f$ be the vector function with $f_i$ as the $i$-th component scalar function. That is,

$$l=[L_1,\ldots,L_m],\qquad f(x)=[f_1(x),\ldots,f_m(x)]. \tag{10}$$

Given a convex function $f$, $\tilde\nabla f(x)$ represents one subgradient of $f$ at $x$, namely,

$$f(\hat x)\ge f(x)+\langle\tilde\nabla f(x),\hat x-x\rangle,\quad\forall\,\hat x,$$

and $\partial f(x)$ denotes its subdifferential, i.e., the set of all subgradients. When $f$ is differentiable, we simply write its subgradient as $\nabla f(x)$. For a convex set $X$, we use $\iota_X$ as its indicator function, i.e.,

$$\iota_X(x)=\begin{cases}0, & \text{if } x\in X,\\ +\infty, & \text{if } x\notin X,\end{cases}$$

and $N_X(x)=\partial\iota_X(x)$ as its normal cone at $x\in X$.

Given an $\varepsilon>0$, the $\varepsilon$-optimal solution of (1) is defined as follows.

Definition 1 ($\varepsilon$-optimal solution) Let $f_0^*$ be the optimal value of (1). Given $\varepsilon\ge 0$, a point $x\in X$ is called an $\varepsilon$-optimal solution to (1) if

$$|f_0(x)-f_0^*|\le\varepsilon,\quad\text{and}\quad \|Ax-b\|+\|[f(x)]_+\|\le\varepsilon.$$

1.5 Outline

The rest of the paper is organized as follows. In section 2, we give a few preparatory results and review Nesterov's optimal first-order method for solving a composite convex program. An ergodic convergence rate result of iALM is given in section 3, and a nonergodic convergence rate result is shown in section 4. Iteration complexity results in terms of the number of gradient evaluations are established for both ergodic and nonergodic cases. Related works are reviewed and compared in section 5, and numerical results are provided in section 6. Finally, section 7 concludes the paper.

2 Preliminary results and Nesterov's optimal first-order method

In this section, we give a few preliminary results and also review Nesterov's optimal first-order method for composite convex programs.

2.1 Basic facts

A point $(x^*,y^*,z^*)$ satisfies the Karush-Kuhn-Tucker (KKT) conditions for (1) if

$$0\in\partial f_0(x^*)+N_X(x^*)+A^\top y^*+\sum_{i=1}^m z_i^*\nabla f_i(x^*), \tag{11a}$$

$$Ax^*=b,\quad x^*\in X, \tag{11b}$$

$$z_i^*\ge 0,\quad f_i(x^*)\le 0,\quad z_i^*f_i(x^*)=0,\quad\forall\, i\in[m]. \tag{11c}$$

From the convexity of the $f_i$'s, if $(x^*,y^*,z^*)$ is a KKT point, then

$$f_0(x)-f_0(x^*)+\langle y^*,Ax-b\rangle+\sum_{i=1}^m z_i^*f_i(x)\ge 0,\quad\forall\, x\in X. \tag{12}$$

The result below will be used to establish the convergence rate of Algorithm 1.

Lemma 1 Assume $(x^*,y^*,z^*)$ satisfies the KKT conditions in (11). Let $\bar x$ be a point such that for any $y$ and any $z\ge 0$,

$$f_0(\bar x)-f_0(x^*)+y^\top(A\bar x-b)+\sum_{i=1}^m z_if_i(\bar x)\le\alpha+c_1\|y\|^2+c_2\|z\|^2, \tag{13}$$

where $\alpha$ and $c_1,c_2$ are nonnegative constants independent of $y$ and $z$. Then

$$-\big(\alpha+4c_1\|y^*\|^2+4c_2\|z^*\|^2\big)\le f_0(\bar x)-f_0(x^*)\le\alpha, \tag{14}$$

$$\|A\bar x-b\|+\|[f(\bar x)]_+\|\le\alpha+c_1\big(1+\|y^*\|\big)^2+c_2\big(1+\|z^*\|\big)^2. \tag{15}$$

Proof. Letting $y=0$ and $z=0$ in (13) gives the second inequality in (14). For any nonnegative $\gamma_y$ and $\gamma_z$, we let

$$y=\gamma_y\frac{A\bar x-b}{\|A\bar x-b\|},\qquad z=\gamma_z\frac{[f(\bar x)]_+}{\|[f(\bar x)]_+\|}$$

and have from (13), by using the convention $\frac{0}{0}=0$, that

$$f_0(\bar x)-f_0(x^*)+\gamma_y\|A\bar x-b\|+\gamma_z\|[f(\bar x)]_+\|\le\alpha+c_1\gamma_y^2+c_2\gamma_z^2. \tag{16}$$

Noting

$$-\langle y^*,A\bar x-b\rangle\ge-\|y^*\|\cdot\|A\bar x-b\|,\qquad -\sum_{i=1}^m z_i^*f_i(\bar x)\ge-\|z^*\|\cdot\|[f(\bar x)]_+\|, \tag{17}$$

we have from (12) and (16) that

$$\big(\gamma_y-\|y^*\|\big)\|A\bar x-b\|+\big(\gamma_z-\|z^*\|\big)\|[f(\bar x)]_+\|\le\alpha+c_1\gamma_y^2+c_2\gamma_z^2.$$

In the above inequality, letting $\gamma_y=1+\|y^*\|$ and $\gamma_z=1+\|z^*\|$ gives (15), and letting $\gamma_y=2\|y^*\|$ and $\gamma_z=2\|z^*\|$ gives the first inequality in (14) by (12) and (17).

2.2 Nesterov's optimal first-order method

In this subsection, we review Nesterov's optimal first-order method for composite convex programs. The method will be used to approximately solve the $x$-subproblems in Algorithm 1. It aims at finding a solution of the following problem

$$\min_x\ \phi(x)+\psi(x), \tag{18}$$

where $\phi$ is a Lipschitz differentiable and strongly convex function with gradient Lipschitz constant $L_\phi$ and strong convexity modulus $\mu\ge 0$, and $\psi$ is a simple (possibly nondifferentiable) closed convex function. Algorithm 2 summarizes the method. Here, for simplicity, we assume $L_\phi$ and $\mu$ are known. The method does not require the value of $L_\phi$ but can estimate it by backtracking. In addition, it only requires a lower estimate of $\mu$; see [7] for example.

The theorem below gives the convergence rate of Algorithm 2 for both the convex (i.e., $\mu=0$) and strongly convex (i.e., $\mu>0$) cases; see [1,6,7]. We will use the results to estimate the iteration complexity of iALM.

Theorem 2 Let $\{x^k\}$ be the sequence generated from Algorithm 2. Assume $x^*$ to be a minimizer of (18). The following results hold:

1. If $\mu=0$ and $\alpha_0=1$, then
$$\phi(x^k)+\psi(x^k)-\phi(x^*)-\psi(x^*)\le\frac{2L_\phi\|x^0-x^*\|^2}{k^2},\quad\forall\, k\ge 1. \tag{19}$$

2. If $\mu>0$ and $\alpha_0=\sqrt{\frac{\mu}{L_\phi}}$, then
$$\phi(x^k)+\psi(x^k)-\phi(x^*)-\psi(x^*)\le\frac{L_\phi+\mu}{2}\|x^0-x^*\|^2\Big(1-\sqrt{\frac{\mu}{L_\phi}}\Big)^k,\quad\forall\, k\ge 1. \tag{20}$$
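The accelerated scheme reviewed in this subsection can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the paper's code: the test problem (a diagonal quadratic plus an $\ell_1$ term, whose minimizer is a component-wise soft-thresholding), the helper names `soft`, `grad` and `prox`, and the constants are all hypothetical. The proximal step uses the fact that minimizing $\langle\nabla\phi(\hat x),x\rangle+\frac{L_\phi}{2}\|x-\hat x\|^2+\psi(x)$ is the proximal mapping of $\psi/L_\phi$ at $\hat x-\nabla\phi(\hat x)/L_\phi$.

```python
import math


def soft(v, t):
    """Soft-thresholding: proximal mapping of t * |.| at the scalar v."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def nesterov(grad, prox, x0, L, mu=0.0, iters=200):
    """Accelerated proximal gradient with known L and mu (constant step 1/L)."""
    q = mu / L
    alpha = math.sqrt(q) if mu > 0 else 1.0   # alpha_0 as in the two cases of Theorem 2
    x, xh = list(x0), list(x0)                # x^k and the extrapolated point
    for _ in range(iters):
        g = grad(xh)
        x_new = prox([xi - gi / L for xi, gi in zip(xh, g)], 1.0 / L)
        a2 = alpha * alpha
        alpha_new = 0.5 * (q - a2 + math.sqrt((q - a2) ** 2 + 4.0 * a2))
        w = alpha * (1.0 - alpha) / (a2 + alpha_new)   # extrapolation weight
        xh = [xn + w * (xn - xo) for xn, xo in zip(x_new, x)]
        x, alpha = x_new, alpha_new
    return x


# Hypothetical test problem: phi(x) = 0.5 * sum d_i (x_i - c_i)^2, psi(x) = LAM * ||x||_1.
D_DIAG = [1.0, 4.0, 10.0]
C = [3.0, -2.0, 0.05]
LAM = 1.0


def grad(x):
    return [d * (xi - c) for d, xi, c in zip(D_DIAG, x, C)]


def prox(v, t):
    return [soft(vi, t * LAM) for vi in v]
```

Because the problem is separable, its minimizer is `soft(c_i, LAM / d_i)` in each coordinate, so a run such as `nesterov(grad, prox, [0.0, 0.0, 0.0], L=10.0, mu=1.0)` can be compared against that closed form.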

Algorithm 2: Nesterov's optimal first-order method for (18)
1  Initialization: choose $\hat x^0=x^0$, $\alpha_0\in(0,1]$, and let $q=\frac{\mu}{L_\phi}$;
2  for $k=0,1,\ldots$ do
3      Let
       $$x^{k+1}=\arg\min_x\ \langle\nabla\phi(\hat x^k),x\rangle+\frac{L_\phi}{2}\|x-\hat x^k\|^2+\psi(x).$$
4      Set
       $$\alpha_{k+1}=\frac{q-\alpha_k^2+\sqrt{(q-\alpha_k^2)^2+4\alpha_k^2}}{2},$$
       and
       $$\hat x^{k+1}=x^{k+1}+\frac{\alpha_k(1-\alpha_k)}{\alpha_k^2+\alpha_{k+1}}(x^{k+1}-x^k).$$

3 Ergodic convergence rate and iteration complexity results

In this section, we first establish an ergodic convergence rate result of Algorithm 1. From that result, we then specify algorithm parameters and estimate the total number of gradient evaluations needed to produce an $\varepsilon$-optimal solution. Two different settings of the penalty parameters are studied: one with a constant penalty and another with geometrically increasing penalty parameters. For each setting, the tolerance error parameter $\varepsilon_k$ is chosen in an optimal way so that the total number of gradient evaluations is minimized.

Throughout this section, we make the following assumptions.

Assumption 1 There exists a point $(x^*,y^*,z^*)$ satisfying the KKT conditions in (11).

Assumption 2 For every $k$, there is $x^{k+1}$ satisfying (7).

The first assumption holds if a certain regularity condition is satisfied, such as the Slater condition (namely, there is an interior point $x$ of $X$ such that $Ax=b$ and $f_i(x)<0$, $\forall\, i\in[m]$). The second assumption is for the well-definedness of the algorithm. It holds if $X$ is compact and the $f_i$'s are continuous.

3.1 Convergence rate analysis of iALM

To show the convergence results of Algorithm 1, we first establish a few lemmas.

Lemma 2 Let $y$ and $z$ be updated by (8) and (9) respectively. Then for any $k$ and any vectors $y$ and $z$, it holds that

$$\frac{1}{2\rho_k}\Big[\|y^{k+1}-y\|^2-\|y^k-y\|^2+\|y^{k+1}-y^k\|^2\Big]-\langle y^{k+1}-y,r^{k+1}\rangle=0, \tag{21}$$

$$\frac{1}{2\rho_k}\Big[\|z^{k+1}-z\|^2-\|z^k-z\|^2+\|z^{k+1}-z^k\|^2\Big]-\sum_{i=1}^m(z_i^{k+1}-z_i)\cdot\max\Big(-\frac{z_i^k}{\beta_k},f_i(x^{k+1})\Big)=0, \tag{22}$$

where $r^k=Ax^k-b$.

Proof. Using the equality $2u^\top v=\|u\|^2-\|u-v\|^2+\|v\|^2$, we have the results from the updates (8) and (9).
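For completeness, the step behind (21) can be written out as follows; this is a worked restatement of the proof using only the update (8), with $u=y^{k+1}-y$ and $v=y^{k+1}-y^k$ (so that $u-v=y^k-y$). The identity (22) follows in the same way from (9).

$$\begin{aligned}
\langle y^{k+1}-y,\ r^{k+1}\rangle
&=\frac{1}{\rho_k}\langle y^{k+1}-y,\ y^{k+1}-y^k\rangle && \text{since } y^{k+1}-y^k=\rho_k r^{k+1} \text{ by (8)}\\
&=\frac{1}{2\rho_k}\Big[\|y^{k+1}-y\|^2-\|y^k-y\|^2+\|y^{k+1}-y^k\|^2\Big] && \text{by } 2u^\top v=\|u\|^2-\|u-v\|^2+\|v\|^2 .
\end{aligned}$$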

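Lemma 3 For any $z\ge 0$, we have

$$\sum_{i=1}^m\Big([z_i^k+\beta_kf_i(x^{k+1})]_+-z_i\Big)f_i(x^{k+1})-\sum_{i=1}^m(z_i^{k+1}-z_i)\cdot\max\Big(-\frac{z_i^k}{\beta_k},f_i(x^{k+1})\Big)\ge\frac{\beta_k-\rho_k}{\rho_k^2}\|z^{k+1}-z^k\|^2. \tag{23}$$

Proof. Denote

$$I_+^k=\{i\in[m]: z_i^k+\beta_kf_i(x^{k+1})\ge 0\},\qquad I_-^k=[m]\setminus I_+^k. \tag{24}$$

Then

$$\begin{aligned}
\text{the left-hand side of (23)}
&=\sum_{i\in I_+^k}\Big[(z_i^k-z_i)f_i(x^{k+1})+\beta_k[f_i(x^{k+1})]^2-\big(z_i^k+\rho_kf_i(x^{k+1})-z_i\big)f_i(x^{k+1})\Big]\\
&\quad+\sum_{i\in I_-^k}\Big[-z_if_i(x^{k+1})+\Big(z_i^k-\frac{\rho_k}{\beta_k}z_i^k-z_i\Big)\frac{z_i^k}{\beta_k}\Big]\\
&=(\beta_k-\rho_k)\sum_{i\in I_+^k}[f_i(x^{k+1})]^2+\sum_{i\in I_-^k}\Big[-z_i\Big(f_i(x^{k+1})+\frac{z_i^k}{\beta_k}\Big)+\frac{\beta_k-\rho_k}{\beta_k^2}(z_i^k)^2\Big]\\
&\ge(\beta_k-\rho_k)\sum_{i\in I_+^k}[f_i(x^{k+1})]^2+\frac{\beta_k-\rho_k}{\beta_k^2}\sum_{i\in I_-^k}(z_i^k)^2\\
&=\frac{\beta_k-\rho_k}{\rho_k^2}\|z^{k+1}-z^k\|^2,
\end{aligned}$$

where the inequality follows from $z_i\ge 0$ and $f_i(x^{k+1})+\frac{z_i^k}{\beta_k}\le 0$, $\forall\, i\in I_-^k$, and the last equality holds due to the update (9).

The next theorem is a fundamental result obtained by running one iteration of Algorithm 1.

Theorem 3 (One-iteration progress of iALM) Let $\{(x^k,y^k,z^k)\}$ be the sequence generated from Algorithm 1. Then for any $x\in X$ such that $Ax=b$ and $f_i(x)\le 0$, $\forall\, i\in[m]$, any $y$, and any $z\ge 0$, it holds that

$$\begin{aligned}
&f_0(x^{k+1})-f_0(x)+y^\top r^{k+1}+\sum_{i=1}^m z_if_i(x^{k+1})+\frac{\beta_k-\rho_k}{2}\|r^{k+1}\|^2+\frac{\beta_k-\rho_k}{2\rho_k^2}\|z^{k+1}-z^k\|^2\\
&\quad+\frac{1}{2\rho_k}\|y^{k+1}-y\|^2+\frac{1}{2\rho_k}\|z^{k+1}-z\|^2\ \le\ \frac{1}{2\rho_k}\|y^k-y\|^2+\frac{1}{2\rho_k}\|z^k-z\|^2+\varepsilon_k.
\end{aligned} \tag{25}$$

Proof. From (7), it follows that for any $x\in X$ such that $Ax=b$,

$$f_0(x^{k+1})+\langle y^k,r^{k+1}\rangle+\frac{\beta_k}{2}\|r^{k+1}\|^2+\Psi_{\beta_k}(x^{k+1},z^k)-f_0(x)-\Psi_{\beta_k}(x,z^k)\le\varepsilon_k. \tag{26}$$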

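Since $\langle y^k,r^{k+1}\rangle=\langle y^{k+1}-y,r^{k+1}\rangle+\langle y,r^{k+1}\rangle-\rho_k\|r^{k+1}\|^2$, by adding (21) and (22) to the above inequality, we have

$$\begin{aligned}
&f_0(x^{k+1})-f_0(x)+y^\top r^{k+1}+\sum_{i=1}^m z_if_i(x^{k+1})+\sum_{i=1}^m\Big([z_i^k+\beta_kf_i(x^{k+1})]_+-z_i\Big)f_i(x^{k+1})\\
&\quad-\sum_{i=1}^m(z_i^{k+1}-z_i)\cdot\max\Big(-\frac{z_i^k}{\beta_k},f_i(x^{k+1})\Big)+\frac{\beta_k-2\rho_k}{2}\|r^{k+1}\|^2\\
&\quad+\Psi_{\beta_k}(x^{k+1},z^k)-\sum_{i=1}^m[z_i^k+\beta_kf_i(x^{k+1})]_+f_i(x^{k+1})-\Psi_{\beta_k}(x,z^k)\\
&\quad+\frac{1}{2\rho_k}\Big[\|y^{k+1}-y\|^2-\|y^k-y\|^2+\|y^{k+1}-y^k\|^2\Big]+\frac{1}{2\rho_k}\Big[\|z^{k+1}-z\|^2-\|z^k-z\|^2+\|z^{k+1}-z^k\|^2\Big]\le\varepsilon_k.
\end{aligned} \tag{27}$$

Note that

$$\begin{aligned}
&\Psi_{\beta_k}(x^{k+1},z^k)-\sum_{i=1}^m[z_i^k+\beta_kf_i(x^{k+1})]_+f_i(x^{k+1})\\
&=\sum_{i\in I_+^k}\Big[z_i^kf_i(x^{k+1})+\frac{\beta_k}{2}[f_i(x^{k+1})]^2-\big(z_i^k+\beta_kf_i(x^{k+1})\big)f_i(x^{k+1})\Big]-\sum_{i\in I_-^k}\frac{(z_i^k)^2}{2\beta_k}\\
&=-\frac{\beta_k}{2}\sum_{i\in I_+^k}[f_i(x^{k+1})]^2-\frac{\beta_k}{2}\sum_{i\in I_-^k}\Big[\frac{z_i^k}{\beta_k}\Big]^2
=-\frac{\beta_k}{2}\sum_{i=1}^m\Big[\max\Big(-\frac{z_i^k}{\beta_k},f_i(x^{k+1})\Big)\Big]^2
=-\frac{\beta_k}{2\rho_k^2}\|z^{k+1}-z^k\|^2,
\end{aligned} \tag{28}$$

where the sets $I_+^k$ and $I_-^k$ are defined in (24). In addition, if $f_i(x)\le 0$, $\forall\, i\in[m]$, then $\Psi_{\beta_k}(x,z^k)\le 0$. Hence, plugging (23) and (28) into (27) and using $\|y^{k+1}-y^k\|^2=\rho_k^2\|r^{k+1}\|^2$ yields (25).

By Lemma 1 and Theorem 3, we have the following convergence rate estimate of Algorithm 1.

Theorem 4 (Ergodic convergence rate of iALM) Under Assumptions 1 and 2, let $\{(x^k,y^k,z^k)\}$ be the sequence generated from Algorithm 1 with $y^0=0$, $z^0=0$ and $0<\rho_k\le\beta_k$, $\forall\, k$. Then

$$\big|f_0(\bar x^K)-f_0(x^*)\big|\le\frac{1}{\sum_{t=0}^{K-1}\rho_t}\Big(2\|y^*\|^2+2\|z^*\|^2+\sum_{k=0}^{K-1}\rho_k\varepsilon_k\Big), \tag{29a}$$

$$\|A\bar x^K-b\|+\big\|[f(\bar x^K)]_+\big\|\le\frac{1}{\sum_{t=0}^{K-1}\rho_t}\Big(\frac{(1+\|y^*\|)^2}{2}+\frac{(1+\|z^*\|)^2}{2}+\sum_{k=0}^{K-1}\rho_k\varepsilon_k\Big), \tag{29b}$$

where

$$\bar x^K=\frac{\sum_{t=0}^{K-1}\rho_tx^{t+1}}{\sum_{t=0}^{K-1}\rho_t}. \tag{30}$$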

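Proof. Since $\rho_k\le\beta_k$, the two terms $\frac{\beta_k-\rho_k}{2}\|r^{k+1}\|^2$ and $\frac{\beta_k-\rho_k}{2\rho_k^2}\|z^{k+1}-z^k\|^2$ are nonnegative. Dropping them and multiplying both sides of (25) by $\rho_k$ yields

$$\rho_k\Big[f_0(x^{k+1})-f_0(x)+y^\top r^{k+1}+\sum_{i=1}^m z_if_i(x^{k+1})\Big]+\frac12\|y^{k+1}-y\|^2+\frac12\|z^{k+1}-z\|^2\le\frac12\|y^k-y\|^2+\frac12\|z^k-z\|^2+\rho_k\varepsilon_k. \tag{31}$$

Summing up (31) over $k=0,\ldots,K-1$ with $x=x^*$ gives

$$\sum_{k=0}^{K-1}\rho_k\Big[f_0(x^{k+1})-f_0(x^*)+y^\top r^{k+1}+\sum_{i=1}^m z_if_i(x^{k+1})\Big]+\frac12\|y^K-y\|^2+\frac12\|z^K-z\|^2\le\frac12\|y^0-y\|^2+\frac12\|z^0-z\|^2+\sum_{k=0}^{K-1}\rho_k\varepsilon_k. \tag{32}$$

By the convexity of the $f_i$'s and the nonnegativity of $z$, we have

$$f_0(\bar x^K)-f_0(x^*)+y^\top(A\bar x^K-b)+\sum_{i=1}^m z_if_i(\bar x^K)\le\frac{1}{\sum_{t=0}^{K-1}\rho_t}\sum_{k=0}^{K-1}\rho_k\Big[f_0(x^{k+1})-f_0(x^*)+y^\top r^{k+1}+\sum_{i=1}^m z_if_i(x^{k+1})\Big],$$

which together with (32) and $y^0=0$, $z^0=0$ implies

$$f_0(\bar x^K)-f_0(x^*)+y^\top(A\bar x^K-b)+\sum_{i=1}^m z_if_i(\bar x^K)\le\frac{1}{\sum_{t=0}^{K-1}\rho_t}\Big(\frac12\|y\|^2+\frac12\|z\|^2+\sum_{k=0}^{K-1}\rho_k\varepsilon_k\Big).$$

The results thus follow from Lemma 1 with $\alpha=\frac{\sum_{k=0}^{K-1}\rho_k\varepsilon_k}{\sum_{k=0}^{K-1}\rho_k}$, $c_1=\frac{1}{2\sum_{k=0}^{K-1}\rho_k}$, $c_2=\frac{1}{2\sum_{k=0}^{K-1}\rho_k}$, and we complete the proof.

Remark 1 Note that if $\rho_k\equiv\rho>0$ and $\{\varepsilon_k\}$ is summable, then a sublinear convergence result follows from (29). The work [31] has also analyzed the convergence of Algorithm 1 through the augmented dual function $d_\beta$. However, it requires $\sum_{k=1}^\infty\sqrt{\varepsilon_k}<\infty$, which is strictly stronger than the condition $\sum_{k=1}^\infty\varepsilon_k<\infty$.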
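The two quantities bounded in Theorem 4 are straightforward to evaluate in code. Below is a minimal sketch of the weighted (ergodic) average in (30) and of the $\varepsilon$-optimality test of Definition 1; the function names and the list-based data layout are assumptions made for illustration, and the optimal value `f0_star` must be supplied by the caller.

```python
import math


def ergodic_average(points, weights):
    """Weighted average sum_k rho_k x^{k+1} / sum_t rho_t, as in (30)."""
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) / total for j in range(dim)]


def violation(x, A, b, fs):
    """Feasibility violation ||Ax - b|| + ||[f(x)]_+|| from Definition 1."""
    eq = math.sqrt(sum((sum(a * xj for a, xj in zip(row, x)) - bi) ** 2
                       for row, bi in zip(A, b)))
    ineq = math.sqrt(sum(max(0.0, f(x)) ** 2 for f in fs))
    return eq + ineq


def is_eps_optimal(x, f0, f0_star, A, b, fs, eps):
    """Check both conditions of Definition 1 for a candidate point x."""
    return abs(f0(x) - f0_star) <= eps and violation(x, A, b, fs) <= eps
```

For instance, with `A = [[1.0, 1.0]]`, `b = [2.0]`, a single inequality `lambda x: 1.5 - x[0]`, and objective `lambda x: x[0]**2 + x[1]**2` (optimal value 2.5), averaging the points `[1.4, 0.6]` and `[1.6, 0.4]` with equal weights gives a point that passes the test.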

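3.2 Iteration complexity of iALM

In this subsection, we apply Nesterov's optimal first-order method to each $x$-subproblem (7) and estimate the total number of gradient evaluations needed to produce an $\varepsilon$-optimal solution of (1). Note that the convergence rate results in Theorem 4 do not assume specific structures of (1) except convexity. If the problem (1) has richer structures than those in (3), more efficient methods can be applied to the subproblems in (7).

The following results are easy to show from the Lipschitz differentiability of $g$ and $f_i$, $i\in[m]$.

Proposition 1 Assume (3a), (3b), and the boundedness of $\mathrm{dom}(h)\cap X$. Then there exist constants $B_1,\ldots,B_m$ such that

$$\max\big(|f_i(x)|,\ \|\nabla f_i(x)\|\big)\le B_i,\quad\forall\, x\in\mathrm{dom}(h)\cap X,\ \forall\, i\in[m], \tag{33a}$$

$$|f_i(\hat x)-f_i(\tilde x)|\le B_i\|\hat x-\tilde x\|,\quad\forall\,\hat x,\tilde x\in\mathrm{dom}(h)\cap X,\ \forall\, i\in[m]. \tag{33b}$$

Let the smooth part of $L_\beta$ be denoted as

$$F_\beta(x,y,z)=L_\beta(x,y,z)-h(x).$$

Based on (33), we are able to show Lipschitz continuity of $\nabla_xF_\beta(x,y,z)$ with respect to $x$ for every $(y,z)$.

Lemma 4 Assume (3a), (3b), and the boundedness of $\mathrm{dom}(h)\cap X$. Let the $B_i$'s be given in Proposition 1. Then $\nabla_xF_{\beta_k}(x,y^k,z^k)$ is Lipschitz continuous on $\mathrm{dom}(h)\cap X$ in terms of $x$ with constant

$$L(z^k)=L_0+\beta_k\|A^\top A\|+\sum_{i=1}^m\Big(\beta_kB_i(B_i+L_i)+L_i|z_i^k|\Big). \tag{34}$$

Proof. For ease of description, we let $\beta=\beta_k$ and $(y,z)=(y^k,z^k)$ in the proof. First we notice that $\nabla_u\psi_\beta(u,v)=[\beta u+v]_+$, and thus for any $v$,

$$|\nabla_u\psi_\beta(\hat u,v)-\nabla_u\psi_\beta(\tilde u,v)|\le\beta|\hat u-\tilde u|,\quad\forall\,\hat u,\tilde u.$$

Let $h_i(x,z_i)=\psi_\beta(f_i(x),z_i)$, $i=1,\ldots,m$. Then

$$\begin{aligned}
&\|\nabla_xh_i(\hat x,z_i)-\nabla_xh_i(\tilde x,z_i)\|\\
&=\big\|\nabla_u\psi_\beta(f_i(\hat x),z_i)\nabla f_i(\hat x)-\nabla_u\psi_\beta(f_i(\tilde x),z_i)\nabla f_i(\tilde x)\big\|\\
&\le\big\|\nabla_u\psi_\beta(f_i(\hat x),z_i)\nabla f_i(\hat x)-\nabla_u\psi_\beta(f_i(\tilde x),z_i)\nabla f_i(\hat x)\big\|+\big\|\nabla_u\psi_\beta(f_i(\tilde x),z_i)\nabla f_i(\hat x)-\nabla_u\psi_\beta(f_i(\tilde x),z_i)\nabla f_i(\tilde x)\big\|\\
&\le\beta|f_i(\hat x)-f_i(\tilde x)|\cdot\|\nabla f_i(\hat x)\|+|\nabla_u\psi_\beta(f_i(\tilde x),z_i)|\cdot\|\nabla f_i(\hat x)-\nabla f_i(\tilde x)\|\\
&\le\beta B_i^2\|\hat x-\tilde x\|+L_i\big(\beta B_i+|z_i|\big)\|\hat x-\tilde x\|.
\end{aligned}$$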

Hence,

$$\begin{aligned}
\|\nabla_xF_\beta(\hat x,y,z)-\nabla_xF_\beta(\tilde x,y,z)\|
&\le\|\nabla g(\hat x)-\nabla g(\tilde x)\|+\beta\|A^\top A(\hat x-\tilde x)\|+\sum_{i=1}^m\|\nabla_xh_i(\hat x,z_i)-\nabla_xh_i(\tilde x,z_i)\|\\
&\le\Big(L_0+\beta\|A^\top A\|+\sum_{i=1}^m\big[\beta B_i^2+L_i(\beta B_i+|z_i|)\big]\Big)\|\hat x-\tilde x\|,
\end{aligned}$$

which completes the proof.

Therefore, letting $\phi(x)=F_{\beta_k}(x,y^k,z^k)$ and $\psi(x)=h(x)+\iota_X(x)$, we can apply Nesterov's optimal first-order method in Algorithm 2 to find $x^{k+1}$ in (7). From Theorem 2, we have the following results. Note that if the strong convexity constant $\mu=0$, the problem is just convex.

Lemma 5 Assume that $g$ is strongly convex with modulus $\mu\ge 0$. Given $\varepsilon_k>0$, if we start from $x^k$ and run Algorithm 2, then at most $t_k$ iterations are needed to produce $x^{k+1}$ such that (7) holds, where

$$t_k=\begin{cases}
\left\lceil\dfrac{\mathrm{dist}(x^k,X_k^*)\sqrt{2L(z^k)}}{\sqrt{\varepsilon_k}}\right\rceil, & \text{if } \mu=0,\\[12pt]
\left\lceil\dfrac{\log\Big(\frac{L(z^k)+\mu}{2\varepsilon_k}[\mathrm{dist}(x^k,X_k^*)]^2\Big)}{\log\Big(1\big/\big(1-\sqrt{\mu/L(z^k)}\big)\Big)}\right\rceil, & \text{if } \mu>0,
\end{cases} \tag{35}$$

and $X_k^*$ denotes the set of optimal solutions to $\min_{x\in X}L_{\beta_k}(x,y^k,z^k)$.

Below we specify the sequences $\{\beta_k\}$, $\{\rho_k\}$ and $\{\varepsilon_k\}$ for a given $\varepsilon>0$, and through combining Theorem 4 and Lemma 5, we give the iteration complexity results of iALM for producing an $\varepsilon$-optimal solution. We study two cases. In the first case, a constant penalty parameter is used, and in the second case, we geometrically increase $\beta_k$ and $\rho_k$.

Given $\varepsilon>0$, we set $\{\beta_k\}$ and $\{\rho_k\}$ according to one of the following:

Setting 1 (constant penalty) Let $K$ be a positive integer and $C_\beta$ a positive real number. Set

$$\rho_k=\beta_k=\frac{C_\beta}{K\varepsilon},\quad\forall\, 0\le k<K.$$

Setting 2 (geometrically increasing penalty) Let $K$ be a positive integer, $C_\beta$ a positive real number, and $\sigma>1$. Set

$$\beta_g=\frac{C_\beta}{\varepsilon}\cdot\frac{\sigma-1}{\sigma^K-1}, \tag{36}$$

and

$$\rho_k=\beta_k=\beta_g\sigma^k,\quad\forall\, 0\le k<K.$$
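The two penalty schedules are easy to generate, and both are normalized so that the penalties sum to $C_\beta/\varepsilon$. The following is a small illustrative sketch; the function names and the sample values in the usage note are hypothetical.

```python
def constant_schedule(C_beta, eps, K):
    """Setting 1: rho_k = beta_k = C_beta / (K * eps) for 0 <= k < K."""
    return [C_beta / (K * eps)] * K


def geometric_schedule(C_beta, eps, K, sigma):
    """Setting 2: beta_k = beta_g * sigma**k with beta_g chosen as in (36)."""
    if sigma <= 1.0:
        raise ValueError("sigma must be greater than 1")
    beta_g = (C_beta / eps) * (sigma - 1.0) / (sigma ** K - 1.0)
    return [beta_g * sigma ** k for k in range(K)]
```

For example, `geometric_schedule(2.0, 1e-2, 5, 3.0)` returns five penalties that grow by a factor of 3 and sum to `2.0 / 1e-2`, up to rounding; with `K = 1` both functions return the single value `C_beta / eps`.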

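Note that if $K=1$, the above two settings are the same, and in this case, Algorithm 1 simply reduces to the quadratic penalty method. For either of the above two settings, we have

$$\sum_{k=0}^{K-1}\rho_k=\frac{C_\beta}{\varepsilon}.$$

From (34), we see that the Lipschitz constant depends on $z^k$. Hence, from (35), to solve the $x$-subproblem to the accuracy $\varepsilon_k$, the number of gradient evaluations will depend on $z^k$. Below we show that if $\varepsilon_k$ is sufficiently small, $z^k$ can be bounded and thus so is $L(z^k)$.

Lemma 6 Let $\{(x^k,y^k,z^k)\}_{k=0}^K$ be the sequence generated from Algorithm 1 with $\{\beta_k\}$ and $\{\rho_k\}$ set according to either Setting 1 or Setting 2. If $y^0=0$, $z^0=0$, and the $\varepsilon_k$'s are chosen such that

$$\sum_{k=0}^{K-1}\rho_k\varepsilon_k\le\frac{C_\varepsilon}{2}, \tag{37}$$

then

$$L(z^k)\le\bar L+\beta_kH,\quad\forall\, 0\le k\le K, \tag{38}$$

where

$$H=\|A^\top A\|+\sum_{i=1}^mB_i(B_i+L_i),\qquad\bar L=L_0+\|l\|\big(\|y^*\|+2\|z^*\|+\sqrt{C_\varepsilon}\big),$$

and $l$ is given in (10).

Proof. Letting $(x,y,z)=(x^*,y^*,z^*)$ in (31) and using (12), we have

$$\frac12\|y^{k+1}-y^*\|^2+\frac12\|z^{k+1}-z^*\|^2\le\frac12\|y^k-y^*\|^2+\frac12\|z^k-z^*\|^2+\rho_k\varepsilon_k.$$

Summing the above inequality yields

$$\frac12\|y^k-y^*\|^2+\frac12\|z^k-z^*\|^2\le\frac12\|y^0-y^*\|^2+\frac12\|z^0-z^*\|^2+\sum_{t=0}^{k-1}\rho_t\varepsilon_t,\quad\forall\, 0\le k\le K,$$

which implies

$$\|z^k-z^*\|^2\le\|y^0-y^*\|^2+\|z^0-z^*\|^2+2\sum_{t=0}^{k-1}\rho_t\varepsilon_t.$$

Since $\|u\|_2\le\|u\|_1$ for any vector $u$, we have from the above inequality that

$$\|z^k\|\le\|z^*\|+\|y^0-y^*\|+\|z^0-z^*\|+\sqrt{2\sum_{t=0}^{k-1}\rho_t\varepsilon_t},\quad\forall\, 0\le k\le K. \tag{39}$$

Hence, if $y^0=0$ and $z^0=0$, and (37) holds, it follows from the above inequality that

$$\|z^k\|\le\|y^*\|+2\|z^*\|+\sqrt{C_\varepsilon},\quad\forall\, 0\le k\le K. \tag{40}$$

By the Cauchy-Schwarz inequality, we have from (34) that for any $0\le k\le K$,

$$L(z^k)\le L_0+\beta_kH+\|z^k\|\cdot\|l\|,$$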

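which together with (40) gives the result in (38).

Optimal subproblem accuracy parameters. If $t_k$ gradient evaluations are required to produce $x^{k+1}$, then the total number of gradient evaluations is $T_K=\sum_{k=0}^{K-1}t_k$ to generate $\{x^k\}_{k=1}^K$. Given $\varepsilon>0$, and $\{\beta_k\}$, $\{\rho_k\}$ set according to either Setting 1 or Setting 2, we can choose $\{\varepsilon_k\}$ to minimize $T_K$ subject to the condition in (37). When $\mu=0$, we solve the following problem:

$$\min_{\boldsymbol\varepsilon>0}\ \sum_{k=0}^{K-1}\mathrm{dist}(x^k,X_k^*)\sqrt{\frac{2L(z^k)}{\varepsilon_k}},\quad\text{s.t. }\sum_{k=0}^{K-1}\beta_k\varepsilon_k\le\frac{C_\varepsilon}{2},$$

where $\boldsymbol\varepsilon=[\varepsilon_0,\ldots,\varepsilon_{K-1}]$. Through formulating the KKT system of the above problem, one can easily find the optimal $\boldsymbol\varepsilon$ given by

$$\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{[\mathrm{dist}(x^k,X_k^*)]^{2/3}[L(z^k)]^{1/3}\beta_k^{-2/3}}{\sum_{t=0}^{K-1}\beta_t^{1/3}[\mathrm{dist}(x^t,X_t^*)]^{2/3}[L(z^t)]^{1/3}},\quad\forall\, 0\le k<K.\tag{41}$$

When $\mu>0$, we solve the problem below:

$$\min_{\boldsymbol\varepsilon>0}\ \sum_{k=0}^{K-1}\sqrt{\frac{L(z^k)}{\mu}}\,\log\Big(\frac{L(z^k)+\mu}{2\varepsilon_k}[\mathrm{dist}(x^k,X_k^*)]^2\Big),\quad\text{s.t. }\sum_{k=0}^{K-1}\beta_k\varepsilon_k\le\frac{C_\varepsilon}{2},\tag{42}$$

whose optimal solution is given by

$$\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{\sqrt{L(z^k)}}{\beta_k\sum_{t=0}^{K-1}\sqrt{L(z^t)}},\quad\forall\, 0\le k<K.\tag{43}$$

Note that the summand in the objective of (42) is not exactly the same as that in the second inequality of (35). They are close if $\mu\ll L(z^k)$ since $\log(1+a)=a+o(a)$.

The optimal $\boldsymbol\varepsilon$ given in (41) and (43) depends on $\mathrm{dist}(x^k,X_k^*)$ and the future points $z^{k+1},\ldots,z^{K-1}$, which are unknown at iteration $k$. We do not assume these unknowns. Instead, we set $\varepsilon_k$ in two different ways. One way is to simply set

$$\varepsilon_k=\frac{\varepsilon}{2}\cdot\frac{C_\varepsilon}{C_\beta},\quad\forall\, 0\le k<K,\tag{44}$$

for both cases of $\mu=0$ and $\mu>0$. Another way is to let

$$\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{1}{\beta_k^{1/3}\sum_{t=0}^{K-1}\beta_t^{2/3}},\quad\forall\, 0\le k<K,\tag{45}$$

for the case of $\mu=0$, and

$$\varepsilon_k=\frac{C_\varepsilon}{2}\cdot\frac{1}{\sqrt{\beta_k}\sum_{t=0}^{K-1}\sqrt{\beta_t}},\quad\forall\, 0\le k<K,\tag{46}$$

for the case of $\mu>0$. We see that if $\beta_kH$ dominates $L_*$ and $\mathrm{dist}(x^k,X_k^*)$ is roughly the same for all $k$, then $\{\varepsilon_k\}$ in (45) and (46) well approximate those in (41) and (43). If $\{\beta_k\}$ and $\{\rho_k\}$ are set according to Setting 1, i.e., constant parameters, then the $\{\varepsilon_k\}$ in both (45) and (46) is constant as in (44). Plugging these parameters into (35), we have the following estimates on the total number of gradient evaluations.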

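Theorem 5 (Iteration complexity with constant penalty and constant error) For any given $\varepsilon>0$, let $K$ be a positive integer and $C_\beta, C_\varepsilon$ two positive real numbers. Set $\{\beta_k\}$ and $\{\rho_k\}$ according to Setting 1 and $\{\varepsilon_k\}$ by (44). Let $\bar x^K$ be given in (30). Then

$$|f_0(\bar x^K)-f_0(x^*)|\le\frac{\varepsilon}{C_\beta}\big(2\|y^*\|^2+2\|z^*\|^2\big)+\frac{\varepsilon C_\varepsilon}{2C_\beta},\tag{47a}$$

$$\|A\bar x^K-b\|+\|[f(\bar x^K)]_+\|\le\frac{\varepsilon}{2C_\beta}\big[(1+\|y^*\|)^2+(1+\|z^*\|)^2\big]+\frac{\varepsilon C_\varepsilon}{2C_\beta}.\tag{47b}$$

Assume $\mu\le\frac{L_0}{4}$. Then Algorithm 1 can produce $\bar x^K$ by evaluating gradients of $g, f_i, i\in[m]$ at most $T_K$ times, where

$$T_K\le 2DK\sqrt{\frac{C_\beta}{C_\varepsilon}}\left(\sqrt{\frac{L_*}{\varepsilon}}+\frac1\varepsilon\sqrt{\frac{C_\beta H}{K}}\right)+K,\quad\text{if }\mu=0,\tag{48}$$

and

$$T_K\le K\left(\sqrt{\frac{L_*}{\mu}}+\sqrt{\frac{C_\beta H}{\mu K\varepsilon}}\right)\log\left(\frac{D^2C_\beta}{C_\varepsilon\varepsilon}\Big(L_*+\mu+\frac{C_\beta H}{K\varepsilon}\Big)\right)+K,\quad\text{if }\mu>0.\tag{49}$$

Proof. The results in (47) directly follow from Theorem 4 and the settings of $\{\beta_k\}$, $\{\rho_k\}$, and $\{\varepsilon_k\}$. For the total number of gradient evaluations, we use the inequalities in (35). First, for the case of $\mu=0$, from the first inequality of (35) and the parameter setting, it follows that the total number of gradient evaluations satisfies

$$T_K\le\sum_{k=0}^{K-1}\mathrm{dist}(x^k,X_k^*)\sqrt{\frac{2\big(L_*+\frac{C_\beta H}{K\varepsilon}\big)}{\frac{\varepsilon}{2}\cdot\frac{C_\varepsilon}{C_\beta}}}+K.\tag{50}$$

Since $\sqrt{a+b}\le\sqrt a+\sqrt b$ for any two nonnegative numbers $a,b$, we have from the above inequality and by noting $\mathrm{dist}(x^k,X_k^*)\le D$ that

$$T_K\le\sum_{k=0}^{K-1}2D\sqrt{\frac{C_\beta}{C_\varepsilon}}\left(\sqrt{\frac{L_*}{\varepsilon}}+\sqrt{\frac{C_\beta H}{K\varepsilon^2}}\right)+K=2DK\sqrt{\frac{C_\beta}{C_\varepsilon}}\left(\sqrt{\frac{L_*}{\varepsilon}}+\frac1\varepsilon\sqrt{\frac{C_\beta H}{K}}\right)+K,$$

which gives (48).

For the case of $\mu>0$, we first note that for any $0<a\le1$, it holds $\log(1+a)\ge a-\frac{a^2}{2}$. Write $s_k=\sqrt{\mu/L(z^k)}$ for brevity. If $\mu\le\frac{L_0}{4}$, we have $\mu\le\frac{L(z^k)}{4}$, so $s_k\le\frac12$ and $\frac{s_k}{1-s_k}\le1$. Therefore,

$$\log\frac{1}{1-s_k}=\log\Big(1+\frac{s_k}{1-s_k}\Big)\ge\frac{s_k}{1-s_k}-\frac12\Big(\frac{s_k}{1-s_k}\Big)^2\ge s_k,$$

where the last inequality uses $s_k\le\frac12$, and thus

$$\frac{1}{\log\Big(1\big/\big(1-\sqrt{\mu/L(z^k)}\big)\Big)}\le\sqrt{\frac{L(z^k)}{\mu}}.\tag{51}$$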

Using the above inequality and the second inequality of (35), we have that the total number of gradient evaluations satisfies

$$T_K\le\sum_{k=0}^{K-1}\sqrt{\frac{L_*+\frac{C_\beta H}{K\varepsilon}}{\mu}}\,\log\left(\frac{L_*+\frac{C_\beta H}{K\varepsilon}+\mu}{\varepsilon C_\varepsilon/C_\beta}[\mathrm{dist}(x^k,X_k^*)]^2\right)+K.\tag{52}$$

Since $\sqrt{L_*+\frac{C_\beta H}{K\varepsilon}}\le\sqrt{L_*}+\sqrt{\frac{C_\beta H}{K\varepsilon}}$ and $\mathrm{dist}(x^k,X_k^*)\le D$, the above inequality implies (49). This completes the proof.

We make two observations below about the results in Theorem 5.

Remark 2 From the error bounds in (47), we see that if

$$C_\beta\ge\frac12\Big[\max\big\{4\|y^*\|^2+4\|z^*\|^2,\ (1+\|y^*\|)^2+(1+\|z^*\|)^2\big\}+C_\varepsilon\Big],\tag{53}$$

then $\bar x^K$ is an $\varepsilon$-optimal solution. Otherwise, the errors in (47) are multiples of $\varepsilon$.

If we represent $\varepsilon$ by the total number $t$ of gradient evaluations, we can obtain the convergence rate result in terms of $t$. Let $C_\beta=C_\varepsilon$ and $K=1$ in (48). Then the total number of gradient evaluations is about $t=2D\sqrt{\frac{L_*}{\varepsilon}}+\frac{2D}{\varepsilon}\sqrt{C_\beta H}$. By the quadratic formula, one can easily show that

$$\varepsilon=\left(\frac{D\sqrt{L_*}+\sqrt{L_*D^2+2Dt\sqrt{C_\beta H}}}{t}\right)^2\le\frac{4L_*D^2}{t^2}+\frac{4D\sqrt{C_\beta H}}{t}.$$

Let $\hat x^t=\bar x^K$ to specify the dependence of the iterate on the number of gradient evaluations. Plugging the above $\varepsilon$ into (47), we have

$$|f_0(\hat x^t)-f_0(x^*)|\le\left(\frac{2\|y^*\|^2+2\|z^*\|^2}{C_\beta}+\frac12\right)\left(\frac{4L_*D^2}{t^2}+\frac{4D\sqrt{C_\beta H}}{t}\right),\tag{54a}$$

$$\|A\hat x^t-b\|+\|[f(\hat x^t)]_+\|\le\left(\frac{(1+\|y^*\|)^2+(1+\|z^*\|)^2}{2C_\beta}+\frac12\right)\left(\frac{4L_*D^2}{t^2}+\frac{4D\sqrt{C_\beta H}}{t}\right).\tag{54b}$$

If there are no equality or inequality constraints, then $H=0$, $y^*=0$, $z^*=0$, and the rate of convergence in (54a) matches the optimal one in (19); if the objective $f_0(x)\equiv0$ and there are no inequality constraints, then $H=\|A^\top A\|$, $y^*=0$, $z^*=0$, $L_*=0$, and the rate of convergence with $C_\beta=2$ in (54b) roughly becomes $\|A\hat x^t-b\|\le\frac{\sqrt8\,D\sqrt{\|A^\top A\|}}{t}$, whose order is also optimal. Therefore, the order of the convergence rate in (54) is optimal, and so is the iteration complexity result in (48).

For the strongly convex case, if there are no equality or inequality constraints, the iteration complexity result in (49) is optimal by comparing it to (20). With the existence of constraints and a nonsmooth term in the objective, the order $1/\sqrt\varepsilon$ also appears to be optimal and is the best we can find in the literature; see the discussion in section 5.
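The quadratic-formula step in Remark 2 can be checked numerically. This is a small sketch under the assumption that the evaluation count for $K=1$ and $C_\beta=C_\varepsilon$ is $t=2D\sqrt{L_*/\varepsilon}+2D\sqrt{C_\beta H}/\varepsilon$ (the additive $K$ is ignored, as in the remark); the function names and the sample constants are hypothetical.

```python
import math


def evals_for_accuracy(eps, D, L_star, C_beta, H):
    """Approximate number of gradient evaluations t as a function of eps."""
    return 2.0 * D * math.sqrt(L_star / eps) + 2.0 * D * math.sqrt(C_beta * H) / eps


def accuracy_for_evals(t, D, L_star, C_beta, H):
    """Invert t(eps): solve the quadratic in 1/sqrt(eps) and return eps."""
    root = (D * math.sqrt(L_star)
            + math.sqrt(L_star * D * D + 2.0 * D * t * math.sqrt(C_beta * H))) / t
    return root * root


def accuracy_upper_bound(t, D, L_star, C_beta, H):
    """Simplified bound 4 L D^2 / t^2 + 4 D sqrt(C_beta H) / t."""
    return 4.0 * L_star * D * D / t ** 2 + 4.0 * D * math.sqrt(C_beta * H) / t


if __name__ == "__main__":
    # Illustrative constants (assumed).
    D, L_star, C_beta, H = 2.0, 5.0, 1.0, 3.0
    t = evals_for_accuracy(1e-3, D, L_star, C_beta, H)
    print(t, accuracy_for_evals(t, D, L_star, C_beta, H),
          accuracy_upper_bound(t, D, L_star, C_beta, H))
```

When `H` is zero the inversion reduces to $\varepsilon=4L_*D^2/t^2$, which is the unconstrained accelerated rate up to the factor coming from (47).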

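Remark 3 From both (48) and (49), we see that $T_1\le T_K,\ \forall K\ge1$, i.e., $K=1$ is the best. Note that if $y^0=0$, $z^0=0$, and $K=1$, Algorithm 1 reduces to the quadratic penalty method by solving a single penalty problem. However, practically $K>1$ could be better since $\mathrm{dist}(x^k,X_k^*)$ usually decreases as $k$ increases. Hence, from (50) or (52), $T_K$ can be smaller than $T_1$ if $K>1$; see our numerical results in Table 1.

Theorem 6 (Iteration complexity with geometrically increasing penalty and constant error) For any given $\varepsilon>0$, let $K$ be a positive integer and $C_\beta, C_\varepsilon$ two positive real numbers. Set $\{\beta_k\}$ and $\{\rho_k\}$ according to Setting 2 and $\{\varepsilon_k\}$ by (44). Assume $\mu\le\frac{L_0}{4}$. Let $\bar x^K$ be given in (30). Then the inequalities in (47) hold, and Algorithm 1 can produce $\bar x^K$ by evaluating gradients of $g, f_i, i\in[m]$ at most $T_K$ times, where

$$T_K\le 2D\sqrt{\frac{C_\beta}{C_\varepsilon}}\left(K\sqrt{\frac{L_*}{\varepsilon}}+\frac{\sqrt{C_\beta H(\sigma-1)}}{\varepsilon(\sqrt\sigma-1)}\right)+K,\quad\text{if }\mu=0,\tag{55}$$

and

$$T_K\le G_\varepsilon\left(K\sqrt{\frac{L_*}{\mu}}+\frac{\sqrt{HC_\beta(\sigma-1)}}{\sqrt{\mu\varepsilon}\,(\sqrt\sigma-1)}\right)+K,\quad\text{if }\mu>0,\tag{56}$$

where

$$G_\varepsilon=\log\frac{C_\beta D^2}{\varepsilon C_\varepsilon}+\log\Big(L_*+\mu+H\,\frac{C_\beta(\sigma-1)+\beta_g\varepsilon}{\sigma\varepsilon}\Big).$$

Proof. When $\mu=0$, we have from the first inequality in (35) that the total number of gradient evaluations satisfies

$$T_K\le\sum_{k=0}^{K-1}\mathrm{dist}(x^k,X_k^*)\sqrt{\frac{2(L_*+\beta_kH)}{\varepsilon_k}}+K.\tag{57}$$

Plugging into (57) the $\varepsilon_k$ given in (44) and noting $\mathrm{dist}(x^k,X_k^*)\le D$ yields

$$T_K\le 2D\sqrt{\frac{C_\beta}{C_\varepsilon\varepsilon}}\sum_{k=0}^{K-1}\sqrt{L_*+\beta_kH}+K.\tag{58}$$

Note that $\sum_{k=0}^{K-1}\sqrt{\beta_k}=\sqrt{\beta_g}\,\frac{\sigma^{K/2}-1}{\sqrt\sigma-1}$. From (36), it holds

$$\sigma^K=\frac{C_\beta(\sigma-1)}{\beta_g\varepsilon}+1,\tag{59}$$

and thus $\sigma^{K/2}-1\le\sqrt{\frac{C_\beta(\sigma-1)}{\beta_g\varepsilon}}$. Therefore,

$$\sum_{k=0}^{K-1}\sqrt{\beta_k}\le\frac{\sqrt{C_\beta(\sigma-1)}}{\sqrt\varepsilon\,(\sqrt\sigma-1)},\tag{60}$$

and using $\sqrt{L_*+\beta_kH}\le\sqrt{L_*}+\sqrt{\beta_kH}$, we have

$$\sum_{k=0}^{K-1}\sqrt{L_*+\beta_kH}\le\sum_{k=0}^{K-1}\big(\sqrt{L_*}+\sqrt{\beta_kH}\big)\le K\sqrt{L_*}+\frac{\sqrt{C_\beta H(\sigma-1)}}{\sqrt\varepsilon\,(\sqrt\sigma-1)},\tag{61}$$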

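which together with (58) gives (55).

For the strongly convex case, we use (51) and the second inequality of (35) to have

$$T_K\le\sum_{k=0}^{K-1}\sqrt{\frac{L_*+\beta_kH}{\mu}}\,\log\left(\frac{L_*+\beta_kH+\mu}{2\varepsilon_k}[\mathrm{dist}(x^k,X_k^*)]^2\right)+K.\tag{62}$$

Since $\mathrm{dist}(x^k,X_k^*)\le D$ and the $\varepsilon_k$'s are set to those in (44), the above inequality indicates

$$T_K\le\sum_{k=0}^{K-1}\sqrt{\frac{L_*+\beta_kH}{\mu}}\,\log\left(\frac{C_\beta D^2}{\varepsilon C_\varepsilon}\big(L_*+\beta_kH+\mu\big)\right)+K.\tag{63}$$

For $0\le k<K$,

$$\beta_k=\beta_g\sigma^k\le\frac{\beta_g}{\sigma}\sigma^K\overset{(59)}{=}\frac{\beta_g}{\sigma}\Big(\frac{C_\beta(\sigma-1)}{\beta_g\varepsilon}+1\Big)=\frac{C_\beta(\sigma-1)+\beta_g\varepsilon}{\sigma\varepsilon}.\tag{64}$$

Plugging into (63) the second inequality in (61) and the above bound on $\beta_k$, we have (56) and thus complete the proof.

Remark 4 Comparing the iteration complexity results in Theorems 5 and 6, we see that if $K=1$, the number $T_K$ in either case of $\mu=0$ or $\mu>0$ is the same for both penalty parameter settings as $\sigma\to\infty$. That is because when $K=1$, iALM with either of the two settings reduces to the penalty method. If $K>1$, the number $T_K$ for the setting of geometrically increasing penalty can be smaller than that for the constant parameter setting when $\sigma$ is big; see the numerical results in section 6.

Theorem 7 (Iteration complexity with geometrically increasing penalty and adaptive error) For any given $\varepsilon>0$, let $K$ be a positive integer and $C_\beta, C_\varepsilon$ two positive real numbers. Set $\{\beta_k\}$ and $\{\rho_k\}$ according to Setting 2. Assume $\mu\le\frac{L_0}{4}$. If $\mu=0$, set $\{\varepsilon_k\}$ as in (45), and if $\mu>0$, set $\{\varepsilon_k\}$ as in (46). Let $\bar x^K$ be given in (30). Then the inequalities in (47) hold, and Algorithm 1 can produce $\bar x^K$ by evaluating gradients of $g, f_i, i\in[m]$ at most $T_K$ times, where

$$T_K\le 2D\sqrt{\frac{C_\beta}{C_\varepsilon}}\left(\sqrt{\frac{L_*}{\varepsilon}}\cdot\frac{\sqrt{\sigma-1}}{(\sigma^{1/6}-1)\sqrt{\sigma^{2/3}-1}}+\frac{\sqrt{HC_\beta}\,(\sigma-1)}{\varepsilon\,(\sigma^{2/3}-1)^{3/2}}\right)+K,\quad\text{if }\mu=0,\tag{65}$$

and

$$T_K\le G_\varepsilon\left(K\sqrt{\frac{L_*}{\mu}}+\frac{\sqrt{HC_\beta(\sigma-1)}}{\sqrt{\mu\varepsilon}\,(\sqrt\sigma-1)}\right)+K,\quad\text{if }\mu>0,\tag{66}$$

where

$$G_\varepsilon=\log\frac{C_\beta D^2}{\varepsilon C_\varepsilon}+\log\Big(L_*+\mu+H\,\frac{C_\beta(\sigma-1)+\beta_g\varepsilon}{\sigma\varepsilon}\Big)+\log\frac{\sqrt{(\sigma-1)^2+\beta_g\varepsilon(\sigma-1)/C_\beta}}{\sigma-\sqrt\sigma}.$$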

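Proof. For the case of $\mu=0$, we have (57), plugging into which the $\varepsilon_k$ given in (45) yields

$$T_K\le\frac{2}{\sqrt{C_\varepsilon}}\sum_{k=0}^{K-1}\mathrm{dist}(x^k,X_k^*)\,\beta_k^{1/6}\sqrt{L_*+\beta_kH}\,\Big(\sum_{t=0}^{K-1}\beta_t^{2/3}\Big)^{1/2}+K.$$

Since $\mathrm{dist}(x^k,X_k^*)\le D$, the above inequality implies

$$T_K\le\frac{2D}{\sqrt{C_\varepsilon}}\Big(\sum_{t=0}^{K-1}\beta_t^{2/3}\Big)^{1/2}\sum_{k=0}^{K-1}\beta_k^{1/6}\sqrt{L_*+\beta_kH}+K.\tag{67}$$

Note that

$$\sum_{t=0}^{K-1}\beta_t^{2/3}=\sum_{t=0}^{K-1}\beta_g^{2/3}\sigma^{2t/3}=\beta_g^{2/3}\,\frac{\sigma^{2K/3}-1}{\sigma^{2/3}-1},$$

and

$$\sum_{k=0}^{K-1}\beta_k^{1/6}\sqrt{L_*+\beta_kH}\le\sum_{k=0}^{K-1}\beta_k^{1/6}\big(\sqrt{L_*}+\sqrt{\beta_kH}\big)=\sqrt{L_*}\,\beta_g^{1/6}\,\frac{\sigma^{K/6}-1}{\sigma^{1/6}-1}+\sqrt H\,\beta_g^{2/3}\,\frac{\sigma^{2K/3}-1}{\sigma^{2/3}-1}.$$

Hence, it follows from (67) that

$$T_K\le\frac{2D}{\sqrt{C_\varepsilon}}\,\beta_g^{1/3}\sqrt{\frac{\sigma^{2K/3}-1}{\sigma^{2/3}-1}}\left(\sqrt{L_*}\,\beta_g^{1/6}\,\frac{\sigma^{K/6}-1}{\sigma^{1/6}-1}+\sqrt H\,\beta_g^{2/3}\,\frac{\sigma^{2K/3}-1}{\sigma^{2/3}-1}\right)+K.\tag{68}$$

From (59) and the fact $\sqrt{a+b}\le\sqrt a+\sqrt b,\ \forall a,b\ge0$ (more generally, $(a+b)^p\le a^p+b^p$ for $p\in(0,1]$), it follows that

$$\sigma^{2K/3}-1\le\Big(\frac{C_\beta(\sigma-1)}{\beta_g\varepsilon}\Big)^{2/3},\qquad\sigma^{K/6}-1\le\Big(\frac{C_\beta(\sigma-1)}{\beta_g\varepsilon}\Big)^{1/6}.\tag{69}$$

Therefore, plugging the two inequalities in (69) into (68) yields (65).

For the case of $\mu>0$, we have (62). Since $\mathrm{dist}(x^k,X_k^*)\le D$ and the $\varepsilon_k$'s are set to those in (46), the inequality in (62) indicates

$$T_K\le\sum_{k=0}^{K-1}\sqrt{\frac{L_*+\beta_kH}{\mu}}\,\log\left(\frac{(L_*+\beta_kH+\mu)\,D^2\sqrt{\beta_k}\sum_{t=0}^{K-1}\sqrt{\beta_t}}{C_\varepsilon}\right)+K=\sum_{k=0}^{K-1}\sqrt{\frac{L_*+\beta_kH}{\mu}}\left[\log\frac{D^2\sqrt{\beta_k}\sum_{t=0}^{K-1}\sqrt{\beta_t}}{C_\varepsilon}+\log\big(L_*+\beta_kH+\mu\big)\right]+K.\tag{70}$$

Therefore, plugging into (70) the inequality in (60), and the upper bounds of $\sum_{k}\sqrt{L_*+\beta_kH}$ and $\beta_k$ in (61) and (64) respectively, we obtain (66) and complete the proof.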
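The geometric-sum estimates used in the proofs of Theorems 6 and 7 can be sanity-checked numerically. The sketch below assumes the readings $\sum_k\sqrt{\beta_k}\le\sqrt{C_\beta(\sigma-1)/\varepsilon}/(\sqrt\sigma-1)$ for (60) and $\beta_k\le(C_\beta(\sigma-1)+\beta_g\varepsilon)/(\sigma\varepsilon)$ for (64); the function names and parameter values are illustrative only.

```python
import math


def geometric_penalties(C_beta, eps, K, sigma):
    """Setting 2: beta_k = beta_g * sigma**k; returns (beta_g, list of beta_k)."""
    beta_g = C_beta * (sigma - 1.0) / (eps * (sigma ** K - 1.0))
    return beta_g, [beta_g * sigma ** k for k in range(K)]


def sqrt_sum_bound(C_beta, eps, sigma):
    """Right-hand side of (60): bound on the sum of sqrt(beta_k)."""
    return math.sqrt(C_beta * (sigma - 1.0) / eps) / (math.sqrt(sigma) - 1.0)


def largest_penalty_bound(C_beta, eps, sigma, beta_g):
    """Right-hand side of (64): bound on every beta_k, attained at k = K-1."""
    return (C_beta * (sigma - 1.0) + beta_g * eps) / (sigma * eps)


if __name__ == "__main__":
    C_beta, eps, K, sigma = 1.0, 1e-3, 10, 10.0
    beta_g, betas = geometric_penalties(C_beta, eps, K, sigma)
    print(sum(math.sqrt(b) for b in betas), sqrt_sum_bound(C_beta, eps, sigma))
    print(max(betas), largest_penalty_bound(C_beta, eps, sigma, beta_g))
```

Neither bound depends on $K$ except through $\beta_g$, which is why the $H$-dependent terms in (55) and (65) carry no factor $K$.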

Remark 5 Let us compare the iteration complexity results in Theorems 6 and 7. We see that for the case of $\mu=0$, when $K>1$ and $\sigma$ is big, if $\sqrt{L_*/\varepsilon}$ dominates $\sqrt{HC_\beta}/\varepsilon$, the iteration complexity result in Theorem 7 is better than that in Theorem 6 (see the numerical results in Table 2), and if $\sqrt{HC_\beta}/\varepsilon$ dominates $\sqrt{L_*/\varepsilon}$, the two results are similar. For the case of $\mu>0$, when $K>1$, the iteration complexity result in Theorem 6 is better than that in Theorem 7.

4 Nonergodic convergence rate and iteration complexity

In this section, we show a nonergodic convergence rate result of Algorithm 1, by employing the relation between iALM and the inexact proximal point algorithm (iPPA). Throughout this section, we assume there is no affine equality constraint in (1), i.e., we consider the problem

$$\min_{x\in X}\ f_0(x),\quad\text{s.t. } f_i(x)\le0,\ i\in[m],\tag{71}$$

where $f_i, i=0,1,\ldots,m$, satisfy the assumptions in (2) through (3b). We do not include affine equality constraints for the purpose of directly applying existing results in [30, 32]. Although results similar to those in [30, 32] can possibly be shown for the equality and inequality constrained problem (1), we do not extend our discussion but instead formulate any affine equality constraint $a^\top x=b$ by two affine inequality constraints $a^\top x-b\le0$ and $-a^\top x+b\le0$ if there is any.

4.1 Relation between iALM and iPPA

Let $L_0(x,z)$ be the Lagrangian function of (71), namely,

$$L_0(x,z)=f_0(x)+\sum_{i=1}^m z_if_i(x),$$

and let $L_\beta(x,z)$ be the augmented Lagrangian function of (71), defined in the same way as that in (6). In addition, let $d_0(z)$ be the Lagrangian dual function, defined as

$$d_0(z)=\begin{cases}\min_{x\in X}L_0(x,z), & \text{if } z\ge0,\\ -\infty, & \text{otherwise},\end{cases}$$

and let $d_\beta(z)=\min_{x\in X}L_\beta(x,z)$ be the augmented Lagrangian dual function. Applying Algorithm 1 with $\rho_k=\beta_k$ to (71), we have iterates $\{(x^k,z^k)\}$ that satisfy:

$$L_{\beta_k}(x^{k+1},z^k)\le d_{\beta_k}(z^k)+\varepsilon_k,\tag{72a}$$

$$z^{k+1}=z^k+\beta_k\nabla_zL_{\beta_k}(x^{k+1},z^k).\tag{72b}$$

The iPPA applied to the Lagrangian dual problem $\max_z d_0(z)$ iteratively performs the updates:

$$z^{k+1}\approx M_{\beta_k}(z^k),\tag{73}$$

Yangyang Xu where the operator M β is the proximal mapping of βd 0, defined as M β z = argmaxd 0 u 1 u β u z. In 73, the approximation could be measured by the objective error as in 7a or by the gradient norm at the returned point z +1 ; see [13] for example. It was noted in [30] that d β z = max d 0u 1 u β u z, 74 and in addition, if ˆx X satisfies L β ˆx,z d β z+ε, then z+β z L β ˆx,z Mz ialm with updates in 7 reduces to ippa in 73 with approximation error βε. Therefore, z +1 M β z β ε. 75 4. Nonergodic convergence rate of ialm For ialm with updates in 7 on solving 71, [3, Theorem 4] establishes the following bounds on the objective error and feasibility violation: If in 7a, ε = 0,, [1, Theorem.] shows that f 0 x +1 f 0 x ε + z z +1 β, 76a f i x +1 z i z+1 i β, i [m]. 76b z z +1 β z0 z t=0 β. 77 t Therefore, combining the results in 76 with ε = 0, and 77, and also noting the boundedness of z from 40, one can easily obtain a nonergodic convergence rate result of exact ALM on solving 71. However, if ε > 0, we do not notice any existing result on estimating z z +1 β. In the following, we establish a bound on this quantity and thus show a nonergodic convergence rate result of ialm. Lemma 7 Given a positive integer K and a nonnegative number C ε, choose positive sequences {β } and {ε } such that β ε Cε. Let {x,z } K be the sequence generated from the updates in 7 with z 0 = 0 on solving 71. Then where we have assumed that 71 has a primal-dual solution x,z. z z +1 5 z + 7 Cε, 78

Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming 3 Proof. Let z +1 = M β z. Then from 74, it follows that 1 β z +1 z = d 0 z +1 d β z. 79 By the wea duality, it holds d 0 z +1 f 0 x. From 7a, we have Recall the definition of ψ β u,v in 5 and note ψ β u,v v β. Hence, Thus by 80, it holds that and Hence, L β x +1,z = f 0 x +1 + d β z L β x +1,z ε. 80 m d β z f 0 x +1 z ψ β f i x +1,z i f 0 x +1 z β. β ε, d 0 z +1 d β z f 0 x f 0 x +1 + z Since x,z is a primal-dual solution of 71, similar to 1, it holds that f 0 x f 0 x + f 0 x f 0 x +1 Therefore, from 81 and the above inequality, it follows that and noting 79, we have β +ε. 81 m zif i x 0, x X. 8 m zi f ix +1 76b z z +1 z. d 0 z +1 d β z z z +1 β z + z β +ε, z +1 z = β d0 z +1 d β z z z +1 z + z +β ε. By the triangle inequality, we have from 75 and the above inequality that z z +1 β z z +1 z + z +β ε + z + z +1 z + z + 3 z + z +1 β ε β ε + z + z + 3 Cε, 83

4 Yangyang Xu where the last inequality uses the Young s inequality and the fact β ε Cε. Through the same arguments as those in Lemma 6, one can show z z + C ε, 0 K, 84 i.e., the inequality in 40 with y = 0. Plugging the above bound on z into 83 gives the desired result in 78. Combining 76 and 78, we are able to establish the nonergodic convergence rate result of ialm on solving 71. Theorem 8 nonergodic convergence rate Under the same assumptions of Lemma 7, it holds that for any 0 < K, f0 x +1 f 0 x ε + z + C ε β 5 z + 7 Cε, 85a [fx +1 ] + 1 5 z + 7 Cε. 85b β Proof. Directly from 76, 78, and 84, we obtain 85b and Using 8, we have from 85b that f 0 x +1 f 0 x ε + z + C ε β f 0 x +1 f 0 x z β 5 z + 7 Cε. 86 5 z + 7 Cε, which together with 86 gives 85a. Remar 6 From the results in 85, we see that to have {x } to be a minimizing sequence of 71, we need β and ε 0 as. Hence, setting {β } to a constant sequence will not be a valid option. 4.3 Iteration complexity In this subsection, we set parameters according to Setting, and we estimate the iteration complexity of ialm on solving 71 by applying Nesterov s optimal first-order method to 7a. Again, note that the results in Theorem 8 do not need specific structure of 71 except convexity. Hence, if the problem has richer structures, one can apply more efficient methods to find x +1 that satisfies 7a. Theorem 9 Nonergodiciteration complexity Given apositive integer K and positive numbers C β,c ε, choose positive sequences {ρ } and {β } according to Setting. In addition, choose {ε } according to 44 for both cases of µ = 0 and µ > 0, or choose {ε } according to 45 for the case of µ = 0 and 46 for µ > 0.

Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming 5 Let {x,z } K be the sequence generated from Algorithm 1 with y = 0,, and z 0 = 0 on solving 71. Then f 0 x K f 0 x ε C ε C β + [fx K ] + εσ C β σ 1 εσ z C β σ 1 + C ε 5 z + 7 Cε, 87a 5 z + 7 Cε. 87b If {ε } is chosen according to 44 for both cases of µ = 0 and µ > 0, the total number T K of gradient evaluations is given in 55 and 56 respectively; if {ε } is set according to 45 for the case of µ = 0 and 46 for µ > 0, then T K is given in 65 for µ = 0 and 66 for µ > 0. Proof. Note that β is increasing with respect to. Hence, the ε given in both 45 and 46 is decreasing, and thus t=0 ε β tε t t=0 β ε C ε. t C β If {ε } is chosen according to 44 for both cases of µ = 0 and µ > 0, then the above bound on ε obviously holds. In addition, from 64, we have β C βσ 1. εσ Therefore, plugging into 85 the bounds on ε and β gives the desired results in 87. The bounds on the total number T K of gradient evaluations follow from the same arguments as in the proofs of Theorems 6 and 7. Hence, we complete the proof. Remar 7 From the results in 87, we see that if C β C ε + σ z σ 1 + C ε 5 z + 7 Cε, 88 then x K is an ε-optimal solution to 71. If z 1, C ε = z, and σ σ 1 1 e.g., σ = 10 is often used, then the C β in 88 is roughly 10 times of that in 53 by assuming no affine constraint. For the iteration L complexity, if ε dominates H z ε, then the nonergodic result is roughly 10 times of the ergodic result H z for both convex and strongly convex cases. If ε dominates, then the former would be roughly 10 times of the latter for the convex case, but still roughly 10 times for the strongly convex case. However, in either case, both ergodic and nonergodic results have the same order of complexity. 5 Related wors and comparison with existing results In this section, we review related wors and compare them to our results. 
Our review and comparison focus on convex optimization, but note that ALM has also been widely applied to nonconvex optimization problems; see [4–6] and the references therein.

Affinely constrained convex problems

Several recent works have established the convergence rate of ALM and its inexact version for affinely constrained convex problems:

$$\min_{x\in X}\ f_0(x),\quad\text{s.t. } Ax=b.\tag{89}$$

Assuming an exact solution to every $x$-subproblem, [14] first shows $O(1/k)$ convergence of ALM for smooth problems in terms of the dual objective and then accelerates the rate to $O(1/k^2)$ by applying Nesterov's extrapolation technique to the multiplier update. The results are extended to nonsmooth problems in [18], which uses a similar technique. By adapting parameters, [35] establishes $O(1/k^2)$ convergence of a linearized ALM in terms of primal objective and feasibility violation. The linearized ALM allows linearization of the smooth part in the objective but still assumes exact solvability of $x$-subproblems. When the objective is strongly convex, [17] proves $O(1/k^2)$ convergence of iALM with the extrapolation technique applied to the multiplier update. It requires summable errors and subproblems to be solved more and more accurately. However, it does not give an estimate on the total number of gradient evaluations on solving all subproblems to the required accuracies.

For smooth linearly constrained convex problems, [20] analyzes the iteration complexity of the iALM. It applies Nesterov's optimal first-order method to every $x$-subproblem and shows that $O(\varepsilon^{-7/4})$ gradient evaluations are required to reach an $\varepsilon$-optimal solution. Compared to this complexity, our results for the convex case are better by an order $O(\varepsilon^{-3/4})$. In addition, [20] modifies the iALM by solving a perturbed problem. The modified iALM requires $O(\varepsilon^{-1}|\log\varepsilon|^{3/4})$ gradient evaluations to produce an $\varepsilon$-optimal solution, and this order is worse than our results by an order $O(|\log\varepsilon|^{3/4})$.

Motivated by model predictive control, [23] also analyzes the iteration complexity of inexact dual gradient methods (iDGM) that are essentially iALMs. It shows that to reach an $\varepsilon$-optimal solution (see footnote 4), a nonaccelerated iDGM requires $O(\varepsilon^{-1})$ outer iterations and every $x$-subproblem solved to an accuracy $O(\varepsilon)$, and an accelerated iDGM requires $O(\varepsilon^{-1/2})$ outer iterations and every $x$-subproblem solved to an accuracy $O(\varepsilon^{3/2})$.

While the iteration complexity in [20] is estimated based on the best iterate, and that in [23] is ergodic, the recent work [21] establishes non-ergodic convergence of iALM. It requires $O(\varepsilon^{-2})$ gradient evaluations to reach an $\varepsilon$-optimal primal-dual solution $(\bar x,\bar y)$ in the sense that

$$\|A\bar x-b\|\le\sqrt\varepsilon,\qquad\langle\nabla g(\bar x)+A^\top\bar y,\ \bar x-x\rangle+h(\bar x)-h(x)\le\varepsilon,\ \forall x,\tag{90}$$

where it is assumed that $f_0=g+h$ in (89) and $g$ is Lipschitz differentiable. From the convexity of $g$, it follows from (90) that

$$f_0(\bar x)-f_0(x^*)\le\langle\nabla g(\bar x),\bar x-x^*\rangle+h(\bar x)-h(x^*)\le\varepsilon-\langle\bar y,A\bar x-b\rangle\le\varepsilon+\|\bar y\|\sqrt\varepsilon.$$

Hence, if $\bar y\ne0$, to have an $\varepsilon$-optimal solution by our Definition 1, the iteration complexity result in [21] would be $O(\varepsilon^{-4})$, which is $O(\varepsilon^{-3})$ worse than our nonergodic iteration complexity result in Theorem 9.

Another line of existing works on iALM assumes a two- or multi-block structure on the problem and simply performs one cycle of Gauss-Seidel updates to the block variables or updates one randomly selected block. Global sublinear convergence of these methods has also been established. Exhausting all such works is impossible and out of the scope of this paper. We refer interested readers to [7–10, 15, 28, 36, 37] and the references therein.

Footnote 4: [23] assumes every subproblem solved to the condition $\langle\nabla L_\beta(x^{k+1},y^k),\ x-x^{k+1}\rangle\ge-O(\varepsilon),\ \forall x\in X$, which is implied by $L_\beta(x^{k+1},y^k)-\min_{x\in X}L_\beta(x,y^k)\le O(\varepsilon^2)$ if $L_\beta$ is Lipschitz differentiable with respect to $x$.

General convex problems

When there are nonlinear inequality constraints, we do not find any work in the literature showing the global convergence rate of iALM, though its local convergence rate has been extensively studied (e.g., [3, 30, 32]). Many existing works on nonlinearly constrained convex problems employ the Lagrangian function instead of the augmented one and establish a global convergence rate through the dual subgradient approach (e.g., [22, 24, 25]). For general convex problems, these methods enjoy $O(1/\sqrt k)$ convergence, and for the strongly convex case, the rate can be improved to $O(1/k)$. To achieve an $\varepsilon$-optimal solution, compared to our results, their iteration complexity is $O(\varepsilon^{-1})$ times worse for the convex problems and $O(\varepsilon^{-1/2})$ times worse for the strongly convex problems.

Assuming Lipschitz continuity of $f_i$ for every $i\in[m]$, [39] proposes a new primal-dual type algorithm for nonlinearly constrained convex programs. At every iteration, it minimizes a proximal Lagrangian function and updates the multiplier in a novel way. With a sufficiently large proximal parameter that depends on the Lipschitz constants of the $f_i$'s, the algorithm converges at an $O(1/k)$ ergodic rate. The follow-up paper [38] focuses on smooth constrained convex problems and proposes a linearized variant of the algorithm in [39]. Assuming compactness of the set $X$, it also establishes $O(1/k)$ ergodic convergence of the linearized method.

Iteration complexity from existing results on iPPA

Through relating iALM and iPPA, an iteration complexity result can be obtained from existing results about iPPA to produce a near-optimal dual solution. On solving the problem $\min_z\phi(z)$, [13] analyzes the iPPA with the iterative update:

$$z^{k+1}\approx\operatorname*{arg\,min}_z\ \phi(z)+\frac{1}{2\beta_k}\|z-\hat z^k\|^2.$$

If the above approximation error satisfies

$$\|z^{k+1}-\mathrm{prox}_{\beta_k\phi}(\hat z^k)\|=O(1/k^a),\tag{91}$$

for a certain number $a>1$, and the parameter $\beta_k$ is increasing, then by choosing a specifically designed $\hat z^k$, [13] shows that

$$\phi(z^k)-\phi(z^*)=O(1/k^2)+O(1/k^{2a-1}).$$

From our discussion in section 4.1, if $\varepsilon_k=O\big(\frac{1}{k^{2a}\beta_k}\big)$ in (72a), then (91) holds with $\phi=-d_0$, and we thus obtain the convergence rate in terms of the dual function:

$$d_0(z^*)-d_0(z^k)=O(1/k^2)+O(1/k^{2a-1}).$$

Note that $z^k$ is bounded from the summability of $\beta_k\varepsilon_k$ and the proof of Lemma 6. Hence, setting $\beta_k$ to a constant for all $k$ and applying Nesterov's optimal first-order method to each subproblem in (72a), we need $O(k^a)$ gradient evaluations. Let $a=\frac32$. Then $K=O(1/\sqrt\varepsilon)$ iPPA iterations are required to obtain an $\varepsilon$-optimal dual solution, i.e., $d_0(z^*)-d_0(z^K)\le\varepsilon$, and the total number of gradient evaluations is

$$T_K=\sum_{k=1}^KO(k^{3/2})=O(K^{5/2})=O(\varepsilon^{-5/4}).$$

However, it is not clear how to measure the quality of the primal iterates.
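The count of gradient evaluations above can be made explicit by a short worked bound. It is a sketch that reads the exponent as $a=\frac32$ and writes the per-iteration cost as at most $c\,k^{3/2}$ and the number of outer iterations as $K\le c'\varepsilon^{-1/2}$, where $c$ and $c'$ are unspecified constants introduced here only for illustration. Since $s\mapsto s^{3/2}$ is increasing, each term is bounded by an integral over $[k,k+1]$:

```latex
\[
\sum_{k=1}^{K} c\,k^{3/2}
\;\le\; c\int_{1}^{K+1} s^{3/2}\,ds
\;\le\; \frac{2c}{5}\,(K+1)^{5/2},
\qquad\text{so with } K\le c'\varepsilon^{-1/2}:\quad
T_K \;=\; O\big(K^{5/2}\big) \;=\; O\big(\varepsilon^{-5/4}\big).
\]
```

This order sits between the $O(\varepsilon^{-1})$ bound of the earlier theorems and the $O(\varepsilon^{-7/4})$ bound quoted for the best-iterate analysis.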

6 Numerical results

In this section, we conduct numerical experiments on the quadratically constrained quadratic programming (QCQP):

$$\min_{x\in\mathbb R^n}\ \tfrac12x^\top Q_0x+c_0^\top x,\quad\text{s.t. }\tfrac12x^\top Q_jx+c_j^\top x+d_j\le0,\ j=1,\ldots,m,\quad x_i\in[l_i,u_i],\ i=1,\ldots,n.\tag{92}$$

The purpose of the tests is to verify the established theoretical results and compare the iALM with three different settings of parameters.

Three QCQP instances are made. The first two instances are convex, and the third one is strongly convex. For all three instances, we set $n=100$, $m=5$ and $l_i=-1$, $u_i=1$, $\forall i$. The vectors $c_j, j=0,1,\ldots,m$ are generated following the Gaussian distribution, and the scalars $d_j, j=1,\ldots,m$ are made negative. This way, all inequalities in (92) hold strictly at the origin $x=0$ (i.e., the Slater condition holds), and thus the KKT conditions are satisfied at the optimal solution. $Q_j, j=0,1,\ldots,m$ are randomly generated and symmetric positive semidefinite. $Q_0$ is rank-deficient for the first two instances and full-rank for the third one. The data in the first two instances are the same except $Q_0$, which in the second instance is 100 times that in the first instance.

For all tests, we set $\varepsilon=10^{-3}$, $C_\beta=1$, $C_\varepsilon=\|u-l\|$, and $K=10$, and the initial primal-dual point is set to the zero vector. The algorithm parameters $\{(\beta_k,\rho_k,\varepsilon_k)\}$ are set in three different ways corresponding to Theorems 5, 6, and 7 respectively, where $\sigma=10$ is used for the geometrically increasing penalty. On finding $x^{k+1}$ by applying Algorithm 2 to $\min_{x\in X}L_{\beta_k}(x,z^k)$, we terminate the algorithm if the iteration number exceeds $10^6$ or

$$\mathrm{dist}\big(-\nabla_xL_{\beta_k}(x^{k+1},z^k),\ N_X(x^{k+1})\big)\le\frac{\varepsilon_k}{\|u-l\|},\tag{93}$$

where $X=\prod_{i=1}^n[l_i,u_i]$. Since $L_{\beta_k}(x,z^k)$ is convex about $x$, and $\|u-l\|$ is the diameter of the feasible set $X$, the condition in (93) guarantees that $x^{k+1}$ satisfies (7). We report the difference of objective value and optimal value, and the feasibility violation at both the actual iterate $x^k$ and the weighted averaged point $\bar x^k=\sum_{t=1}^k\beta_tx^t\big/\sum_{t=1}^k\beta_t$. The optimal solution is computed by CVX [11].

In addition, to compare the iteration complexity, we also report the number of gradient evaluations and function evaluations for each outer iteration. The results are provided in Tables 1, 2, and 3 respectively for the three instances. In Table 1, we also report the results from the quadratic penalty method, which corresponds to setting $K=1$ (see the discussions in Remark 3). From the results, we can clearly see that the quadratic penalty method is worse, namely, running a single iALM step with a big penalty parameter is significantly worse than running multiple steps with smaller penalty parameters. Also, we see that the iALM with three different settings yields the last actual iterate $x^K$ and the averaged point $\bar x^K$ of similar accuracy. For all three instances, to produce similarly accurate solutions, the iALM with constant penalty requires more gradient and function evaluations than that with geometrically increasing penalty. Furthermore, the iALM with geometrically increasing penalty and constant error requires the fewest gradient and function evaluations on the first and third instances. However, the setting of geometrically increasing penalty and adaptive error is the best for iALM on the second instance. That is because the gradient Lipschitz constant of the objective in the second instance is significantly bigger than that in the first instance, in which case the bound on $T_K$ in (65) is smaller than that in (55).
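For a box feasible set, the stationarity measure used in the stopping test (93) has a simple componentwise form. The following is a minimal sketch of that computation, not the code used for the reported experiments; the function names, the tolerance `tol` for detecting active bounds, and the sample vectors are assumptions made for illustration.

```python
import math


def box_stationarity(x, grad, lower, upper, tol=0.0):
    """Euclidean distance from -grad to the normal cone of the box at x.

    Componentwise: in the interior the normal cone is {0}; at a lower bound it
    is (-inf, 0]; at an upper bound it is [0, +inf).
    """
    total = 0.0
    for xi, gi, li, ui in zip(x, grad, lower, upper):
        at_lower = xi <= li + tol
        at_upper = xi >= ui - tol
        if at_lower and at_upper:
            r = 0.0              # degenerate interval: normal cone is the whole line
        elif at_lower:
            r = max(-gi, 0.0)    # only a descent direction pointing inward counts
        elif at_upper:
            r = max(gi, 0.0)
        else:
            r = abs(gi)
        total += r * r
    return math.sqrt(total)


def should_stop(x, grad, lower, upper, eps_k):
    """Stopping test in the spirit of (93): measure <= eps_k / ||upper - lower||."""
    diameter = math.sqrt(sum((ui - li) ** 2 for li, ui in zip(lower, upper)))
    return box_stationarity(x, grad, lower, upper) <= eps_k / diameter


if __name__ == "__main__":
    lower, upper = [-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]
    x = [-1.0, 0.0, 1.0]
    print(box_stationarity(x, [2.0, 0.5, -3.0], lower, upper))
    print(should_stop(x, [2.0, 0.0, -3.0], lower, upper, 1e-3))
```

In floating-point arithmetic a small positive `tol` is usually needed so that iterates sitting numerically on a bound are treated as active.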