On the acceleration of augmented Lagrangian method for linearly constrained optimization

Bingsheng He* and Xiaoming Yuan**

October, 2010

* Department of Mathematics and National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China. This author was supported by the NSFC Grant 10971095 and the NSF of Jiangsu Province Grant BK2008255. Email: hebma@nju.edu.cn
** Department of Mathematics, Hong Kong Baptist University, Hong Kong, China. This author was supported in part by HKRGC 203009. Email: xmyuan@hkbu.edu.hk

Abstract. The classical augmented Lagrangian method (ALM) plays a fundamental role in the algorithmic development of constrained optimization. In this paper, we mainly show that Nesterov's influential acceleration techniques can be applied to accelerate ALM, thus yielding an accelerated ALM whose iteration-complexity is O(1/k^2) for linearly constrained convex programming. As a by-product, we also show easily that the convergence rate of the original ALM is O(1/k).

Keywords. Convex programming, augmented Lagrangian method, acceleration.

1 Introduction

The classical augmented Lagrangian method (ALM), also well known as the method of multipliers, has been playing a fundamental role in the algorithmic development of constrained optimization ever since its introduction in [2] and [9]. The existing literature on ALM is too vast to be listed here, and we only refer to [1, 8] for comprehensive studies. In this paper, we restrict our discussion to convex minimization with linear equality constraints:

    (P)    min { f(x) | Ax = b, x ∈ X },                                         (1.1)

where f(x): R^n → R is a differentiable convex function, A ∈ R^{m×n}, b ∈ R^m, and X is a closed convex set in R^n. Throughout we assume that the solution set of (1.1), denoted by X*, is nonempty.

Note that the Lagrange function of problem (1.1) is

    L(x, λ) = f(x) − λ^T (Ax − b),                                               (1.2)

where λ ∈ R^m is the Lagrange multiplier. Then, the dual problem of (1.1) is

    (D)    max_{x ∈ X, λ ∈ R^m}  L(x, λ)
           s.t.  (x′ − x)^T ∇_x L(x, λ) ≥ 0,  ∀ x′ ∈ X.                          (1.3)

We denote the solution set of (1.3) by X* × Λ*.

As analyzed in [1], the ALM merges the penalty idea with the primal-dual and Lagrangian philosophy, and each of its iterations consists of the task of minimizing the augmented Lagrangian function of (1.1) and the task of updating the Lagrange multiplier. More specifically, starting with λ^0 ∈ R^m, the k-th iteration of ALM for (1.1) is

    x^{k+1} = Argmin { f(x) − (λ^k)^T (Ax − b) + (β/2) ||Ax − b||^2 | x ∈ X },
    λ^{k+1} = λ^k − β (Ax^{k+1} − b),                                            (1.4)

where β > 0 is the penalty parameter for the violation of the linear constraints.
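To make the scheme (1.4) concrete, the following small Python sketch runs it on an assumed toy instance; it is purely illustrative and not part of the analysis. It takes f(x) = (1/2)||x − p||^2 with X = R^n, so that the x-subproblem reduces to a linear system; the data p, A, b, the penalty β, and the function name alm_step are all arbitrary choices introduced here for illustration.

    import numpy as np

    def alm_step(p, A, b, lam, beta):
        # x^{k+1} minimizes f(x) - lam'(Ax - b) + (beta/2)||Ax - b||^2 over R^n;
        # for f(x) = 0.5||x - p||^2 the optimality condition is the linear system
        #   (I + beta*A'A) x = p + A'lam + beta*A'b.
        n = A.shape[1]
        x = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                            p + A.T @ lam + beta * A.T @ b)
        lam_new = lam - beta * (A @ x - b)   # multiplier update of (1.4)
        return x, lam_new

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 8))
    b = rng.standard_normal(3)
    p = rng.standard_normal(8)
    lam = np.zeros(3)                        # lambda^0
    for k in range(200):
        x, lam = alm_step(p, A, b, lam, beta=1.0)
    print(np.linalg.norm(A @ x - b))         # feasibility residual, tends to 0

On such a small, well-conditioned instance the residual should shrink rapidly, although the analysis below concerns the objective residual rather than feasibility.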
We refer to [10] for the relationship of ALM with the classical proximal point algorithm, which was originally proposed in [4] and concretely developed in [11]. Note that among the significant differences of ALM from penalty methods is that the penalty parameter β can be kept fixed; it is not necessary to force it to infinity, see e.g. [8]. In this paper, we use a symmetric positive definite matrix H_k to denote the penalty parameter, indicating the eligibility of adjusting the value of this parameter dynamically, even though specific strategies for this adjustment will not be addressed. More specifically, let {H_k} be a given sequence of m × m symmetric positive definite matrices satisfying

    0 ≺ H_k ⪯ H_{k+1},  ∀ k ≥ 0.

Then, the k-th iteration of ALM with matrix penalty parameter for (1.1) can be written as

    x^{k+1} = Argmin { f(x) − (λ^k)^T (Ax − b) + (1/2) ||Ax − b||^2_{H_k} | x ∈ X },
    λ^{k+1} = λ^k − H_k (Ax^{k+1} − b).                                          (1.5)

Inspired by the attractive analysis of iteration-complexity for some gradient methods initiated mainly by Nesterov (see e.g. [5, 6, 7]), in this paper we are interested in analyzing the iteration-complexity of the ALM and discussing the possibility of accelerating ALM with Nesterov's acceleration schemes. More specifically, in Section 2 we first show that the iteration-complexity of the ALM is O(1/k) in terms of the objective residual of the associated Lagrange function of (1.1). Then, in Section 3, with the acceleration scheme in [6], we propose an accelerated ALM whose iteration-complexity is O(1/k^2). Finally, some conclusions are given in Section 4.

2 The complexity of ALM

In this section, we mainly show that the iteration-complexity of the classical ALM is O(1/k) in terms of the objective residual of the associated Lagrange function L(x, λ) defined in (1.2). Before that, we need to justify the rationale of estimating the convergence rate of ALM in terms of the objective residual of L(x, λ), and to prove some properties of the sequence generated by ALM which are critical for the complexity analysis addressed later.

According to (1.3), a pair (x, λ) ∈ X × R^m is dual feasible if and only if

    (x′ − x)^T ( ∇f(x) − A^T λ ) ≥ 0,  ∀ x′ ∈ X.                                 (2.1)

Note that the minimization task regarding x^{k+1} in the ALM scheme (1.5) is characterized by the following variational inequality:

    (x − x^{k+1})^T { ∇f(x^{k+1}) − A^T λ^k + A^T H_k (Ax^{k+1} − b) } ≥ 0,  ∀ x ∈ X.

Therefore, substituting the λ^{k+1}-related equation of (1.5) into the above variational inequality, we have

    (x − x^{k+1})^T { ∇f(x^{k+1}) − A^T λ^{k+1} } ≥ 0,  ∀ x ∈ X.                 (2.2)

In other words, the pair (x^{k+1}, λ^{k+1}) generated by the k-th iteration of ALM is feasible to the dual problem (1.3). On the other hand, a solution (x*, λ*) ∈ X* × Λ* of (1.3) is also feasible. We thus have that the sequence {L(x*, λ*) − L(x^{k+1}, λ^{k+1})} is non-negative. This explains the rationale of estimating the convergence rate of ALM in terms of the objective residual of L(x, λ).
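When X = R^n, condition (2.1) reduces to ∇f(x) − A^T λ = 0, so the dual feasibility (2.2) of an ALM iterate can be checked directly in the toy quadratic setting sketched in Section 1; the snippet below is again only an illustration with arbitrary data.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((3, 8))
    b = rng.standard_normal(3)
    p = rng.standard_normal(8)
    beta = 1.0
    lam = rng.standard_normal(3)            # an arbitrary lambda^k
    # One pass of (1.5) with H_k = beta*I, f(x) = 0.5||x - p||^2:
    x_new = np.linalg.solve(np.eye(8) + beta * A.T @ A,
                            p + A.T @ lam + beta * A.T @ b)
    lam_new = lam - beta * (A @ x_new - b)
    # (2.2) with X = R^n: grad f(x^{k+1}) - A' lam^{k+1} = 0, where grad f(x) = x - p.
    print(np.linalg.norm((x_new - p) - A.T @ lam_new))   # close to machine precision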
Now, we present some properties of the sequence generated by the ALM in the following lemmas. Although their proofs are elementary, these lemmas are critical for deriving the main results on iteration-complexity later.

Lemma 2.1. For given λ^k, let (x^{k+1}, λ^{k+1}) be generated by the ALM (1.5). Then, for any feasible solution (x, λ) of the dual problem (1.3), we have

    L(x^{k+1}, λ^{k+1}) − L(x, λ) ≥ ||λ^k − λ^{k+1}||^2_{H_k^{-1}} + (λ − λ^k)^T H_k^{-1} (λ^k − λ^{k+1}).   (2.3)

Proof. First, using the convexity of f we obtain

    L(x^{k+1}, λ^{k+1}) − L(x, λ) = f(x^{k+1}) − f(x) + λ^T (Ax − b) − (λ^{k+1})^T (Ax^{k+1} − b)
                                  ≥ (x^{k+1} − x)^T ∇f(x) + λ^T (Ax − b) − (λ^{k+1})^T (Ax^{k+1} − b).      (2.4)

Since (x, λ) is a feasible solution of the dual problem and x^{k+1} ∈ X, setting x′ = x^{k+1} in (2.1) we obtain

    (x^{k+1} − x)^T ∇f(x) ≥ (x^{k+1} − x)^T A^T λ = λ^T A (x^{k+1} − x).

Substituting the last inequality into the right-hand side of (2.4), we obtain

    L(x^{k+1}, λ^{k+1}) − L(x, λ) ≥ λ^T A (x^{k+1} − x) + λ^T (Ax − b) − (λ^{k+1})^T (Ax^{k+1} − b)
        = (λ − λ^{k+1})^T (Ax^{k+1} − b)
        = (λ − λ^{k+1})^T H_k^{-1} (λ^k − λ^{k+1})      (using (1.5))
        = ||λ^k − λ^{k+1}||^2_{H_k^{-1}} + (λ − λ^k)^T H_k^{-1} (λ^k − λ^{k+1}).

The assertion of this lemma is proved.

Lemma 2.2. For given λ^k, let λ^{k+1} be generated by the ALM (1.5). Then we have

    ||λ^{k+1} − λ*||^2_{H_k^{-1}} ≤ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^k − λ^{k+1}||^2_{H_k^{-1}} − 2 ( L(x*, λ*) − L(x^{k+1}, λ^{k+1}) ),   ∀ (x*, λ*) ∈ X* × Λ*.   (2.5)

Proof. Since (x*, λ*) is dual feasible, by setting (x, λ) = (x*, λ*) in (2.3), we obtain

    (λ* − λ^k)^T H_k^{-1} (λ^k − λ^{k+1}) ≤ −||λ^k − λ^{k+1}||^2_{H_k^{-1}} − ( L(x*, λ*) − L(x^{k+1}, λ^{k+1}) ).

Using the above inequality and by a simple manipulation, we obtain

    ||λ^{k+1} − λ*||^2_{H_k^{-1}} = ||(λ^k − λ*) − (λ^k − λ^{k+1})||^2_{H_k^{-1}}
        = ||λ^k − λ*||^2_{H_k^{-1}} − 2 (λ^k − λ*)^T H_k^{-1} (λ^k − λ^{k+1}) + ||λ^k − λ^{k+1}||^2_{H_k^{-1}}
        ≤ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^k − λ^{k+1}||^2_{H_k^{-1}} − 2 ( L(x*, λ*) − L(x^{k+1}, λ^{k+1}) ).

The assertion of this lemma is proved.
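In particular, since L(x*, λ*) − L(x^{k+1}, λ^{k+1}) ≥ 0, Lemma 2.2 says that ||λ^k − λ*||_{H_k^{-1}} decreases monotonically. This can be observed numerically in the same illustrative quadratic setting (X = R^n, H_k ≡ βI), where λ* is available from the KKT system A A^T λ* = b − Ap:

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 8))
    b = rng.standard_normal(3)
    p = rng.standard_normal(8)
    beta = 1.0
    lam_star = np.linalg.solve(A @ A.T, b - A @ p)   # KKT: A A' lam* = b - A p
    lam = np.zeros(3)
    for k in range(10):
        x = np.linalg.solve(np.eye(8) + beta * A.T @ A,
                            p + A.T @ lam + beta * A.T @ b)
        lam = lam - beta * (A @ x - b)
        print(k + 1, np.linalg.norm(lam - lam_star))  # monotonically decreasing, cf. (2.5)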
The following theorem implies the global convergence of the ALM (1.5).

Theorem 2.3. Let (x^{k+1}, λ^{k+1}) be generated by the ALM (1.5). Then for any k ≥ 0, we have

    L(x^{k+1}, λ^{k+1}) ≥ L(x^k, λ^k) + ||λ^k − λ^{k+1}||^2_{H_k^{-1}}           (2.6)

and

    ||λ^{k+1} − λ*||^2_{H_{k+1}^{-1}} ≤ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^k − λ^{k+1}||^2_{H_k^{-1}}.   (2.7)

Moreover, if H_k ≡ H, we have

    ||Ax^{k+1} − b||^2_H ≤ ||Ax^k − b||^2_H − ||A(x^k − x^{k+1})||^2_H.           (2.8)

Proof. The first assertion (2.6) is derived immediately from (2.3) by setting (x, λ) = (x^k, λ^k). Since L(x^{k+1}, λ^{k+1}) ≤ L(x*, λ*), it follows from (2.5) that

    ||λ^{k+1} − λ*||^2_{H_k^{-1}} ≤ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^k − λ^{k+1}||^2_{H_k^{-1}}.

Because H_{k+1} ⪰ H_k, the second assertion (2.7) follows from the last inequality directly.

Now, we prove the third assertion (2.8). Setting x = x^k in (2.2), we obtain

    (x^k − x^{k+1})^T ( ∇f(x^{k+1}) − A^T λ^{k+1} ) ≥ 0.

Similarly, we have

    (x^{k+1} − x^k)^T ( ∇f(x^k) − A^T λ^k ) ≥ 0.

Adding the above two inequalities and using the monotonicity of ∇f, we obtain

    (x^k − x^{k+1})^T A^T (λ^k − λ^{k+1}) ≥ 0.

By using λ^{k+1} = λ^k − H (Ax^{k+1} − b) (and the assumption H_k ≡ H), the last inequality implies that

    (x^k − x^{k+1})^T A^T H (Ax^{k+1} − b) ≥ 0.

Using the above inequality in the identity

    ||Ax^k − b||^2_H = ||Ax^{k+1} − b||^2_H + ||A(x^k − x^{k+1})||^2_H + 2 (Ax^{k+1} − b)^T H A (x^k − x^{k+1}),

we obtain

    ||Ax^k − b||^2_H ≥ ||Ax^{k+1} − b||^2_H + ||A(x^k − x^{k+1})||^2_H,

and thus the third assertion (2.8) is proved.

Remark 2.4. The inequality (2.7) essentially implies the global convergence of the ALM (1.5) with dynamically-adjusted matrix penalty parameter. In fact, it follows from (2.7) that

    Σ_{l=0}^{∞} ||λ^l − λ^{l+1}||^2_{H_l^{-1}} ≤ ||λ^0 − λ*||^2_{H_0^{-1}},

which instantly implies that

    lim_{k→∞} ||λ^k − λ^{k+1}||^2_{H_k^{-1}} = 0.

In the following we show that the sequence of function values {L(x^k, λ^k)} converges to the optimal value L(x*, λ*) at a rate of convergence that is no worse than O(1/k). Hence, the iteration-complexity of the ALM (1.5) is shown to be O(1/k) in terms of the objective residual of the Lagrange function L(x, λ).

Theorem 2.5. Let (x^k, λ^k) be generated by the ALM (1.5). Then, for any k ≥ 1, we have

    L(x*, λ*) − L(x^k, λ^k) ≤ ||λ^0 − λ*||^2_{H_0^{-1}} / (2k),   ∀ (x*, λ*) ∈ X* × Λ*.   (2.9)

Proof. Due to H_{k+1} ⪰ H_k, it follows from Lemma 2.2 that, for all j ≥ 0, we have

    2 ( L(x^{j+1}, λ^{j+1}) − L(x*, λ*) ) ≥ ||λ^{j+1} − λ*||^2_{H_{j+1}^{-1}} − ||λ^j − λ*||^2_{H_j^{-1}} + ||λ^j − λ^{j+1}||^2_{H_j^{-1}},   ∀ (x*, λ*) ∈ X* × Λ*.

Summing the above inequality over j = 0, 1, ..., k−1, we obtain

    2 Σ_{j=0}^{k−1} ( L(x^{j+1}, λ^{j+1}) − L(x*, λ*) ) ≥ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^0 − λ*||^2_{H_0^{-1}} + Σ_{j=0}^{k−1} ||λ^j − λ^{j+1}||^2_{H_j^{-1}}.   (2.10)

By using Lemma 2.1 for k = j and (x, λ) = (x^j, λ^j), we get

    L(x^{j+1}, λ^{j+1}) − L(x^j, λ^j) ≥ ||λ^j − λ^{j+1}||^2_{H_j^{-1}}.
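The bound (2.9) can also be observed numerically in the illustrative quadratic setting of the earlier sketches (H_k ≡ βI and λ^0 = 0, so that ||λ^0 − λ*||^2_{H_0^{-1}} = ||λ*||^2 / β); on this instance the assertion below never fails:

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((3, 8))
    b = rng.standard_normal(3)
    p = rng.standard_normal(8)
    beta = 1.0

    def L(x, lam):                       # the Lagrange function (1.2)
        return 0.5 * np.linalg.norm(x - p) ** 2 - lam @ (A @ x - b)

    lam_star = np.linalg.solve(A @ A.T, b - A @ p)
    x_star = p + A.T @ lam_star          # KKT: x* = p + A' lam*
    lam = np.zeros(3)
    for k in range(1, 51):
        x = np.linalg.solve(np.eye(8) + beta * A.T @ A,
                            p + A.T @ lam + beta * A.T @ b)
        lam = lam - beta * (A @ x - b)
        # Objective residual; nonnegative by dual feasibility of (x^k, lam^k).
        gap = L(x_star, lam_star) - L(x, lam)
        assert gap <= lam_star @ lam_star / (2 * beta * k) + 1e-12   # (2.9)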
Multiplying the last inequality by 2j and summing over j = 0, 1, ..., k−1, it follows that

    2 Σ_{j=0}^{k−1} j ( L(x^{j+1}, λ^{j+1}) − L(x^j, λ^j) ) ≥ 2 Σ_{j=0}^{k−1} j ||λ^j − λ^{j+1}||^2_{H_j^{-1}},

which can be simplified into

    2 ( (k−1) L(x^k, λ^k) − Σ_{j=1}^{k−1} L(x^j, λ^j) ) ≥ 2 Σ_{j=0}^{k−1} j ||λ^j − λ^{j+1}||^2_{H_j^{-1}}.   (2.11)

Adding (2.10) and (2.11), we get

    2k ( L(x^k, λ^k) − L(x*, λ*) ) ≥ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^0 − λ*||^2_{H_0^{-1}} + Σ_{j=0}^{k−1} (2j + 1) ||λ^j − λ^{j+1}||^2_{H_j^{-1}},

and hence it follows that

    L(x*, λ*) − L(x^k, λ^k) ≤ ||λ^0 − λ*||^2_{H_0^{-1}} / (2k).

The proof is complete.

3 An accelerated ALM

In this section, we show that the classical ALM (1.5) can be accelerated by the influential acceleration techniques initiated by Nesterov in [6]. As a result, an accelerated ALM with the convergence rate O(1/k^2) for solving (1.3) is proposed.

For the convenience of presenting the accelerated ALM, from now on we use (x̃^k, λ̃^k), rather than (x^{k+1}, λ^{k+1}), to denote the iterate generated by the ALM scheme (1.5). Namely, with the given λ^k, the new iterate generated by ALM for (1.1) is (x̃^k, λ̃^k):

    x̃^k = Argmin { f(x) − (λ^k)^T (Ax − b) + (1/2) ||Ax − b||^2_{H_k} | x ∈ X },
    λ̃^k = λ^k − H_k (A x̃^k − b).                                                (3.1)

Accordingly, Lemmas 2.1 and 2.2 can be rewritten as the following lemmas.

Lemma 3.1. For given λ^k, let (x̃^k, λ̃^k) be generated by the ALM (3.1). Then, for any feasible solution (x, λ) of the dual problem (1.3), we have

    L(x̃^k, λ̃^k) − L(x, λ) ≥ ||λ^k − λ̃^k||^2_{H_k^{-1}} + (λ − λ^k)^T H_k^{-1} (λ^k − λ̃^k).   (3.2)

Lemma 3.2. For given λ^k, let (x̃^k, λ̃^k) be generated by the ALM (3.1). Then we have

    ||λ̃^k − λ*||^2_{H_k^{-1}} ≤ ||λ^k − λ*||^2_{H_k^{-1}} − ||λ^k − λ̃^k||^2_{H_k^{-1}} − 2 ( L(x*, λ*) − L(x̃^k, λ̃^k) ),   ∀ (x*, λ*) ∈ X* × Λ*.   (3.3)

Then, the accelerated ALM for (1.1) is as follows.

An accelerated augmented Lagrangian method (AALM)

Step 0. Take λ̃^0 ∈ R^m. Set λ^1 = λ̃^0 and t_1 = 1.

Step k (k ≥ 1). Let (x̃^k, λ̃^k) be generated by the original ALM (3.1). Set

    t_{k+1} = ( 1 + √(1 + 4 t_k^2) ) / 2                                          (3.4a)

and

    λ^{k+1} = λ̃^k + (t_k / t_{k+1}) (λ̃^k − λ^k) + ( (t_k − 1) / t_{k+1} ) (λ̃^k − λ̃^{k−1}).   (3.4b)
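For concreteness, here is a minimal Python sketch of the AALM under the same illustrative assumptions as before (f(x) = (1/2)||x − p||^2, X = R^n, H_k ≡ βI); each pass of the loop performs one ALM step (3.1) followed by the updates (3.4a) and (3.4b). The function name aalm is introduced here for illustration only.

    import numpy as np

    def aalm(p, A, b, beta=1.0, iters=100):
        m, n = A.shape
        lam_tilde_prev = np.zeros(m)     # tilde-lambda^0
        lam = lam_tilde_prev.copy()      # lambda^1 = tilde-lambda^0
        t = 1.0                          # t_1 = 1
        for _ in range(iters):
            # (tilde-x^k, tilde-lambda^k) from one ALM pass (3.1):
            x = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                                p + A.T @ lam + beta * A.T @ b)
            lam_tilde = lam - beta * (A @ x - b)
            t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0        # (3.4a)
            lam = (lam_tilde                                         # (3.4b)
                   + (t / t_next) * (lam_tilde - lam)
                   + ((t - 1.0) / t_next) * (lam_tilde - lam_tilde_prev))
            lam_tilde_prev, t = lam_tilde, t_next
        return x, lam_tilde

Note that, per iteration, the AALM costs essentially the same as the ALM: one subproblem solve plus a few vector operations.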
We first prove some lemmas before establishing the main result.

Lemma 3.3. The sequence {t_k} generated by (3.4a) with t_1 = 1 satisfies

    t_k ≥ (k + 1) / 2,   ∀ k ≥ 1.                                                 (3.5)

Proof. Elementary by induction.
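The bound (3.5) is also easy to sanity-check numerically; the following loop verifies it for the first thousand indices:

    # t_1 = 1; verify t_k >= (k + 1)/2 along the recursion (3.4a).
    t = 1.0
    for k in range(1, 1001):
        assert t >= (k + 1) / 2
        t = (1 + (1 + 4 * t * t) ** 0.5) / 2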
For the coming analysis, we use the notations

    v_k := L(x*, λ*) − L(x̃^k, λ̃^k)   and   u_k := t_k (2λ̃^k − λ^k − λ̃^{k−1}) + λ̃^{k−1} − λ*.   (3.6)

Lemma 3.4. The sequences {λ^k} and {λ̃^k} generated by the proposed AALM satisfy

    4 t_k^2 v_k − 4 t_{k+1}^2 v_{k+1} ≥ ||u_{k+1}||^2_{H_{k+1}^{-1}} − ||u_k||^2_{H_{k+1}^{-1}},   ∀ k ≥ 1,   (3.7)

where v_k and u_k are defined in (3.6).

Proof. By using Lemma 3.1 for k + 1, setting (x, λ) = (x̃^k, λ̃^k) and (x, λ) = (x*, λ*), we get

    L(x̃^{k+1}, λ̃^{k+1}) − L(x̃^k, λ̃^k) ≥ ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + (λ̃^k − λ^{k+1})^T H_{k+1}^{-1} (λ^{k+1} − λ̃^{k+1})

and

    L(x̃^{k+1}, λ̃^{k+1}) − L(x*, λ*) ≥ ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + (λ* − λ^{k+1})^T H_{k+1}^{-1} (λ^{k+1} − λ̃^{k+1}),

respectively. Using the definition of v_k, the last two inequalities can be written as

    v_k − v_{k+1} ≥ ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + (λ̃^k − λ^{k+1})^T H_{k+1}^{-1} (λ^{k+1} − λ̃^{k+1})   (3.8)

and

    −v_{k+1} ≥ ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + (λ* − λ^{k+1})^T H_{k+1}^{-1} (λ^{k+1} − λ̃^{k+1}).   (3.9)

To get a relation between v_k and v_{k+1}, we multiply (3.8) by (t_{k+1} − 1) and add it to (3.9):

    (t_{k+1} − 1) v_k − t_{k+1} v_{k+1} ≥ t_{k+1} ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + (λ̃^{k+1} − λ^{k+1})^T H_{k+1}^{-1} ( t_{k+1} λ^{k+1} − (t_{k+1} − 1) λ̃^k − λ* ).

Multiplying the last inequality by t_{k+1} and using

    t_k^2 = t_{k+1}^2 − t_{k+1},   which follows from t_{k+1} = ( 1 + √(1 + 4 t_k^2) ) / 2 as in (3.4a),

yields

    t_k^2 v_k − t_{k+1}^2 v_{k+1} ≥ t_{k+1}^2 ||λ^{k+1} − λ̃^{k+1}||^2_{H_{k+1}^{-1}} + t_{k+1} (λ̃^{k+1} − λ^{k+1})^T H_{k+1}^{-1} ( t_{k+1} λ^{k+1} − (t_{k+1} − 1) λ̃^k − λ* )
        = ( t_{k+1} λ̃^{k+1} − t_{k+1} λ^{k+1} )^T H_{k+1}^{-1} ( t_{k+1} λ̃^{k+1} − (t_{k+1} − 1) λ̃^k − λ* ).   (3.10)

Applying the identity

    (b − a)^T H_{k+1}^{-1} (b − c) = (1/4) ||2b − a − c||^2_{H_{k+1}^{-1}} − (1/4) ||a − c||^2_{H_{k+1}^{-1}}

(since x^T y = (1/4)||x + y||^2 − (1/4)||x − y||^2) to the right-hand side of (3.10) with

    a := t_{k+1} λ^{k+1},   b := t_{k+1} λ̃^{k+1},   c := (t_{k+1} − 1) λ̃^k + λ*,

we get

    t_k^2 v_k − t_{k+1}^2 v_{k+1} ≥ (1/4) || t_{k+1} (2λ̃^{k+1} − λ^{k+1} − λ̃^k) + λ̃^k − λ* ||^2_{H_{k+1}^{-1}} − (1/4) || t_{k+1} (λ^{k+1} − λ̃^k) + λ̃^k − λ* ||^2_{H_{k+1}^{-1}}.

Using the notation u_{k+1} = t_{k+1} (2λ̃^{k+1} − λ^{k+1} − λ̃^k) + λ̃^k − λ* (see (3.6)) and multiplying by 4, the last inequality can be written as

    4 t_k^2 v_k − 4 t_{k+1}^2 v_{k+1} ≥ ||u_{k+1}||^2_{H_{k+1}^{-1}} − || t_{k+1} (λ^{k+1} − λ̃^k) + λ̃^k − λ* ||^2_{H_{k+1}^{-1}}.   (3.11)

In order to write the inequality (3.11) in the form (3.7), we need only to set

    t_{k+1} (λ^{k+1} − λ̃^k) + λ̃^k − λ* = t_k (2λ̃^k − λ^k − λ̃^{k−1}) + λ̃^{k−1} − λ* = u_k.

From the last equality (in which λ* cancels) we obtain

    λ^{k+1} = λ̃^k + (t_k / t_{k+1}) (λ̃^k − λ^k) + ( (t_k − 1) / t_{k+1} ) (λ̃^k − λ̃^{k−1}).

This is just the form (3.4b) in the accelerated multi-step version of the ALM, and the lemma is proved.

Corollary 3.5. Let v_k and u_k be defined in (3.6). Then, we have

    4 t_k^2 v_k ≤ 4 t_1^2 v_1 + ||u_1||^2_{H_1^{-1}},   ∀ k ≥ 1.   (3.12)

Proof. Again, because H_{k+1} ⪰ H_k, from (3.7) we obtain

    4 t_k^2 v_k − 4 t_{k+1}^2 v_{k+1} ≥ ||u_{k+1}||^2_{H_{k+1}^{-1}} − ||u_k||^2_{H_k^{-1}}.

Since {v_k} is a non-negative sequence, the last inequality implies (3.12) immediately.

Now, we are ready to show that the iteration-complexity of the proposed AALM is O(1/k^2).

Theorem 3.6. Let {λ̃^k} and {λ^k} be generated by the proposed AALM. Then, for any k ≥ 1, we have

    L(x*, λ*) − L(x̃^k, λ̃^k) ≤ ||λ̃^0 − λ*||^2_{H_1^{-1}} / (k + 1)^2,   ∀ (x*, λ*) ∈ X* × Λ*.   (3.13)

Proof. Using the definition of v_k in (3.6), it follows from (3.12) that

    L(x*, λ*) − L(x̃^k, λ̃^k) = v_k ≤ ( 4 t_1^2 v_1 + ||u_1||^2_{H_1^{-1}} ) / ( 4 t_k^2 ).

Combining with the fact t_k ≥ (k + 1)/2 (see (3.5)), it yields

    L(x*, λ*) − L(x̃^k, λ̃^k) ≤ ( 4 t_1^2 v_1 + ||u_1||^2_{H_1^{-1}} ) / (k + 1)^2.   (3.14)

Since t_1 = 1 and λ^1 = λ̃^0, and using the definition of u_k given in (3.6), we have

    4 t_1^2 v_1 = 4 v_1 = 4 ( L(x*, λ*) − L(x̃^1, λ̃^1) )

and

    ||u_1||^2_{H_1^{-1}} = || 2λ̃^1 − λ^1 − λ* ||^2_{H_1^{-1}}.   (3.15)

By using (3.3) with k = 1, we have

    4 ( L(x*, λ*) − L(x̃^1, λ̃^1) ) ≤ 2 ||λ^1 − λ*||^2_{H_1^{-1}} − 2 ||λ̃^1 − λ*||^2_{H_1^{-1}} − 2 ||λ^1 − λ̃^1||^2_{H_1^{-1}}.   (3.16)

Applying the identity

    2 ||a − c||^2 − 2 ||b − c||^2 − 2 ||b − a||^2 = ||a − c||^2 − ||(b − a) + (b − c)||^2

to the right-hand side of (3.16) with

    a := λ^1,   b := λ̃^1,   c := λ*,

we get

    4 ( L(x*, λ*) − L(x̃^1, λ̃^1) ) ≤ ||λ^1 − λ*||^2_{H_1^{-1}} − || 2λ̃^1 − λ^1 − λ* ||^2_{H_1^{-1}}.   (3.17)

Consequently, it follows from (3.15) and (3.17) that

    4 t_1^2 v_1 + ||u_1||^2_{H_1^{-1}} ≤ ||λ^1 − λ*||^2_{H_1^{-1}} = ||λ̃^0 − λ*||^2_{H_1^{-1}}.

Substituting this into (3.14), the assertion (3.13) is proved.

According to Theorem 3.6, for obtaining an ε-optimal solution (x̃^k, λ̃^k) of (1.3), in the sense that L(x*, λ*) − L(x̃^k, λ̃^k) ≤ ε, the number of iterations required by the proposed accelerated ALM is at most ⌈ √(C/ε) ⌉, where C = ||λ̃^0 − λ*||^2_{H_1^{-1}}.
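To see the O(1/k) and O(1/k^2) behaviors side by side, one can run the ALM and AALM sketches from the earlier sections on a common toy instance and print the objective residuals; this comparison is purely illustrative and relies on the same assumptions as before (f(x) = (1/2)||x − p||^2, X = R^n, H_k ≡ βI).

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 8))
    b = rng.standard_normal(3)
    p = rng.standard_normal(8)
    beta = 1.0
    lam_star = np.linalg.solve(A @ A.T, b - A @ p)
    x_star = p + A.T @ lam_star
    L_opt = 0.5 * np.linalg.norm(x_star - p) ** 2    # L(x*, lam*): Ax* = b, so the multiplier term vanishes

    def L(x, lam):
        return 0.5 * np.linalg.norm(x - p) ** 2 - lam @ (A @ x - b)

    def alm_pass(lam):                               # one pass of (3.1) with H = beta*I
        x = np.linalg.solve(np.eye(8) + beta * A.T @ A,
                            p + A.T @ lam + beta * A.T @ b)
        return x, lam - beta * (A @ x - b)

    lam_alm = np.zeros(3)                            # classical ALM state
    lam_acc = np.zeros(3)                            # AALM: lambda^1 = tilde-lambda^0 = 0
    lam_tp = np.zeros(3)                             # tilde-lambda^{k-1}
    t = 1.0
    for k in range(1, 31):
        x1, lam_alm = alm_pass(lam_alm)              # ALM iterate
        x2, lam_t = alm_pass(lam_acc)                # AALM, Step k
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        lam_acc = (lam_t
                   + (t / t_next) * (lam_t - lam_acc)
                   + ((t - 1) / t_next) * (lam_t - lam_tp))
        lam_tp, t = lam_t, t_next
        print(k, L_opt - L(x1, lam_alm), L_opt - L(x2, lam_t))

Both residual columns are nonnegative, and the accelerated column should shrink visibly faster, consistent with (2.9) versus (3.13).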
4 Conclusions

In this paper, we first show that the iteration-complexity of the classical augmented Lagrangian method (ALM) is O(1/k) for solving linearly constrained convex programming. Then, we show that the ALM can be accelerated by applying Nesterov's acceleration techniques, and the iteration-complexity of the resulting accelerated ALM is O(1/k^2). In the future, we will investigate (a) the complexity of the inexact ALM where the subproblems are solved approximately subject to certain criteria, as in [3]; and (b) the complexity of some ALM-based methods, e.g., the well-known alternating direction method for solving separable convex programming with linear constraints.

References

[1] D. P. Bertsekas, Constrained Optimization and Lagrange Multiplier Methods, Academic Press, New York, 1982.

[2] M. R. Hestenes, Multiplier and gradient methods, J. Optim. Theory Appl., 4 (1969), pp. 303-320.

[3] G. H. Lan and R. D. C. Monteiro, Iteration-complexity of first-order augmented Lagrangian methods for convex programming, manuscript, 2009.

[4] B. Martinet, Régularisation d'inéquations variationnelles par approximations successives, Rev. Française d'Inform. Recherche Opér., 4 (1970), pp. 154-159.

[5] A. S. Nemirovsky and D. B. Yudin, Problem Complexity and Method Efficiency in Optimization, Wiley-Interscience Series in Discrete Mathematics, John Wiley & Sons, New York, 1983.

[6] Y. E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k^2), Dokl. Akad. Nauk SSSR, 269 (1983), pp. 543-547.

[7] Y. E. Nesterov, Gradient methods for minimizing composite objective function, CORE report, 2007; available at http://www.ecore.be/dps/dp-933936.pdf

[8] J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.

[9] M. J. D. Powell, A method for nonlinear constraints in minimization problems, in Optimization, R. Fletcher, ed., Academic Press, New York, 1969, pp. 283-298.

[10] R. T. Rockafellar, Augmented Lagrangians and applications of the proximal point algorithm in convex programming, Math. Oper. Res., 1 (1976), pp. 97-116.

[11] R. T. Rockafellar, Monotone operators and the proximal point algorithm, SIAM J. Control Optim., 14 (1976), pp. 877-898.