Iteration-complexity of first-order penalty methods for convex programming


Iteration-complexity of first-order penalty methods for convex programming

Guanghui Lan    Renato D.C. Monteiro

July 24, 2008

Abstract

This paper considers a special but broad class of convex programming (CP) problems whose feasible region is a simple compact convex set intersected with the inverse image of a closed convex cone under an affine transformation. We study two first-order penalty methods for solving the above class of problems, namely: the quadratic penalty method and the exact penalty method. In addition to one or two gradient evaluations, an iteration of these methods requires one or two projections onto the simple convex set. We establish iteration-complexity bounds for these methods to obtain two types of near optimal solutions, namely: near primal and near primal-dual optimal solutions. Finally, we present variants, with possibly better iteration-complexity bounds than the aforementioned methods, which consist of applying penalty-based methods to the perturbed problem obtained by adding a suitable perturbation term to the objective function of the original CP problem.

Keywords: Convex programming, quadratic penalty method, exact penalty method, Lagrange multiplier

AMS 2000 subject classification: 90C25, 90C06, 90C22, 49M37

1 Introduction

The basic problem of interest in this paper is the convex programming (CP) problem

f* := inf{ f(x) : A(x) ∈ K*, x ∈ X },    (1)

where f : X → IR is a convex function with Lipschitz continuous gradient, X ⊆ R^n is a sufficiently simple closed convex set, A : R^n → R^m is an affine function, and K* denotes the dual cone of a closed convex cone K ⊆ R^m, i.e., K* := {s ∈ R^m : ⟨s, x⟩ ≥ 0, ∀x ∈ K}. For the case where the feasible region consists only of the set X, or equivalently A ≡ 0, Nesterov ([2, 4]) developed a method which finds a point x ∈ X such that f(x) − f* ≤ ε in at most O(ε^{-1/2})

(The work of both authors was partially supported by NSF Grants CCF-0430644 and CCF-0808863 and ONR Grant N00014-08-1-0033. Both authors are with the School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA, 30332-0205; emails: glan@isye.gatech.edu and monteiro@isye.gatech.edu.)

iterations. Moreover, each iteration of his method requires one gradient evaluation of f and the computation of two projections onto X. It is shown that his method achieves, uniformly in the dimension, the lower bound on the number of iterations for minimizing convex functions with Lipschitz continuous gradient over a closed convex set.

When A is not identically 0, Nesterov's optimal method can still be applied directly to problem (1), but this approach would require the computation of projections onto the feasible region X ∩ {x : A(x) ∈ K*}, which for most practical problems is as expensive as solving the original problem itself. In this paper, we are interested in alternative first-order methods for (1) whose iterations consist of, in addition to a couple of gradient evaluations, only projections onto the simple set X. More specifically, we present two penalty-based approaches for solving (1), namely: the quadratic and the exact penalty methods, and establish their iteration-complexity bounds for computing two types of near optimal solutions of (1). It is well-known that the penalty parameters of the penalization problems for the above penalty-based approaches must be chosen sufficiently large so that near optimal solutions of the penalization problems yield near optimal solutions for the original problem (1). Accordingly, we establish theoretical lower bounds on the penalty parameters that depend not only on the desired solution accuracies but also on the size ‖λ*‖ of the minimum norm Lagrange multiplier associated with the constraint A(x) ∈ K*. Theoretically, setting the penalty parameters to their corresponding lower bounds would yield the lowest provable iteration-complexity bounds, and hence would be the best strategy to pursue. But since ‖λ*‖ is not known a priori, we present alternative penalty-based approaches, based on simple guess-and-check procedures for these penalty parameters, whose iteration-complexity bounds are of the same order as the approach in which ‖λ*‖ is known a priori, or equivalently, in which the theoretical lower bound on the penalty parameter is available. Finally, we present variants of the aforementioned methods, with possibly better iteration-complexity bounds, which consist of applying penalty-based methods to the problem obtained by adding a suitable perturbation term to the objective function of the original CP problem.

The paper is organized as follows. In Section 2, we give a more detailed description of problem (1) and definitions of two types of approximate solutions of (1). Section 3 discusses some technical results that will be used in our analysis and reviews two first-order algorithms due to Nesterov [2, 4] for solving special classes of CP problems. In Section 4, we establish the iteration-complexity of quadratic penalty based methods for solving (1). More specifically, we present the iteration-complexity of the quadratic penalty method applied to (1) in Subsection 4.1, and of a variant, which consists of applying the quadratic penalty approach to the perturbed problem obtained by adding a suitable perturbation term to the objective function of (1), in Subsection 4.2. In Section 5, we establish the iteration-complexity of exact penalty based methods for solving (1). More specifically, we present the iteration-complexity of the exact penalty method applied directly to (1) in Subsection 5.1, and of a variant, which consists of applying the exact penalty method to the perturbed problem corresponding to (1), in Subsection 5.2.
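The guess-and-check idea just described can be summarized by the following generic sketch (not part of the original paper). The names penalty_method, is_near_optimal, N and t0 are hypothetical placeholders for, respectively, running the chosen penalty scheme with parameter ρ(t) for a prescribed number of iterations, testing the relevant termination criteria, the per-guess iteration budget, and an initial guess; the concrete instantiations appear as Search Procedure 1 and its variants in Sections 4 and 5.

    # Generic guess-and-check loop over the unknown quantity ||lambda*||.
    # All names here are illustrative placeholders (see the note above); the
    # point is only the doubling structure of the search over the guess t.
    def guess_and_check(penalty_method, is_near_optimal, N, t0):
        t = t0
        while True:
            x = penalty_method(t, max_iters=N(t))  # run with rho = rho(t)
            if is_near_optimal(x):                 # e.g. the tests in step 2 of Search Procedure 1
                return x, t
            t *= 2                                 # guess was too small; double it and retry

The analysis showing that the total cost of such a doubling scheme is of the same order as that of a single run with the ideal guess t = ‖λ*‖ is carried out via Lemma 17 and Proposition 24 below.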
1.1 Notation and terminology

We denote the p-dimensional Euclidean space by IR^p. Also, IR^p_+ and IR^p_{++} denote the nonnegative and the positive orthants of IR^p, respectively. In this paper, we use the notation R^p to denote a p-dimensional vector space endowed with an inner product ⟨·, ·⟩. The dual norm ‖·‖* associated with an arbitrary norm ‖·‖ on R^p is defined as

‖s‖* := max_x { ⟨s, x⟩ : ‖x‖ ≤ 1 },  ∀s ∈ R^p.

Note that if the norm ‖·‖ is the one induced by the inner product, i.e., ‖·‖ = ⟨·, ·⟩^{1/2}, then ‖·‖* = ‖·‖. Given a closed convex set C ⊆ R^p, we define the distance function d_C : R^p → IR to C with respect to ‖·‖ as d_C(u) := min{ ‖u − c‖ : c ∈ C } for every u ∈ R^p. It is well-known that this minimum is always achieved at some c ∈ C. Moreover, this minimizer is unique whenever ‖·‖ is an inner product norm. In such a case, we denote this unique minimizer by Π_C(u), i.e., Π_C(u) = argmin{ ‖u − c‖ : c ∈ C } for every u ∈ R^p. The support function of a set C ⊆ R^p is defined as σ_C(u) := sup{ ⟨u, c⟩ : c ∈ C }.

2 Problem of interest

Consider inner product spaces R^n and R^m endowed with arbitrary norms, both denoted simply by ‖·‖. In this paper, we consider the CP problem (1), where f : X → IR is a convex function with L_f-Lipschitz-continuous gradient, i.e.:

‖∇f(x̃) − ∇f(x)‖* ≤ L_f ‖x̃ − x‖,  ∀ x̃, x ∈ X.    (2)

We define the norm of the map A as being the operator norm of its linear part A₀ := A(·) − A(0), i.e.:

‖A‖ := ‖A₀‖ = max{ ‖A₀(x)‖ : ‖x‖ ≤ 1 } = max{ ‖A(x) − A(0)‖ : ‖x‖ ≤ 1 }.

The Lagrangian dual function and the value function associated with (1) are defined as

d(λ) := inf{ f(x) + ⟨λ, A(x)⟩ : x ∈ X },  λ ∈ −K,    (3)

v(u) := inf{ f(x) : A(x) + u ∈ K*, x ∈ X },  u ∈ R^m.    (4)

We make the following assumptions throughout the paper:

Assumption 1
A.1) the set X is bounded (and hence f* ∈ IR);
A.2) there exists a Lagrange multiplier for (1), i.e., a vector λ* ∈ −K such that f* = d(λ*).

Throughout this paper, we assume that the norm in R^m is induced by the inner product.

First note that x* is an optimal solution of (1) if, and only if, x* ∈ X, A(x*) ∈ K* and f(x*) ≤ f*. This observation leads us to our first definition of a near optimal solution x̄ ∈ X of (1), which essentially requires the primal infeasibility measure d_{K*}(A(x̄)) and the primal optimality gap [f(x̄) − f*]_+ to both be small.

Definition 1 For a given pair (ε_p, ε_o) ∈ IR_{++} × IR_{++}, x̄ ∈ X is called an (ε_p, ε_o)-primal solution for (1) if

d_{K*}(A(x̄)) ≤ ε_p  and  f(x̄) − f* ≤ ε_o.    (5)

One drawback of the above notion of near optimality of x̄ is that it says nothing about the size of [f(x̄) − f*]_−. We will now show that this quantity can be bounded as [f(x̄) − f*]_− ≤ ε_p ‖λ*‖, where λ* is an arbitrary Lagrange multiplier for (1). We start by stating the following result, whose proof is given in the Appendix.

Proposition 1 Let K ⊆ R^m be a closed convex cone. Then, the following statements hold:

a) d_{K*} = σ_C, where C := (−K) ∩ B(0, 1) and B(0, 1) := {u ∈ R^m : ‖u‖ ≤ 1};

b) for every u ∈ R^m and λ ∈ −K, we have ⟨u, λ⟩ ≤ ‖λ‖ d_{K*}(u).

As a consequence of the above result, we obtain the following technical inequality, which will be frequently used in our analysis.

Corollary 2 Let λ* be a Lagrange multiplier for (1). Then, for every x ∈ X, we have f(x) − f* ≥ −‖λ*‖ d_{K*}(A(x)).

Proof. It is well-known that our assumptions imply that v is a convex function such that λ* ∈ ∂v(0), and hence that

v(u) ≥ v(0) + ⟨λ*, u⟩ = v(0) − ⟨λ*, −u⟩ ≥ v(0) − ‖λ*‖ d_{K*}(−u),  ∀u ∈ R^m,

where the inequality follows from Proposition 1(b) and the fact that λ* ∈ −K. Now, let x ∈ X be given. Since x is clearly feasible for problem (4) with u = −A(x), the definition of v(·) in (4) implies that v(u) ≤ f(x). Hence,

f(x) − f* ≥ v(u) − v(0) ≥ −‖λ*‖ d_{K*}(−u) = −‖λ*‖ d_{K*}(A(x)).

Corollary 3 If x̄ ∈ X is an ε_p-feasible solution, i.e., if d_{K*}(A(x̄)) ≤ ε_p, then f(x̄) − f* ≥ −ε_p ‖λ*‖, where λ* is an arbitrary Lagrange multiplier for (1).

Another way of defining a near optimal solution of (1) is based on the following optimality conditions: x* ∈ X is an optimal solution of (1) and λ* ∈ −K is a Lagrange multiplier for (1) if, and only if, (x̄, λ̄) = (x*, λ*) satisfies

A(x̄) ∈ K*,  ⟨λ̄, A(x̄)⟩ = 0,  ∇f(x̄) + (A₀)*λ̄ ∈ −N_X(x̄),    (6)

where N_X(x̄) := {s ∈ R^n : ⟨s, x − x̄⟩ ≤ 0, ∀x ∈ X} denotes the normal cone of X at x̄. Based on this observation, we can introduce our second definition of a near optimal solution of (1).

Definition 2 For a given pair (ε_p, ε_d) ∈ IR_{++} × IR_{++}, (x̄, λ̄) ∈ X × (−K) is called an (ε_p, ε_d)-primal-dual solution of (1) if

d_{K*}(A(x̄)) ≤ ε_p,  ⟨λ̄, Π_{K*}(A(x̄))⟩ = 0,    (7)

∇f(x̄) + (A₀)*λ̄ ∈ −N_X(x̄) + B(ε_d),    (8)

where B(η) := {x ∈ R^n : ‖x‖ ≤ η} for every η ≥ 0.

In Sections 4 and 5, we will study the iteration-complexities of well-known penalty methods, namely the quadratic and the exact penalty methods, for computing the two types of near optimal solutions defined in this section.
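As a concrete illustration of Definition 1 (not part of the original paper), consider the self-dual case K = K* = IR^m_+ with the Euclidean norm, so that Π_{K*} is the componentwise positive part and d_{K*}(A(x̄)) = ‖A(x̄) − Π_{K*}(A(x̄))‖. The sketch below assumes A(x) = A0 x + b and that an estimate of f* is available; in the algorithms of Sections 4 and 5 the optimality gap is controlled by the analysis (e.g., via Corollary 3) rather than by evaluating f* directly. All names and the NumPy dependency are assumptions of the sketch.

    import numpy as np

    def primal_infeasibility(A0, b, x):
        # d_{K*}(A(x)) for K* = R^m_+ : distance from A(x) to the nonnegative orthant
        u = A0 @ x + b
        return np.linalg.norm(u - np.maximum(u, 0.0))

    def is_eps_primal(A0, b, f, f_star, x, eps_p, eps_o):
        # (eps_p, eps_o)-primal test of Definition 1 for this special case
        return primal_infeasibility(A0, b, x) <= eps_p and f(x) - f_star <= eps_o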

3 Technical results

This section discusses some technical results that will be used in our analysis and reviews two first-order algorithms due to Nesterov [2, 4] for solving special classes of CP problems. It consists of three subsections. The first one develops several technical results involving projected gradients. The second subsection reviews Nesterov's optimal method for solving a class of smooth CP problems. The third subsection reviews Nesterov's smooth approximation scheme for solving a class of non-smooth CP problems that can be reformulated as smooth convex-concave saddle point (i.e., min-max) problems.

3.1 Projected gradient and the optimality conditions

In this subsection, we consider an inner product space R^n endowed with the norm ‖·‖ associated with its inner product. We consider the CP problem

φ* := min_{x ∈ X} φ(x),    (9)

where X ⊆ R^n is a convex set and φ : X → IR is a convex function that has L_φ-Lipschitz-continuous gradient over X with respect to the norm ‖·‖. It is well-known that x* ∈ X is an optimal solution of (9) if and only if ∇φ(x*) ∈ −N_X(x*). Moreover, this optimality condition is in turn related to the projected gradient of the function φ over X, defined as follows.

Definition 3 Given a fixed constant τ > 0, we define the projected gradient of φ at x̄ ∈ X with respect to X as (see, for example, [3])

[∇φ(x̄)]^τ_X := (1/τ)[ x̄ − Π_X(x̄ − τ∇φ(x̄)) ],    (10)

where Π_X(·) is the projection map onto X defined in terms of the inner product norm (see Subsection 1.1).

The following proposition relates the projected gradient to the aforementioned optimality condition.

Proposition 4 Let x̄ ∈ X be given and define x̄⁺ := Π_X(x̄ − τ∇φ(x̄)). Then, for any given ε ≥ 0, the following statements hold:

a) ‖[∇φ(x̄)]^τ_X‖ ≤ ε if, and only if, ∇φ(x̄) ∈ −N_X(x̄⁺) + B(ε);

b) ‖[∇φ(x̄)]^τ_X‖ ≤ ε implies that ∇φ(x̄⁺) ∈ −N_X(x̄⁺) + B((1 + τ L_φ)ε).

Proof. To simplify notation, define v := [∇φ(x̄)]^τ_X. By (10) and well-known properties of the projection operator Π_X, we have

v = [∇φ(x̄)]^τ_X ⟺ x̄ − τv = Π_X(x̄ − τ∇φ(x̄))
⟺ ⟨x̄ − τ∇φ(x̄) − (x̄ − τv), y − (x̄ − τv)⟩ ≤ 0, ∀y ∈ X
⟺ ⟨v − ∇φ(x̄), y − (x̄ − τv)⟩ ≤ 0, ∀y ∈ X
⟺ ⟨v − ∇φ(x̄), y − x̄⁺⟩ ≤ 0, ∀y ∈ X
⟺ ∇φ(x̄) − v ∈ −N_X(x̄⁺),

from which statement a) clearly follows. To show b), assume that v ɛ. Then, we have x + x Π X x τ φ x)) x = x τv) x = τ v τɛ. which, together with the assumption that φ has L φ -Lipschitz-continuous gradient, implies that φ x + ) φ x) τl φ ɛ. Statement b) now follows immediately from the latter conclusion and statement a). The following lemma summarizes some interesting properties of the projected gradient. Lemma 5 Let x X be given and x + := Π X x τ φ x)). Denoting g ) := φ ), g X ) := φ )] τ X. We have: a) φ x + ) φ x) τ g X x) 2 /2 for any τ 1/L φ ; b) For any x X, there holds in particular, setting x = x, we obtain c) For any x X, there holds d) If τ = 1/L φ in definition 10), then where x Argmin x X φx). g x) g X x), x x + 0; 11) g x) g X x), g X x) 0; 12) φ x + ) φx) 1 + τl φ ) g X x) x + x. 13) φx) φx ) 1 2L φ g X x) 2, x X, 14) Proof. a) This statement is proved in p. 87 of [3]. b) Noting that x + = x τg X x) = Π X x τg x)), it follows from well-nown properties of the projection map Π X that x x τg X ), x τg x) x τg X x)) = x x τg X x)), τg X x) g x)) = x x +, τg X x) g x)) 0, x X, which clearly implies statement b). c) It follows from from the convexity of φ ), 11), the assumption that φ ) has L φ -Lipschitzcontinuous gradient, and definition 10) that φ x + ) φx) g x + ), x + x = g x) g X x), x + x + g X x), x + x + g x + ) g x), x + x g X x), x + x + g x + ) g x), x + x g X x), x + x + L φ x + x x + x = g X x), x + x + τl φ g X x) x + x 1 + τl φ ) g X x) x + x, x X. 6

d) Using the fact that Φx ) Φx), x X and the assumption that φ ) has L φ -Lipschitzcontinuous gradient, we conclude that φx ) φ x 1 ) g X x) φx) 1 gx), g X x) + 1 g X x) 2 L φ L φ 2L φ φx) 1 2L φ g X x) 2 for any x X, where the last inequality follows from 12). 3.2 Nesterov s Optimal Method In this subsection, we discuss Nesterov s smooth first-order method for solving a class of smooth CP problems. Our problem of interest is still the CP problem 9), which is assumed to satisfy the same assumptions mentioned in Subsection 3.1, except that now we allow to be an arbitrary norm. Moreover, we assume throughout our discussion that the optimal value φ of problem 9) is finite and that its set of optimal solutions is nonempty. Let h : X IR be a differentiable strongly convex function with modulus σ h > 0 with respect to, i.e., hx) h x) + h x), x x + σ h 2 x x 2, x, x X. 15) The Bregman distance d h : X X IR associated with h is defined as d h x; x) hx) l h x; x), x, x X, 16) where l h : R n X IR is the linear approximation of h defined as l h x; x) = h x) + h x), x x, x, x) R n X. We are now ready to state Nesterov s smooth first-order method for solving 9). We use the superscript sd in the sequence obtained by taing a steepest descent step and the superscript ag which stands for aggregated gradient ) in the sequence obtained by using all past gradients. Nesterov s Algorithm: end 0) Let x sd 0 = xag 0 X be given and set = 0 1) Set x = 2 +2 xag + +2 xsd and compute φx ) and φ x ). 2) Compute x sd +1, xag +1 ) X X as x sd +1 Argmin x ag +1 argmin 3) Set + 1 and go to step 1. { l φ x; x ) + L φ { L φ σ h d h x; x 0 ) + } 2 x x 2 : x X i=0 i + 1 2 [l φ x; x i )] : x X, 17) }. 18) The main convergence result established by Nesterov [4] regarding the above algorithm is sum- 7

marized in the following theorem. Theorem 6 The sequence {x sd } generated by Nesterov s optimal method satisfies φx sd ) φ 4L φ d h x ; x sd 0 ), 1, σ h + 1) where x is an optimal solution of 9). satisfying φx sd ) φ ɛ can be found in no more than d h x ; x sd 0 )L φ σ h ɛ 2 As a consequence, given any ɛ > 0, an iterate x sd X 19) iterations. The following result is as an immediate special case of Theorem 6. Corollary 7 Suppose that is a inner product norm and h : X R is chosen as h ) = 2 /2 in Nesterov s optimal method. Then, for any ɛ > 0, an iterate x sd X satisfying φxsd ) φ ɛ can be found in no more than x sd 0 x 2Lφ 20) ɛ iterations, where x is an optimal solution of 9). Proof. If hx) = x 2 /2, then 16) implies that d h x ; x sd 0 ) = xsd 0 x 2 /2. The corollary now follows from this bound and relation 19). Now assume that the objective function φ is strongly convex over X, i.e., for some µ > 0, φx) φ x), x x µ x x 2, x, x X. 21) Nesterov shows in Theorem 2.2.2 of [3] that, under the assumptions of Corollary 7, a variant of his optimal method finds a solution x X satisfying φx ) φ ɛ in no more than L φ µ log L φ x sd 0 x 2 22) ɛ iterations. The following result gives an iteration-complexity bound for Nesterov s optimal method that replaces the term logl φ x sd 0 x 2 /ɛ) in 22) with logµ x sd 0 x 2 /ɛ). The resulting iterationcomplexity bound is not only sharper but also more suitable since it maes it easier for us to compare the quality of the different bounds obtained in our analysis of first-order penalty methods. Theorem 8 Let ɛ > 0 be given and suppose that the assumptions of Corollary 7 hold and that the function φ is strongly convex with modulus µ. Then, the variant where we restart Nesterov s optimal method, with proximal function h ) = 2 /2, every 8L φ K := 23) µ 8

iterations finds a solution x X satisfying φ x) φ ɛ in no more than 8 L φ max{1, log Q} 24) µ iterations, where and x := argmin x X φx). Q := µ xsd 0 x 2 2ɛ 25) Proof. Denote D := x sd 0 x. First consider the case where Q 1. Clearly we have ɛ µd 2 /2, which, in view of Corollary 7, implies that the number of iterations is bounded by 2 L φ /µ and hence bounded by 24). We now show that bound 24) holds for the case when Q > 1. Let x j be the iterate obtained at the end of j 1)-th restart after K iterations of Nesterov s optimal method are performed. By Theorem 6 with h ) = 2 /2 and inequality 21), we have φx 1 ) φx ) 2L φ x sd 0 x 2 K 2 2L φ D 2 K 2 φx j ) φx ) 2L φ x j 1 x 2 K 2 4L φ [ φx j 1 µ K 2 ) φx ) ], j 2. Using the above relations inductively, we conclude that Setting j = log Q and observing that K 2j [ 8Lφ µ ) 1 2 ] 2j = 2 j 4Lφ µ φx j ) φx ) 4L φ) j D 2 2µ j 1 K 2j. ) j ) j 2 log Q 4Lφ µd2 4L φ ) j µ 2ɛ µ j = D2 4L φ ) j 2ɛ µ j 1, we conclude that φx j ) φx ) ɛ. Hence, the overall number of iterations is bounded by K log Q, or equivalently, by 24). It is interesting to compare the complexity bounds of Corollary 7 and Theorem 8. Indeed, it can be easily seen that there exists a threshold value Q > 0 such that the condition Q Q, where Q is defined in 25), implies that the bound 20) is always greater than or equal to the bound 24). Moreover, as Q goes to infinity, the ratio between 20) and 24) converges to infinity. Hence, when the function φ is strongly convex and Q satisfies 25), we will always use the bound 24) in our analysis. 3.3 Nesterov s smooth approximation scheme In this subsection, we discuss Nesterov s smooth approximation scheme for solving a class of nonsmooth CP problems which can be reformulated as special smooth convex-concave saddle point i.e., min-max) problems. 9

Our problem of interest in this subsection is still problem 9). We assume that X R n is a closed convex set and φ : X IR is a convex function given by φx) := ˆφx) + sup{ Ex, y ψy) : y Y }, x X, 26) where ˆφ : X IR is a convex function with L ˆφ-Lipschitz-continuous gradient with respect to an arbitrary norm on R n, Y R m is a compact convex set, φ : Y IR is a continuous convex function, and E : R n R m is an affine operator. Unless stated explicitly otherwise, we assume that the CP problem 9) referred to in this subsection is the one with its objective function given by 26). Also, we assume throughout our discussion that f is finite and that the set of optimal solutions of 9) is nonempty. The function φ defined as in 26) is generally non-differentiable but can be closely approximated by a function with Lipschitz-continuous gradient using the following construction due to Nesterov [4]. Let h : Y IR be a continuous strongly convex function with modulus σ h > 0 with respect to an arbitrary norm on R m satisfying min{ hy) : y Y } = 0. For some smoothness parameter η > 0, consider the following function φ η x) = ˆφx) + max y { } Ex, y ψy) η hy) : y Y. 27) The next result, due to Nesterov [4], shows that φ η is a function with Lipschitz-continuous gradient with respect to, whose closeness to φ depends linearly on the parameter η. Proposition 9 The following statements hold: a) For every x X, we have φ η x) φx) φ η x) + ηd h, where } D h := max { hy) : y Y ; 28) b) The function φ η x) has L η -Lipschitz-continuous gradient with respect to, where L η := L ˆφ + E 2. 29) ησ h We are now ready to state Nesterov s smooth approximation scheme to solve 9). Nesterov s smooth approximation scheme: end 0) Let ɛ > 0 be given. 1) Let η := ɛ/2d h) and consider the approximation problem φ η inf{φ η x) : x X}. 30) 2) Apply Nesterov s optimal method or its variant) to 30) and terminate whenever an iterate x sd satisfying φxsd ) φ ɛ is found. The following result describes the convergence behavior of the above scheme. 10

Theorem 10 Nesterov s smooth approximation scheme generates an iterate x sd φ ɛ in no more than satisfying φxsd ) 8D h x sd 0 ) 2D h E 2 + L ˆφ) 31) σ h ɛ σ hɛ iterations, where D h x sd 0 ) := max x X d h x; x sd 0 ) and D h is defined in 28). 4 The quadratic penalty method The goal of this section is to establish the iteration-complexity in terms of Nesterov s optimal method iterations) of the quadratic penalty method for solving 1). Specifically, we present the iteration-complexity of the classical quadratic penalty method in Subsection 4.1, and a variant of this method, for which a perturbation term is added into the objective function of 1), is discussed and analyzed in Subsection 4.2. The basic idea underlying penalty methods is rather simple, namely: instead of solving problem 1) directly, we solve certain relaxations of 1) obtained by penalizing some violation of the constraint Ax) K. More specifically, in the case of the quadratic penalty method, given a penalty parameter ρ > 0, we solve the relaxation { Ψ ρ := inf Ψ ρ x) := fx) + ρ x X 2 [d K Ax))]2}. 32) In order to guarantee that the function Ψ ρ is smooth, we assume throughout this section that the distance function d K is defined with respect to an inner product norm in R m. Note that in this case, we have =. We will now see that the objective function Ψ ρ of 32) has Lipschitz continuous gradient. We first state the following well-nown result which guarantees that the distance function has Lipschitz continuous gradient see for example Proposition 15 of [1] for its proof). Proposition 11 Given a closed convex set C R m, consider the distance function d C : R m IR to C with respect some inner product norm on R m. Then, the function ψ : R m R defined as ψu) = [d C u)] 2 is convex and its gradient is given by ψu) = 2[u Π C u)] for every u R m. Moreover, ψũ) ψũ) 2 ũ u for every u, ũ R m. As an immediate consequence of Proposition 11, we obtain the following result. Corollary 12 The function Ψ ρ has M ρ -Lipschitz continuous gradient where M ρ := L f + ρ A 2. Proof. The differentiability of Ψ ρ follows immediately from the assumption that f is differentiable and Proposition 11. Moreover, it easily follows from the chain rule that Ψ ρ x) = fx) + ρa 0 d2 K )Ax))/2, which together with 2) and Proposition 11, then imply that Ψ ρ x 1 ) Ψ ρ x 2 ) fx 1 ) fx 2 ) + ρ 2 A 0 d 2 K)Ax 1 )) A 0 d 2 K)Ax 2 )) L f x 1 x 2 + ρ 2 A 0 d 2 KAx 1 )) d 2 KAx 2 )) L f x 1 x 2 + ρ A 0 A 0 x 1 x 2 ) L f x 1 x 2 + ρ A 0 A 0 x 1 x 2 = M ρ x 1 x 2, 11

for every x₁, x₂ ∈ X, where the last equality follows from the fact that ‖A‖ = ‖A₀‖ = ‖A₀*‖. Therefore, problem (32) can be solved, for example, by Nesterov's optimal method or one of its variants (see for example [1, 2, 4]).

4.1 Quadratic penalty method applied to the original problem

In this subsection, we consider the application of the quadratic penalty method directly to the original problem (1), which consists of solving the penalized problem (32) for a suitably chosen penalty parameter ρ. This subsection contains two subsubsections. The first one considers the complexity of obtaining an (ε_p, ε_o)-primal solution of (1), while the second one discusses the complexity of computing an (ε_p, ε_d)-primal-dual solution of (1).

4.1.1 Computation of a primal approximate solution

In this subsection, we are interested in deriving the iteration-complexity involved in the computation of an (ε_p, ε_o)-primal solution of problem (1) by approximately solving the penalized problem (32) for some value of the penalty parameter ρ. Given an approximate solution x̄ ∈ X for the penalized problem (32), the following result provides bounds on the primal infeasibility and the optimality gap of x̄ with respect to (1).

Proposition 13 If x̄ ∈ X is a δ-approximate solution of (32), i.e., it satisfies

Ψ_ρ(x̄) − Ψ_ρ* ≤ δ,    (33)

then

d_{K*}(A(x̄)) ≤ (2/ρ)‖λ*‖ + √(2δ/ρ),    (34)

f(x̄) − f* ≤ δ,    (35)

where λ* is an arbitrary Lagrange multiplier associated with (1).

Proof. Using the fact that v(0) = f* ≥ Ψ_ρ*, Corollary 2 and assumption (33), we conclude that

δ ≥ Ψ_ρ(x̄) − Ψ_ρ* = f(x̄) + (ρ/2)[d_{K*}(A(x̄))]² − Ψ_ρ* ≥ f(x̄) − f* + (ρ/2)[d_{K*}(A(x̄))]² ≥ −‖λ*‖ d_{K*}(A(x̄)) + (ρ/2)[d_{K*}(A(x̄))]²,

which clearly implies both (34) and (35).

For the sake of future reference, we state the following simple result.

Lemma 14 Let α_i, i = 0, 1, 2, be given positive constants. Then, the only positive scalar ρ satisfying the equation α₂ ρ^{-1} + α₁ ρ^{-1/2} = α₀ is given by

ρ = [ (α₁ + √(α₁² + 4α₀α₂)) / (2α₀) ]².
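To make the preceding development concrete, here is a minimal sketch (not part of the original paper) of Nesterov's optimal method from Subsection 3.2, in the Euclidean setting of Corollary 7 (h(·) = ‖·‖²/2, σ_h = 1), applied to the quadratic penalty subproblem (32). It assumes the illustrative special case K = K* = IR^m_+ and a box X = {x : lo ≤ x ≤ hi}, so that Π_{K*} and Π_X are componentwise clamps, and it uses the gradient formula ∇Ψ_ρ(x) = ∇f(x) + ρ A₀*(A(x) − Π_{K*}(A(x))) from the proof of Corollary 12. All variable names and the NumPy dependency are assumptions of the sketch.

    import numpy as np

    def proj_box(x, lo, hi):
        return np.minimum(np.maximum(x, lo), hi)          # Pi_X for a box

    def grad_penalty(x, grad_f, A0, b, rho):
        u = A0 @ x + b                                    # A(x)
        residual = u - np.maximum(u, 0.0)                 # A(x) - Pi_{K*}(A(x)) for K* = R^m_+
        return grad_f(x) + rho * (A0.T @ residual)        # grad Psi_rho(x)

    def nesterov_penalty(grad_f, L_f, A0, b, rho, lo, hi, x0, num_iters):
        L = L_f + rho * np.linalg.norm(A0, 2) ** 2        # M_rho = L_f + rho ||A||^2
        x_sd, x_ag = x0.copy(), x0.copy()
        g_sum = np.zeros_like(x0)                         # running sum of (i+1)/2 * grad(x_i)
        for k in range(num_iters):
            x = 2.0 / (k + 2) * x_ag + float(k) / (k + 2) * x_sd
            g = grad_penalty(x, grad_f, A0, b, rho)
            x_sd = proj_box(x - g / L, lo, hi)            # "steepest descent" step (17)
            g_sum += (k + 1) / 2.0 * g
            x_ag = proj_box(x0 - g_sum / L, lo, hi)       # "aggregated gradient" step (18), h = ||.||^2/2
        return x_sd

By Theorem 15 below, running such a routine with ρ = ρ_p(t) for N_p(t) iterations yields an (ε_p, ε_o)-primal solution of (1) whenever t ≥ ‖λ*‖.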

With the aid of Proposition 13, we can now derive an iteration-complexity bound for Nesterov s method applied to the quadratic penalty problem 32) to compute an ɛ p, ɛ o )-primal solution of 1). Theorem 15 Let λ be an arbitrary Lagrange multiplier for 1) and let ɛ p, ɛ o ) IR ++ IR ++ be given. If ɛo + ) 2 ɛ o + 4ɛ p t ρ = ρ p t) := 36) 2ɛp for some t λ, then Nesterov s optimal method applied to problem 32) finds an ɛ p, ɛ o )-primal solution of 1) in at most { N p t) := 2 D h x sd 0 ) } L f 2 A 2t + + A σ h ɛ o ɛ p ɛ p ɛ o, 37) iterations, where D h x sd 0 ) is defined in Theorem 10. In particular, if is the inner product norm on R n and h ) = 2 /2, then the above bound reduces to { } 2DX L f 2 A 2t N p t) := + + A, 38) ɛ o ɛ p ɛ p ɛ o where D X := max x1,x 2 X x 1 x 2. Proof. Let x X satisfies 33) with δ = ɛ o. Proposition 13, the assumption that t λ, relation 36) and Lemma 14 with α 0 = ɛ p, α 1 = ɛ o and α 2 = t then imply that f x) f ɛ o and d K A x)) 2 ρ p t) λ + 2 ɛ o ρ p t) 2t ρ p t) + 2ɛ o ρ p t) = ɛ p, and hence that x is an ɛ p, ɛ o )-primal solution of 1). In view of Corollary 7, Nesterov s optimal method finds an approximate solution x as above in at most 2 D h x sd 0 ) M ρ σ h ɛ 0 iterations, where M ρ := L f +ρ A 2. Substituting the value of ρ given by 36) in this iteration bound and using some trivial majorization, we obtain bound 37). We now mae a few observations regarding Theorem 15. First, the choice of ρ given by 36) requires that t λ so as to guarantee that an ɛ o -approximate solution x of 32) satisfies d K A x)) ɛ p. Second, note that the iteration-complexity N p t) obtained in Theorem 15 achieves its minimum possible value over the interval t λ exactly when t = λ. However, since the quantity λ is not nown a priori, it is necessary to use a guess and chec procedure for t so as to develop a scheme for computing an ɛ p, ɛ o )-primal solution of 1) whose iteration-complexity has the same order of magnitude as the ideal one in which t = λ. We now describe the aforementioned guess and chec procedure for t. Search Procedure 1: 13

1) Set = 0 and define D h x sd 0 β 0 := 2 ) ) L f 2 A 2D h x sd 0 +, β 1 := 2 A ) [ ] max1, β0 ) 2, t 0 :=. 39) σ h ɛ o ɛ p σ h ɛ p ɛ o β 1 2) Set ρ = ρ p t ), and perform at most N p t ) iterations of Nesterov s optimal method applied to problem 32). If an iterate x is obtained such that 33) with δ = ɛ o and d K A x)) ɛ p are satisfied, then stop; otherwise, go to step 3; 3) Set t +1 = 2t, = + 1, and go to step 2. Before establishing the iteration-complexity of the above procedure, we state a technical result, namely Lemma 17, which will be used more than once in the paper. Lemma 16 states a simple inequality which is also used in a few times in our presentation. Lemma 16 For any scalars τ 0, x > 0, and α 0, we have τx + α τ + α) x. Lemma 17 Let a positive scalar p be given. Then, there exists a constant C = Cp) such that for any positive scalars β 1, β 2, t, we have β 1 + β 2 t p C β 1 + β 2 t p, 40) where t = t 0 2 for = 1,..., K, and ) maxβ1, 1) 1/p { )} t t 0 :=, K := max 0, log. 41) β 2 t 0 Proof. Assume first that t t 0. Due to the definition of K in 41), we have K = 0 in this case, and hence that β 1 + β 2 t p = β 1 + β 2 t p 0 = β 1 + maxβ 1, 1) 2β 1 + 2 4 β 1 4 β 1 + β 2 t p, where the second equality follows from the definition of t 0. Hence, in the case where t t 0, inequality 40) holds with C = 4. Assume now that t > t 0. By the definition of K in 41), we have K = log t/t 0 ), from which we conclude that K < log t/t 0 ) + 1, and hence that t 0 2 K+1 < 4 t. Using these relations, the inequality 14

log x = log x p )/p x p /p for any x > 0, and the definition of t 0 in 41), we obtain β 1 + β 2 t p K 1 + β 1 + β 2 t p 0 2p 1 + β 1 )1 + K) + β 2 t p 2 K+1)p 0 2 p 1 [ )] t 4 t) p 1 + β 1 ) 2 + log + β 2 t 0 2 p 1 [ 1 + β 1 ) 2 + 1 ) t p ] + 4p p t 0 2 p 1 β 2 t p [ 1 + β 1 ) 2 + 1 β 2 t p )] + 4p p maxβ 1, 1) 2 p 1 β 2 t p 21 + β 1 ) + 2 p β 2 t p + 4p 2 p 1 β 2 t p } 2p 4p 2 + max {2, + 2 p β 1 + β 2 t p ), 1 which, in view of Lemma 16, implies that 40) holds with C = Cp) := 2 + max {2, 2p + 4p 2 p 1 }. The following result gives the iteration-complexity of Search Procedure 1 for obtaining an ɛ p, ɛ o )- primal solution of 1). Corollary 18 Let λ be the minimum norm Lagrange multiplier for 1). Then, the overall number of iterations of Search Procedure 1 for obtaining an ɛ p, ɛ o )-primal solution of 1) is bounded by ON p λ )), where N p ) is defined in 37). Proof. In view of Theorem 15, the iteration count in Search Procedure 1 can not exceed K := max{0, log λ /t 0 ), and hence its overall number of inner i.e., Nesterov s optimal method) iterations is bounded by K N pt ) = K β 0 + β 1 t 1/2, where β 0 and β 1 are defined by 39). The result now follows from the definition of t 0 in 39) and 40) with p 1 = 1/2 and t = λ. 4.1.2 Computation of a primal-dual approximate solution In this subsection, we are interested in deriving the iteration-complexity involved in the computation of an ɛ p, ɛ d )-primal-dual solution of problem 1) by approximately solving the penalized problem 32) for certain value of the penalty parameter ρ. We assume throughout this subsection that the norms on R n and R m are the ones associated with their respective inner products. Given an approximate solution x X for the penalized problem 32), the following result shows that there exists a pair x +, λ) X K) depending on x that approximately satisfies the optimality conditions 6). 15

Proposition 19 If x X is a δ-approximate solution of 32), i.e., it satisfies 33), then the pair x +, λ) defined as x + := Π X x Ψ ρ x)/m ρ ), 42) λ := ρ[a x + ) Π K A x + ))] 43) is in X K) and satisfies the second relation in 7) and the relations d K A x + )) 1 δ ρ λ + ρ, 44) f x + ) + A 0 ) λ N X x + ) + B2 2M ρ δ), 45) where λ is an arbitrary Lagrange multiplier for 1). Proof. It follows from Lemma 5a) with φ = Ψ ρ and τ = M ρ that Ψ ρ x + ) Ψ ρ x), and hence that Ψ ρ x + ) Ψ ρ δ in view of 33). By Proposition 13, this implies that 44) holds. Moreover, by Lemma 5d) with φ = Ψ ρ and L φ = M ρ and assumption 33), we have Ψ ρ x)] 1/Mρ X 2M ρ [Ψ ρ x) Ψ ρ] 2M ρ δ. Relation 45) now follows from this inequality, Proposition 4b) with L φ = M ρ, τ = 1/M ρ and ɛ = 2M ρ δ, and the fact that Ψ ρ x + ) = f x + ) + A 0 ) λ, where λ is given by 43). Finally, λ K) and the second relation of 7) holds in view of the definition of λ and well-nown properties of the projection operator onto a cone. With the aid of Proposition 19, we can now derive an iteration-complexity bound for Nesterov s method applied to the quadratic penalty problem 32) to compute an ɛ p, ɛ d )-primal-dual solution of 1). Theorem 20 Let λ be an arbitrary Lagrange multiplier for 1) and let ɛ p, ɛ d ) IR ++ IR ++ be given. If ρ = ρ pd t) := 1 t + ɛ ) d 46) ɛ p 8 A for some t λ, then Nesterov s optimal method with h ) = 2 /2 applied to problem 32) finds an ɛ p, ɛ d )-primal-dual solution of 1) in at most ) L f N pd t) := 4D X + A 2 t + A 47) ɛ d ɛ p ɛ d 8ɛp iterations, where D X is defined in Theorem 15. Proof. Let x X be a δ-approximate solution of 32) where δ := ɛ 2 d /8M ρ). Noting that δ ɛ 2 d /8ρ A 2 ) in view of the fact that M ρ := L f + ρ A 2 ρ A 2, we conclude from Proposition 19 that the pair x +, λ) defined as x + := Π X x Ψ ρ x)/m ρ ) and λ := ρ[a x + ) Π K A x + ))] satisfies f x + ) + A λ N X x + ) + Bɛ d ) and d K A x + )) 1 ρ λ + δ ρ 1 λ + ρ 16 ɛ d 8 A ) ɛ p,

where the last inequality is due to 46) and the assumption that t λ. We have thus shown that x +, λ) is an ɛ p, ɛ d )-primal-dual solution of 1). In view of Corollary 7, Nesterov s optimal method finds an approximate solution x as above in at most ) ) 2DX 1/2 Mρ 8M 2 1/2 = ρ δ 2DX ɛ 2 d = M ρ 4D X = ɛ d L f + ρ A 4D 2 X iterations. Substituting the value of ρ given by 36) in the above bound, we obtain bound 37). The following result states the iteration-complexity of a guess and chec procedure based on Theorem 20. Its proof is based on similar arguments as those used in the proof of Corollary 18. Corollary 21 Let λ be the minimum norm Lagrange multiplier for 1). Consider the guess and chec procedure similar to Search Procedure 1 where the functions N p and ρ p are replaced by the functions N pd and ρ pd defined in Theorem 20, δ is set to ɛ 2 d /8M ρ), and t 0 is set to [maxβ 0, 1)/β 1 ] with ) L f β 0 = 4D X + A, β 1 = 4D X A 2. ɛ d 8ɛp ɛ p ɛ d Then, the overall number of iterations of this guess and chec procedure for obtaining an ɛ p, ɛ d )- primal-dual solution of 1) is bounded by ON pd λ )), where N pd ) is defined in 47). 4.2 Quadratic penalty method applied to a perturbed problem Throughout this subsection, we assume that the norm on R n is the one associated with its inner product. Recall that throughout Section 4, we are also assuming that the norm on R m is the one associated with its inner product.) Consider the perturbation problem f γ := min{f γ x) := fx) + γ 2 x x 0 2 : Ax) K, x X}, 48) where x 0 is a fixed point in X and γ > 0 is a pre-specified perturbation parameter. It is well-nown that if γ is sufficiently small, then an approximate solution of 48) will also be an approximate solution of 1). In this subsection, we will derive the iteration-complexity in terms of Nesterov s optimal method iterations) of computing an approximate solution of 1) by applying the quadratic penalty approach to the perturbation problem 48) for a conveniently chosen perturbation parameter γ > 0. The subsection is divided into two subsubsections: the computation of a primal approximate solution of 1) is treated in the first subsubsection while the computation of a primal-dual approximate solution of 1) is treated in the second one. 4.2.1 Computation of a primal approximate solution The quadratic penalty problem associated with 48) is given by Ψ ρ,γ := min x X { Ψ ρ,γ x) := fx) + γ 2 x x 0 2 + ρ 2 d K Ax))2}. 49) ɛ d 17
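Since Ψ_{ρ,γ} adds the γ-strongly convex term (γ/2)‖x − x₀‖² to Ψ_ρ, the restart variant of Nesterov's optimal method from Theorem 8 (used below with µ = γ) becomes applicable. The following minimal sketch (not part of the original paper) shows only the restart wrapper; run_nesterov is a hypothetical handle to any routine performing a given number of iterations of Nesterov's optimal method on Ψ_{ρ,γ} from a given starting point, e.g., the sketch after Lemma 14 with its gradient replaced by ∇Ψ_{ρ,γ}(x) = ∇Ψ_ρ(x) + γ(x − x₀).

    import math

    def restarted_nesterov(run_nesterov, L, mu, x_start, num_cycles):
        # Restart scheme of Theorem 8: run K = ceil(sqrt(8 L / mu)) iterations of
        # Nesterov's optimal method, restart from the last x^sd iterate, and repeat.
        # num_cycles plays the role of max{1, ceil(log Q)} in (24)-(25).
        K = math.ceil(math.sqrt(8.0 * L / mu))
        x = x_start
        for _ in range(num_cycles):
            x = run_nesterov(x, K)   # warm-start each cycle at the previous output
        return x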

It can be easily seen that the function Ψ ρ,γ has M ρ,γ -Lipschitz continuous gradient where M ρ,γ := L f + ρ A 2 + γ. 50) The following simple lemma relates the optimal values of the perturbation problem 48) and the original problem 1). Lemma 22 Let f, Ψ ρ, f γ, and Ψ ρ,γ be the optimal values defined in 1), 32), 48), and 49), respectively. Then, where D X := max x1,x 2 X x 1 x 2. 0 fγ f γdx 2 /2 51) 0 Ψ ρ,γ Ψ ρ γdx 2 /2, 52) Proof. The first inequalities in both relations 51) and 52) follow immediately from the fact that f γ f and Ψ ρ,γ Ψ ρ. Now, let x and x γ be optimal solutions of 1) and 48), respectively. Then, f γ = fx γ) + γ 2 x γ x 0 2 fx ) + γ 2 x x 0 2 f + γd2 X 2, from which the second inequality in 51) immediately follows. The second inequality in 52) can be similarly shown. The following theorem gives the iteration-complexity of computing an ɛ p, ɛ o )-primal solution of 1) by applying the quadratic penalty approach to the perturbation problem 48) with an appropriate choice of perturbation parameter γ > 0. Theorem 23 Let λ γ be an arbitrary Lagrange multiplier for 48) and let ɛ p, ɛ o ) IR ++ IR ++ be given. If ρ = ρ p t) := 4t and γ = ɛ o ɛ p DX 2, 53) for some t λ γ, then the variant of Nesterov s optimal method of Theorem 8 with µ = γ and L φ = M ρ,γ applied to problem 49) finds an ɛ p, ɛ o )-primal solution of 1) in at most Ñ p t) := 2 2D X L f ɛ o + 2 A iterations, where D X is defined in Theorem 15. t ɛ p ɛ o ) + 1 { max 1, log ɛ } o tɛ p Proof. Assume that ρ and γ are given by 53) for some t λ γ. Let δ := min{ɛ o /4, ɛ p t/2} and x X be a δ-approximate solution of 49), i.e., Ψ ρ,γ x) Ψ ρ,γ δ. Then, Proposition 13 with f = f γ, Ψ ρ = Ψ ρ,γ and ρ = ρ p t) and the assumption that t λ γ imply that d K A x)) 2 2δ ρ p t) λ γ + ρ p t) ɛ p λ γ + 2t tɛ p 4t/ɛ p ɛ p 2 + ɛ p 2 = ɛ p 54) 18

and f γ x) f γ δ ɛ o /4. The later inequality together with 51) and 53) in turn imply that f x) f f γ x) f [f γ x) f γ ] + [f γ f ] ɛ o 4 + ɛ o 2 < ɛ o. We have thus proved that x is an ɛ p, ɛ o )-primal solution of 1). Now, using Theorem 8 with φ = Ψ ρ,γ, µ = γ and ɛ = δ, and noting that the definition of γ in 53) and the definition of δ imply that Q = µ xsd 0 x 2 = γ xsd 0 x 2 { γd2 X ɛ o = 2ɛ 2δ 2δ min{ɛ o /2, ɛ p t} = max 2, ɛ } o, ɛ p t we conclude that the number of iterations performed by Nesterov s optimal method with the restarting feature) for finding a δ-approximate solution x as above is bounded by { 8M ρ,γ 8M ρ,γ log Q = max 1, log ɛ } o. γ γ tɛ p Now, using relations 50) and 53) and some trivial majorization, we easily see that the latter bound is majorized by 54). Note that the complexity bound 54) derived in Theorem 23 is guaranteed only under the assumption that t λ γ, where λ γ is the minimum norm Lagrange multiplier for 48). Since the bound 54) is a monotonically increasing function of t, the ideal theoretical) choice of t would be to set t = λ γ. Without assuming any nowledge of this Lagrange multiplier, the following result shows that a guess and chec procedure similar to Search Procedure 1 still has an OÑp λ γ )) iterationcomplexity bound, where Ñp ) is defined in 54). Before establishing the iteration-complexity of this guess and chec procedure, we state the following technical result which substantially generalizes the one stated in Lemma 17. The proof of this result is given in the Appendix. Proposition 24 For some positive integer L, let positive scalars p 1, p 2,, p L be given. Then, there exists a constant C = Cp 1,, p L ) such that for any positive scalars β 0, β 1,, β L, ν, and t, we have β 0 + βl t p ) { } l max 1, log νt C β 0 + β l β l t p l ) { max 1, log ν t }, 55) where { )} ) t maxβ0, 1) 1/pl K := max 0, log, t 0 := min, t = t 0 2, = 1,..., K. 56) t 0 1lL In particular, if ν = t, then 55) implies that β 0 + βl t p ) l C β 0 + β l t p l ). 57) We are now ready to discuss the iteration-complexity bound of the above-mentioned guess and chec procedure. 19

Corollary 25 Let λ γ be the minimum norm Lagrange multiplier for 48). Consider the guess and chec procedure similar to Search Procedure 1 where the functions N p and ρ p are replaced by the functions Ñp and ρ p defined in Theorem 23, δ is set to δ := min{ɛ o /4, ɛ p t /2}, and t 0 is set to [maxβ 0, 1)/β 1 ] 2 with β 0 = 2 L f 2D X + 1, β 1 = 4 2DX A. 58) ɛ o ɛp ɛ o Then, the overall number of iterations of this guess and chec procedure for obtaining an ɛ p, ɛ o )- primal-dual solution of 1) is bounded by OÑp λ γ )), where Ñp ) is defined in 54). Proof. In view of Theorem 23, the iteration count in Search Procedure 1 can not exceed K := max{0, log λ γ /t 0 ), and hence its overall number of inner i.e., Nesterov s optimal method) iterations is bounded by K Ñpt ) = K β 0 + β 1 t 1/2 max{1, logν/t )}, where β 0 and β 1 are defined by 58), and ν = ɛ 0 /ɛ p. The result now follows from the definition of t 0 and Proposition 24 with L = 1, p 1 = 1/2, and t = λ γ. It is interesting to compare the two functions N p t) and Ñpt) defined in Theorems 15 and 23, respectively. Indeed, both functions up to some constant factor) are comparable for the case when r t := ɛ o /tɛ p ) = O1). Otherwise, if r t = Ω1) and, we further assume that the term involving t dominates the other terms in the right hand side of 54), i.e., 2 L f 2D X + 1 = O 4 ) t 2D X A, ɛ o ɛ p ɛ o then ) Ñ p t) log N p t) O1) rt, rt for some universal constant O1). Hence, in the latter case, we conclude that the function Ñpt) is asymptotically smaller than the function N p t). 4.2.2 Computation of a primal-dual approximate solution In this subsection, we derive the iteration-complexity of computing a primal-dual approximate solution of 1) by applying the quadratic penalty method to the perturbation problem 48). We will show that the resulting approach has a substantially better iteration-complexity than the one discussed in Subsection 4.1.2, which consists of applying the quadratic penalty method directly to the original problem 1). Theorem 26 Let λ γ be an arbitrary Lagrange multiplier for 48) and let ɛ p, ɛ d ) IR ++ IR ++ be given. Also assume that ρ = ρ pd t) := 1 ) ɛ t + d, γ = ɛ d, 59) ɛ p 32 A 2D X for some t λ γ. Then, the variant of Nesterov s optimal method of Theorem 8 with µ = γ and L φ = M ρ,γ applied to problem 49) finds an ɛ p, ɛ d )-primal-dual solution of 1) in at most Ñ pd t) := 8St) 2 log2st)), 60) 20

where D X is defined in Theorem 15 and ) ] 1 L f St) := [2D X + A 2 1/2 2D X + 1 + A t 1/2. 61) ɛ d 32ɛp ɛp ɛ d Proof. Define δ := ɛ 2 d /32M ρ,γ) and let x X be a δ-approximate solution for 49), i.e., Ψ ρ,γ x) Ψ ρ,γ δ. It then follows from Proposition 19 with Ψ ρ = Ψ ρ,γ, f = f γ, and M ρ = M ρ,γ that the pair x +, λ) defined as x + := Π X x Ψ ρ,γ x)/m ρ,γ ) and λ := ρ[a x + ) Π K A x + ))] satisfies f γ x + ) + A λ N X x + ) + B2 2M ρ,γ δ) = N X x + ) + Bɛ d /2). This together with the fact that γ x + x 0 γd X ɛ d /2 then imply that f x + ) + A λ = [ f γ x + ) γ x + x 0 )] + A λ γ x + x 0 ) N X x + ) + Bɛ d /2) N X x + ) + Bɛ d ). Moreover, noting that δ := ɛ 2 d /32M ρ,γ) ɛ 2 d /32ρ A 2 ), we conclude that Proposition 19 also implies that d K A x + )) 1 δ ρ λ γ + ρ 1 ) λ ɛ d ρ γ + ɛ p, 32 A where the last inequality is due to 59) and the assumption that t λ γ. We have thus shown that x +, λ) is an ɛ p, ɛ d )-primal-dual solution of 1). Now, using Theorem 8 with φ = Ψ ρ,γ, µ = γ and ɛ = δ, and noting that the definition of δ and the definition of γ in 59) imply that Q = µ xsd 0 x 2 2ɛ = γ xsd 0 x 2 2δ γd2 X 2δ = γd 2 X ɛ 2 d /16M ρ,γ) = 4M ρ,γ, γ we conclude that the number of iterations performed by Nesterov s optimal method with the restarting feature) for finding a δ-approximate solution x as above is bounded by ) M ρ,γ M ρ,γ 4Mρ,γ 8 log Q = 8 log. 62) γ γ γ Now, using 50) and the definitions of ρ and γ in 53), we easily see that the latter bound is majorized by 60). Note that the complexity bound 60) derived in Theorem 26 is guaranteed only under the assumption that t λ γ, where λ γ is the minimum norm Lagrange multiplier for 48). Since the bound 60) is a monotonically increasing function of t, the ideal theoretical) choice of t would be to set t = λ γ. Without assuming any nowledge of this Lagrange multiplier, the following result shows that a guess and chec procedure similar to Search Procedure 1 still has an OÑpd λ γ )) iteration-complexity bound, where Ñpd ) is defined in 60). Corollary 27 Let λ γ be the minimum norm Lagrange multiplier for 48). Consider the guess and chec procedure similar to Search Procedure 1 where the functions N p and ρ p are replaced by the functions Ñpd and ρ pd defined in Theorem 26, Nesterov s optimal method is replaced by its variant 21

of Theorem 8 with µ = γ and L φ = M ρ,γ see step 2 of Search Procedure 1), δ is set to ɛ 2 d /32M ρ,γ), and t 0 is set to [maxβ 0, 1)/β 1 ] 2 with β 0 = 8 ) ] 1/2 L f [2D X + A + 1, β 1 = 8 2D 1/2 X A ɛ d 32ɛp ɛ p ɛ d ) 1/2. 63) Then, the overall number of iterations of this guess and chec procedure for obtaining an ɛ p, ɛ d )- primal-dual solution of 1) is bounded by OÑpd λ γ )), where Ñpd ) is defined in 60). Proof. In view of Theorem 26, the iteration count of the above guess and chec see Search Procedure 1) for obtaining an ɛ p, ɛ d )-primal-dual solution of 1) can not exceed K := max{0, log λ /t 0 ), and hence its overall number of inner i.e., Nesterov s optimal method) iterations is bounded by Ñ pd t ) = 8St ) 2 log2st )) 2 log2st K )) 8St ). 64) Now, the definition of t 0 and relation 40) with p = 1/2 and t = λ γ imply that 8St ) = β 0 + β 1 t 1/2 = O β 0 + β 1 λ γ 1/2) = O 8S λ γ ) ), where β 0 and β 1 are defined by 63). Moreover, using the fact that t K 2 λ γ, we easily see that 2 log2st K )) = O 2 log2s λ γ )) ). Substituting the last two bounds into 64), we then conclude that K Ñpdt ) = OÑpd λ γ )), and hence that the corollary holds. It is interesting to compare the functions N pd t) and ˆN pd t) defined in Theorems 20 and 26, respectively. It follows from 47), 60), and 61) that Ñ pd t) N pd t) 8St) 2 log2st))) St)) 2 1 log St) O1) St) 1, 65) where O1) denotes an absolute constant. Hence, when St) is large, the bound Ñpdt) can be substantially smaller than the bound N pd t). Note that we can not use the previous observation to compare the iteration-complexity of Corollary 21 with that obtained in Corollary 27 since the first one is expressed in terms of λ and the later in terms of λ γ. However, if λ γ = O λ ), then it can be easily seen that 65) implies that Ñ pd λ γ ) N pd λ ) O1) log S λ ) S λ ) 1. Hence, the first complexity is better than the second one whenever λ γ = O λ ) and S λ ) is sufficiently large. The following result describes an alternative way of choosing the penalty parameter ρ in subproblem 49) as a function of t, but requires that t λ instead of t λ γ. 22

Theorem 28 Let λ be an arbitrary Lagrange multiplier for 1) and let ɛ p, ɛ d ) IR ++ IR ++ be given. Assume that ɛd D X /2 + ) 2 ɛ d D X /2) + 4αt)ɛ p ρ = ˆρ pd t) :=, γ = ɛ d, 66) 2ɛ p 2D X where αt) := t+ɛ d / 32 A ) for some t λ. Then, the variant of Nesterov s optimal method of Theorem 8 with µ = γ and L φ = M ρ,γ applied to problem 49) finds an ɛ p, ɛ d )-primal-dual solution of 1) in at most ˆN 8Ŝt) pd t) := 2 log2ŝt)), 67) where D X is defined in Theorem 15 and 2L f D X Ŝt) := + A ɛ d D X + ɛ p 2tD X ɛ p ɛ d ) + A DX 8 1/4 ɛ p + 1. 68) Proof. Define δ := ɛ 2 d /32M ρ,γ) and let x X be a δ-approximate solution for 49), i.e., Ψ ρ,γ x) Ψ ρ,γ δ. As shown in the proof of Theorem 26, the pair x +, λ) defined as x + := Π X x Ψ ρ,γ x)/m ρ,γ ) and λ := ρ[a x + ) Π K A x + ))] satisfies f x + ) + A λ N X x + ) + Bɛ d ). Moreover, it follows from Lemma 5a) with φ = Ψ ρ,γ and τ = M ρ,γ that Ψ ρ,γ x + ) Ψ ρ,γ x), and hence that Ψ ρ,γ x + ) Ψ ρ,γ Ψ ρ,γ x) Ψ ρ,γ δ. This observation together with 32), 49), 52) and 66) then imply that Ψ ρ x + ) Ψ ρ = Ψ ρ,γ x + ) γ 2 x+ x 0 2 Ψ ρ [Ψ ρ,γ x + ) Ψ ρ,γ] + [Ψ ρ,γ Ψ ρ] δ + γd 2 X ɛ 2 d 32ρ A 2 + ɛ dd X, 2 where the last inequality follows from the definition of δ and the fact that M ρ ρ A 2. The above inequality together with Proposition 13, the assumption that t λ, relation 66) and Lemma 14 with α 0 = ɛ p, α 1 = ɛ d D X /2 and α 2 = t + ɛ d / 32 A ) =: αt) then imply that d K A x + )) 1 ρ λ + ɛ 2 d 32ρ 2 A 2 + D Xɛ d 2ρ 1 ρ t + ɛ d 32 A ) + D X ɛ d 2ρ We have thus shown that x +, λ) is an ɛ p, ɛ d )-primal-dual solution of 1). Now, using Theorem 8 with φ = Ψ ρ,γ, µ = γ and ɛ = δ, and noting that the definition of δ and the definition of γ in 66) imply that Q = µ xsd 0 x 2 2ɛ = γ xsd 0 x 2 2δ γd2 X 2δ γd 2 X = ɛ 2 d /16M ρ,γ) = 4M ρ,γ, γ we conclude that the number of iterations performed by Nesterov s optimal method with the restarting feature) for finding a δ-approximate solution x as above is bounded by 62). Since relations 50) and 66) and the definition of αt) imply that M ρ,γ γ = L f + ρ A 2 + γ γ L f ρ γ + A γ + 1 = 23 2L f D X ɛ d = ɛ p. ρ + A γ + 1

and ρ γ ) ɛd D X /2 αt) 2DX + = D X + ɛ p ɛ p ɛ d ɛ p ) D X 2tD X + +, ɛ p ɛ p ɛ d D X 8ɛp A t + ɛ d / 32 A ) 2DX ɛ p ɛ d it follows that 62) can be bounded by 67). The following result states the iteration-complexity of a guess and chec procedure based on Theorem 28. Its proof is based on similar arguments as those used in the proof of Corollary 27. Corollary 29 Let λ be the minimum norm Lagrange multiplier for 1). Consider the guess and chec procedure similar to Search Procedure 1 where the functions N p and ρ p are replaced by the functions ˆN pd and ˆρ pd defined in Theorem 26, Nesterov s optimal method is replaced by its variant of Theorem 8 with µ = γ and L φ = M ρ,γ see step 2 of Search Procedure 1), δ is set to ɛ 2 d /32M ρ,γ), and t 0 is set to [maxβ 0, 1)/β 1 ] 2 with β 0 = 8 2L f D X ɛ d + A D X ɛ p + ) A D X 2D X + 1, β 1 = 8 A. 8ɛp ɛ p ɛ d Then, the overall number of iterations of this guess and chec procedure for obtaining an ɛ p, ɛ d )- primal-dual solution of 1) is bounded by O ˆN pd λ )), where ˆN pd ) is defined in 67). It is interesting to compare the functions N pd t) and ˆN pd t) defined in Theorems 20 and 26, respectively. When the second term in the right hand side of 68) is dominated by the other terms, i.e., then it can be easily seen that A D X ɛ p = O A t 1/2 D 1/2 X + ɛp ɛ d L f D X ɛ d ˆN pd t) N pd t) O1)Ñpdt) log St) O1) N pd t) St) 1. ), 69) Hence, if 69) holds and St) is large, then ˆN pd t) can be considerably smaller than N pd t). It is also interesting to observe when ɛ p = ɛ d = ɛ, the dependence on ɛ of the function ˆN pd is O1/ɛ) log1/ɛ) while that of N pd is O1/ɛ 2 ). 5 The exact penalty method The goal of this section is to establish the iteration-complexity in terms of Nesterov s optimal method iterations) of first-order exact penalty methods for computing a ɛ p -primal solution of 1). It consists of two subsections. The first one considers the complexity of a first-order exact penalty method applied directly to the original problem 1) while the second one deals with the complexity of a first-order exact penalty method applied to the perturbation problem 48) for some suitably chosen perturbation parameter θ. 24