constrained convex optimization, level-set technique, first-order methods, com-

Size: px

Start display at page:

Download "constrained convex optimization, level-set technique, first-order methods, com-"

Eugenia Hopkins
6 years ago
Views:

1 A LEVEL-SET METHOD FOR CONVEX OPTIMIZATION WITH A FEASIBLE SOLUTION PATH QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI Abstract. Large-scale constrained convex optimization problems arise in several application domains. First-order methods are good candidates to tackle such problems due to their low iteration complexity and memory requirement. The level-set framework extends the applicability of first-order methods to tackle problems with complicated convex objectives and constraint sets. Current methods based on this framework either rely on the solution of challenging subproblems or do not guarantee a feasible solution, especially if the procedure is terminated before convergence. We develop a level-set method that finds an ɛ-relative optimal and feasible solution to a constrained convex optimization problem with a fairly general objective function and set of constraints, maintains a feasible solution at each iteration, and only relies on calls to first-order oracles. We establish the iteration complexity of our approach, also accounting for the smoothness and strong convexity of the objective function and constraints when these properties hold. The dependence of our complexity on ɛ is similar to the analogous dependence in the unconstrained setting, which is not known to be true for level-set methods in the literature. Nevertheless, ensuring feasibility is not free. The iteration complexity of our method depends on a condition number, while existing level-set methods that do not guarantee feasibility can avoid such dependence. 9 0 Key words. plexity analysis constrained convex optimization, level-set technique, first-order methods, com AMS subject classifications. 90C5, 90C30, 90C5. Introduction. Large scale constrained convex optimization problems arise in several business, science, and engineering applications. A commonly encountered form for such problems is f := min x X fx s.t. g i x 0, i =,..., m, where X R n is a closed and convex set, and f and g i, i =,..., m, are convex real functions defined on X. Given ɛ > 0, a point x ɛ is ɛ-optimal if it satisfies fx ɛ f ɛ and ɛ-feasible if it satisfies max i=,...,m g i x ɛ ɛ. In addition, given 0 < ɛ, and a feasible and suboptimal solution e X, a point x ɛ is ɛ-relative optimal with respect to e if it satisfies fx ɛ f /fe f ɛ. Level-set techniques [, 3,, 7, 7, 8] provide a flexible framework for solving the optimization problem -. Specifically, this problem can be reformulated as a convex root finding problem L t = 0, where L t = min x X max{fx t; g i x, i =,..., m} see Lemaréchal et al. [7]. Then a bundle method [6, 9, 4] or an accelerated gradient method [0, Section.3.5] can be applied to handle the minimization subproblem to estimate L t and facilitate the root finding process. In general, these methods solve the subproblem using a nontrivial quadratic program that needs to be solved at each iteration. An important variant of - arising in statistics, image processing, and signal processing occurs when fx takes a simple form, such as a gauge function, and there is a single constraint with the structure ρax b λ, where ρ is a convex function, A is a matrix, b is a vector, and λ is a scalar. In this setting, Van Den Berg and Tippie College of Business, University of Iowa, USA, qihang-lin@uiowa.edu College of Business Administration, University of Illinois at Chicago, USA, selvan@uic.edu College of Business Administration, University of Illinois at Chicago, USA, nazad@uic.edu

2 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI Friedlander [7, 8] and Aravkin et al. [] use a level-set approach to swap the roles of the objective function and constraints in - to obtain the convex root finding problem vt = 0 with vt := min x X,fx t ρax b λ. Harchaoui et al. [] proposed a similar approach. The root finding problem arising in this approach is solved using inexact newton and secant methods in [7, 8, ] and [], respectively. Since fx is assumed to be simple, projection onto the feasible set {x R n x X, fx t} is possible and the subproblem in the definition of vt can be solved by various first-order methods. Although the assumption on the objective function is important, the constraint structure is not limiting because can be obtained by choosing A to be an identity matrix, b = 0, λ = 0, and ρx = max i=,...,m g i x. The level-set techniques discussed above find an ɛ-feasible and ɛ-optimal solution to -, that is, the terminal solution may not be feasible. This issue can be overcome when the infeasible terminal solution is also super-optimal and a strictly feasible solution to the problem exists. In this case, a radial-projection scheme [, 5] can be applied to obtain a feasible and ɛ-relative optimal solution. Super-optimality of the terminal solution holds if the constraint fx t is satisfied at termination. This condition can be enforced when projections on to the feasible set with this constraint are computationally cheap, which typically requires fx to be simple e.g. see []. Otherwise, such a guarantee does not exist. Building on this rich literature, we develop a feasible level-set method that i employs a first-order oracle to solve subproblems and compute a feasible and ɛ-relative optimal solution to - with a general objective function and set of constraints while also maintaining feasible intermediate solutions at each iteration i.e., a feasible solution path, and ii avoids potentially costly projections. Our focus on maintaining feasibility is especially desirable in situations where level-set methods need to be terminated before convergence, for instance due to a time constraint. In this case, our level-set approach would return a feasible, albeit possibly suboptimal, solution that can be implemented whereas the solution from existing level-set approaches may not be implementable due to potentially large constraint violations. Convex optimization problems where constraints model operating conditions that need to be satisfied by solutions for implementation arise in several applications, including network and nonlinear resource allocation, signal processing, and circuit design [7, 8, 3]. The radial sub-gradient method proposed by Renegar [5] and extended by Grimmer [0] also emphasizes feasibility. This method avoids projection and finds a feasible and ɛ-relative optimal solution by performing a line search at each iteration. This line search, while easy to execute for some linear or conic programs, can be challenging for general convex programs -. Moreover, it is unknown whether the iteration complexity of this approach can be improved by utilizing the potential smoothness and strong convexity of the objective function and constraints. In contrast, the feasible level-set approach in this paper avoids the need for line search, in addition to projection, and its iteration complexity leverages the strong convexity and smoothness of the objective and constraints when these properties are true. Our work is indeed related to first-order methods, which are popular due to their low per-iteration cost and memory requirement, for example compared to interior point methods, for solving large scale convex optimization problems []. Most existing first-order methods find a feasible and ɛ-optimal solution to a version of the optimization problem - with a simple feasible region e.g. a box, simplex or ball which makes projection on to the feasible set at each iteration inexpensive. Ensuring feasibility using projection becomes difficult for general constraints. Some sub-gradient methods, including the method by Nesterov [0, Section 3..4] and its

3 recent variants by Lan and Zhou [5] and Bayandina [4], can be applied to - without the need for projection mappings. However these methods do not ensure feasibility of the feasible solution as we do, that is, they instead find an ɛ-feasible and ɛ-optimal solution to - at convergence. Given the focus of our research, one naturally wonders whether there is an associated computational cost of ensuring feasibility in the context of level-set methods. Our analysis suggests that the answer depends on the perspective one takes. First, consider the dependence of the iteration complexity on a condition measure as a criterion. A nice property of the level-set method in [] is that its iteration complexity is independent of a condition measure for finding an ɛ-feasible and ɛ-optimal solution. Instead, finding an ɛ-relative optimal and feasible solution leads to iteration complexities with dependence on condition measures when i combining the level-set approach [] and the transformation in [5], or ii using our feasible level-set approach. This suggests that a cost of ensuring feasibility is the presence of a condition measure in the iteration complexity. Second, suppose we consider the dependence of the number of iterations on ɛ as our assessment criterion. The complexity result of the level-set approach provided by [] can be interpreted as the product of the iteration complexity of the oracle used to solve the subproblem and a Olog/ɛ factor. The dependence of our algorithm would be similar under a non-adaptive analysis akin to []. However, using a novel adaptive analysis we find that the iteration complexity of our feasible level-set approach can in fact eliminate the Olog/ɛ factor. In other words, our complexity is analogous to the known complexity of first order methods for unconstrained convex optimization and it is unclear to us whether a similar adaptive analysis can be applied to the approach in []. Overall, in terms of the dependence on ɛ there appears to be no cost to ensuring feasibility. The rest of this paper is organized as follows. Section introduces the level-set formulation and our feasible level-set approach, also establishing its outer iteration complexity to compute an ɛ-relative optimal and feasible solution to the optimization problem -. Section 3 describes two first-order oracles that solve the subproblem of our level-set formulation when the functions f and g i, i =,..., m, are smooth and non-smooth, and in each case accounts for the potential strong convexity of these functions. The inner iteration complexity of each oracle is also analyzed in this section. Section 4 presents an adaptive complexity analysis to establish the overall iteration complexity of using our feasible level-set approach.. Feasible Level-set Method. Our methodological developments rely on the assumption below. Assumption. There exists a solution e X such that max i=,...,m g i e < 0 and fe > f. We label such a solution g-strictly feasible. Given t R, we define L t := min x X Lt, x, where Lt, x := max{fx t; g i x, i =,..., m} for all x X. Let L t denote the sub-differential of L at t. Lemma summarizes known properties of L t. Lemma Lemmas.3.4,.3.5, and.3.6 in [0]. It holds that a L t is non-increasing and convex in t; b L t 0, if t f and L t > 0, if t < f. Moreover, under Assumption, it holds that L t < 0, for any t > f.

4 4 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI c Given 0, we have L t L t + L t. Therefore, ζ L 0 for all ζ L L t and t R. Proof. We only prove the second part of Item b as the proofs of other items are provided in [0]. In particular, we show L t < 0 for any t > f under Assumption. Let x denote an optimal solution to -. If t > f we have fx t < 0. Since f is a continuous function, there exists δ > 0 such that for any x B δ x X we have fx t < 0, where B δ x is a ball of radius δ centered at x. Take x B δ x X such that x = λx + λe for some λ 0,, where e is a solution that satisfies Assumption. By the convexity of the functions g i, i =,..., m, the feasibility of x, and the g-strict feasibility of e, we have g i x λg i x + λg i e < 0 for i =,..., m. In addition, since x B δ x, we have f x t < 0. Therefore, L t max{f x t, g i x, i =,..., m} < 0. Lemma b, specifically L t < 0 for any t > f, ensures that the solution t = f is a unique root to L t = 0 and the solution x t determining L t is an optimal solution to -. In this case, it is possible to find an ɛ-relative optimal and feasible solution by closely approximating t from above with a ˆt and then finding a point ˆx X with Lˆt, ˆx < 0. The latter condition ensures ˆx is feasible. In order to obtain such a ˆt, we start with an initial value t 0 > t and generate a decreasing sequence t 0 t t t, where t k is updated to t k+ based on a term obtained by solving 3 with t = t k up to a certain optimality gap. We present this approach in Algorithm which utilizes an oracle A defined as follows. Definition. An algorithm is called an oracle A if, given t f, x X, α > and γ Lt, x L t, it finds a point x + X, such that L t αlt, x +. Algorithm Feasible Level-Set Method Input: g-strictly feasible solution e, and parameters α > and 0 < ɛ. Initialization: Set x 0 = e and t 0 = fx 0. 3 for k = 0,,,... do 4 Choose γ k such that γ k Lt k, x k L t k. 5 Generate x k+ = At k, x k, α, γ k such that L t k αlt k, x k+. 6 if αlt k, x k+ Lt 0, x ɛ then 7 Terminate and return x k+. 8 end if 9 Set t k+ = t k + Lt k, x k+. 0 end for Given a g-strictly feasible solution e and α >, Algorithm finds an ɛ-relative optimal and feasible solution by calling an oracle A with input tuple t k, x k, α, γ k at each iteration k. This oracle provides a lower bound, αlt k, x k+ 0, and an upper bound, Lt k, x k+ 0, for L t k. We use the upper bound to update t k and the lower bound in the stopping criterion. In particular, Algorithm terminates when αlt k, x k+ is greater than or equal to Lt 0, x ɛ. We discuss possible choices for oracle A in 3 and its input γ in Table. Theorem shows that Algorithm is a feasible level-set method, that is x k generated at each iteration k is feasible, and terminates with an ɛ-relative optimal solution. It also establishes its outer iteration complexity. This complexity depends on a condition measure β := L t 0 t 0 t,

5 5 L t L t t t 0 0 β t t t 0 0 β L t 0 t L t 0 a Well conditioned case. b Poorly conditioned case. Fig. : Illustration of condition measure β for t and t 0 equals 0 and, respectively for problem -. The suboptimality of x 0 = e implies t 0 = fx 0 > t. By Lemma we can show that β 0, ]. Larger values of β signify a better conditioned problem, as illustrated in Figure, because we iteratively use a close approximation of L t to approach t from t 0. Therefore, larger L t in absolute value intuitively suggests faster movement towards t. In addition, the following inequalities can be easily verified from the convexity of L t and Lemma c: 4 β = L t 0 t 0 t L t t t, t t, t 0 ]. Theorem. Given α >, 0 < ɛ, and a g-strictly feasible solution e, Algorithm maintains a feasible solution at each iteration and ensures 5 t k+ t β t k t. α In addition, it returns an ɛ-relative optimal solution to - in at most α K := log β + α βɛ outer iterations. Proof. Notice that t 0 > t since t 0 = fe. Suppose t t k t 0 at iteration k. At iteration k + we have t k+ t = t k t + Lt k, x k+ t k t + L t k t k t t k t 0,

6 6 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI where the first equality is obtained by substituting for t k+ using the update equation for t in Algorithm, the first inequality using the definition of L t k, the second inequality from 4 with t = t k, and the final inequality from t k t 0. This implies that t k t for all k. In addition, the condition L t k αlt k, x k+ satisfied by the oracle A, α >, and Lemma b imply that Lt k, x k+ L t k /α 0. It thus follows that t k+ t k and the x k+ maintained by Algorithm is feasible for all k. Next we establish 5 and the outer iteration complexity of Algorithm. Using the definiton of t k+, α >, and the condition L t k αlt k, x k+ satisfied by oracle A, we write 6 t k+ t = t k t + Lt k, x k+ t k t + L t k α. Using 4 with t = t k to substitute for L t k in 6 gives the required inequality 5. Recursively using 5 starting from k = 0 gives 7 t k t β k t 0 t. α To obtain the terminal condition, we note that, for k K, αlt k, x k+ αl t k αt k t α β K t 0 t α ɛ α L t 0 ɛlt 0, x, where the first inequality holds by the definition of L t; the second using 4 for t = t k ; the third using 7; the forth by applying the definitions of K and β. The last inequality follows from the condition satisfied by the oracle at t 0 and x, that is, αlt 0, x L t 0. Hence, Algorithm terminates after K outer iterations. Finally we proceed to show that the terminal solution is ɛ-relative optimal. Observe that 8 fx k+ f = fx k+ t t k t, where the equality is a consequence of the definition of t and the inequality is obtained using fx k+ t k, which holds because fx k+ t k Lt k, x k+ 0. Combining t k t t 0 t L t k /L t 0 from 4 and 8, we obtain fx k+ f t 0 t L t k /L t 0. Rearranging terms results in the inequality fx k+ f fe f L t k L t 0, where we use t = f and t 0 = fe. Indeed, L t k /L t 0 ɛ when the algorithm terminates. To verify this, the terminal condition and the property satisfied by oracle A indicate that L t k αlt k, x k+ Lt 0, x ɛ L t 0 ɛ, which further implies L t k L t 0 ɛ.

7 First-order oracles. In this section, we adapt first-order methods to be used as the oracle A in Algorithm to solve 3 under different properties of the convex functions f and g i, i =,..., m. We adapt these methods in a manner that facilitates our adaptive complexity analysis in 4. We assume functions f and g i, i =,..., m, are µ-convex for some µ 0. Namely, for h = f, g,..., g m, we have 9 hx hy + ζ h, x y + µ x y for any x and y in X and ζ h hy, where hy denotes the set of all sub-gradients of h at y X and denotes the -norm. We call functions f and g i, i =,..., m, convex if µ = 0 and strongly convex if µ > 0. Let D := max x X x and M := max x X max{ ζ f ; ζ gi, i =,..., m}, where ζ f fx and ζ gi g i x, i =,..., m. When f and g i, i =,..., m are smooth, M := max x X max{ fx ; g i x, i =,..., m}. We make Assumption in the rest of the paper. Assumption. M < + and either i µ > 0 or ii µ = 0 and D < +. Assuming M to be finite is a common assumption in the literature when developing first-order methods for solving smooth min-max problems see e.g. [6, 4] and nonsmooth problems see e.g. [9, 8, ]. The assumption of finite D is present to obtain computable stopping criteria for the first-order oracles that we use. 3.. Smooth convex functions. In this subsection, in addition to assumptions and, we also assume that the functions in the objective and constrains of - are smooth, formally stated below. Assumption 3. Functions f and g i, i =,..., m, are S-smooth for some S 0, namely, for h = f, g,..., g m, hx hy + hy, x y + S x y for any x and y in X. A first-order method achieves a fast linear convergence rate when the objective function is both smooth and strongly convex [0]. Despite this smoothness assumption, since the function Lt, x is a maximum of functions modeling the objective and constraints, it may be non-smooth in x. Nevertheless, we can approximate Lt, x by a smooth and strongly convex function of x using a smoothing method [6, ] and a standard regularization technique. Applying first-order methods to minimize this smooth approximation of Lt, x leads to a lower iteration complexity than directly working with the original non-smooth function. The technique we utilize is a modified version of the smoothing and regularization reduction method by Allen- Zhu and Hazan []. However, since the formulation and the non-smooth structure of 3 are quite different from the problems considered in [], we present below the reduction technique and conduct the complexity analysis under our framework. The exponentially smoothed and regularized version of Lt, x is 0 L σ,λ t, x := σ log expσfx t + m expσg i x + λ x, where σ > 0 and λ 0 are smoothing and regularization parameters, respectively. Lemma is based on related arguments in [6] and summarizes properties of L σ,λ t, x and its closeness to Lt, x. i=

8 8 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI Lemma. Suppose assumptions and 3 hold. Given σ > 0 and λ 0, the function L σ,λ t, x is i σm + S-smooth and ii λ + µ-convex. Moreover, we have 0 L σ,λ t, x Lt, x logm + σ + λd. Here, we follow the convention that 0 + = 0 so that λd = 0 if λ = 0 and D = +. Proof. Item i and the relationship are respectively minor modifications of Proposition 4. and Example 4.9 in [6]. Item ii directly follows from the µ-convexity of f, g i, i =,..., m and λ-convexity of the function λ x. We solve the following version of 3 with Lt, x replaced by L σ,λ t, x: min x X L σ,λt, x. Specifically, we apply a variant of Nesterov s accelerated gradient method [5, 6] for a given t > t and a sequence of σ, λ with σ increasing to infinity and λ decreasing to zero. We refer to this gradient method as restarted smoothing gradient method RSGM and summarize its steps in Oracle. The choice of σ, λ and the number of iterations performed by RSGM depend on whether f and g i, i =,..., m are strongly convex µ > 0 or convex µ = 0. Moreover, RSGM is re-started after N i accelerated gradient method iterations with a smaller precision γ i until this precision is below the required threshold of αlt, x i. Theorem 3 establishes that Oracle returns a solution x + X such that L t αlt, x +. Hence, RSGM can be used as an oracle A see Definition. Theorem 3. Suppose assumptions,, and 3 hold. Given level t t, t 0 ], solution x X, parameters α > and γ Lt, x L t, Oracle returns a solution x + X that satisfies L t αlt, x + in at most 40S µ + α γ log αl t + 60α logm + + 3M µ αl t iterations of the accelerated gradient method when µ > 0 and 0Sα D αl t + 4MDα 80 logm + α γ αl + log t αl t + such iterations when µ = 0 and D < Proof. Let x σ,λ t := arg min x X L σ,λ t, x and L σ,λ t := L σ,λt, x σ,λ t. We 304 will first show by induction that at each iteration i = 0,,,... of Oracle the 305 following inequality holds Lt, x i L t γ i. The validity of for i = 0 holds because x 0 = x, γ 0 = γ, and we know Lt, x L t γ. Suppose holds for iteration i. We claim that it also holds for iteration i + in both the strongly convex and convex cases. In fact, at iteration i +, we have Lt, x i+ L t L σi,λ i t, x i+ L σ i,λ i t + logm + + λ id σ i

9 Oracle Restarted Smoothing Gradient Method: RSGMt, x, α, γ Input: Level t, initial solution x X, constant α >, and initial precision γ Lt, x L t. Set x 0 = x and γ 0 = γ. 3 for i = 0,,,... do 4 if γ i αlt, x i then 5 Terminate and return x + = x i. 6 end if 7 Set σ i = 4 logm + γ i if µ > 0 8 logm + γ i if µ = 0 and λ i = 0 if µ > 0 γ i 4D if µ = 0. 8 Starting at x i, apply the accelerated gradient method [6, Algorithm ] on the optimization problem min x X L σi,λ i t; x for { 40S max µ, M } 60 logm + if µ > 0 µγi N i = { 60SD max, 4MD } 80 logm + if µ = 0, γ i iterations and let x i+ be the solution returned by this method. 9 Set γ i+ = γ i /. 0 end for γ i σ im + S x i x σ i,λ i t logm + Ni + + λ id σ i 4σ im + SL σi,λ i t, x i L σ i,λ i t logm + + 4σ im + S µ + λ i N i = 5γ iσ i M + S µ + λ i Ni γ i 8 + γ i 8 + γ i 4 = γ i = γ i+. µ + λ i Ni logm + γ i + + λ id σ i + γ i 4 + λ id σ i logm λ id σ i The first inequality follows from Lemma ; the second inequality from the σ i M + S-smoothness of L σi,λ i t, x and the convergence property, i.e., L σi,λ i t, x i+ L σ i,λ i t σ Ni i M + S x i x σ i,λ i t, of an accelerated gradient method [6, Corollary ]; the third using the λ i + µ-convexity of L σi,λ i t, x in x; and the fourth by combining Lemma and the induction hypothesis that Lt, x i L t γ i. The

10 0 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI rest of the relationships follow from the inequalities 5γ i σ i M /µ + λ i Ni γ i/8 and 5γ i S/µ + λ i Ni γ i/8, which hold for both µ > 0 and µ = 0 according to the definitions of σ i, λ i, and N i. It can be easily verified that Oracle terminates in at most I iterations, where α γ I := log αl t + 0. In fact, according to, for any i I, we have Lt, x i L t γ i = γ 0 i γ I α α L t, which implies that αlt, x i L t. Hence γ i α α αlt, x i = αlt, x i and Oracle terminates. We next analyze the total number of iterations across all calls of the accelerated gradient method [6, Algorithm ] during the execution of Oracle. Suppose µ > 0. The number of iterations taken by Oracle can be bounded as follows: I I I 40S N i µ + 60 logm + I M + µγ i i=0 = = = i=0 40S 40S 40S i=0 i=0 µ + 60 logm + I / I + M µ γ i=0 i µ + 60 logm + I I + M i/ µγ 0 i=0 µ + 60 logm + I/ I + M µ γ α γ log αl t + + 3M 40S µ + 60α logm + µ αl t. The first inequality above is obtained using the inequality a a+ for a R +, and by bounding the maximum determining N i by a sum of its terms; the first equality from simplifying terms; the second equality from using the definition of γ i ; the third equality from γ 0 = γ and the geometric sum formula I i=0 i/ = I/ / ; and the last inequality by substituting for I, using the relationship a + a + for a R + to obtain the inequality 3 / / I/ α γ = αl t + α γ αl, t and bounding / above by 3. Suppose µ = 0. The iteration complexity can be bounded using the following steps: I I I 0S 80 logm + I N i 4D + 4MD + γ i=0 i=0 i γ i=0 i i=0 I 0S 80 logm + I = 4D i/ + 4MD i + I γ 0 γ i=0 0 i=0 0S I/ 80 logm + I = 4D γ 0 + 4MD + I γ 0

11 Sα D αl t + 4αMD 80 logm + α γ αl + log t αl t +. The first inequality follows from replacing the maximum in the definition of N i by the sum of its terms and using a a + for a R +, the first and second equalities by rearranging terms and using the geometric sum formula, respectively, and the last inequality by substituting for I and using the inequality Non-smooth functions. In this subsection, we only enforce assumptions and. In other words, the functions f and g i i =,..., m can be non-smooth. Under Assumption, we have max x X ζ L,t x M for any ζ L,t x x Lt, x and t R, where x Lt, x denotes the set of all sub-gradients of Lt, at x with respect to x. As oracle A, we choose the standard sub-gradient method with varying stepsizes from [9] and its variant in [3], respectively, when µ = 0 and µ > 0 to directly solve 3. We present this sub-gradient method for both these choices of µ in Oracle. The iteration complexity of Oracle is well-known and given in Proposition without proof. The proof for µ = 0 can be found in [9,.] and and the one for µ > 0 is given in [3, Section 3.]. Here we only provide arguments to show that Oracle will return a solution with the desired optimality gap required by Algorithm, which then qualifies this algorithm to be used as an oracle A see Definition. Oracle Sub-gradient Descent: SGDt, x, α Input: Level t, initial solution x X and constant α > Set x 0 = x. 3 for i = 0,,,... do 4 Set M if µ > 0, if µ > 0, η i = i + µ and E i := i + µ D DM if µ = 0, if µ = 0. i + M i + 5 Set i j=0 x i = i + i + j + x j if µ > 0, i j=0 i + x j if µ = 0. 6 Let x i+ = arg min x X x x i + η i ζ L,t x i. 7 if E i αlt, x i then 8 Terminate and return x + = x i. 9 end if 0 end for Proposition [3, 9]. Given level t t, t 0 ], solution x X, and a parameter α >, Oracle returns a solution x + X that satisfies L t αlt, x + in at The original algorithms proposed in [3] and [9] are stochastic sub-gradient methods. Here, we apply these methods with deterministic sub-gradient.

12 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI most 4 αm µ αl t iterations of the sub-gradient method when µ > 0 and at most 5 such iterations when µ = 0 and D < +. 8α M D αl t Proof. We first consider the case when µ = 0 and D < +. Oracle is the subgradient method in [9] with the distance-generating function. From equation.4 in [9], we have Hence, when i 8α M D α L t Lt, x i L t DM i + = E i. implies L t αlt, x i. Hence, E i α α, we have Lt, x i L t E i α α L t, which L t αlt, x i which indicates 8α that Oracle terminates in no more than M D α L t iterations. Now, suppose µ > 0. Oracle, in this case, is the sub-gradient method in [3]. By Section 3. in [3], we have Lt, x i L t M i + µ. αm Therefore, when i µα L t, we have Lt, x i L t E i α α L t, which implies L t αlt, x i. In addition, the algorithm terminates after at most this many iterations because E i α α αlt, x i αlt, x i. This proposition indicates that the sub-gradient method in [9] and its variant in [3] qualify as oracles A in Algorithm when µ = 0 and µ > 0, respectively see Definition. Since the iteration complexities 4 and 5 do not depend on the initial solution x, the worst-case complexity does not restrict this choice of x X. In addition, γ is not needed for executing Oracle. Therefore, we omit it as an input to this algorithm even though the generic call to the oracle A in Algorithm includes γ. 4. Overall iteration complexity. We present an adaptive complexity analysis to establish the overall complexity of Algorithm when using the oracles A discussed in 3. We present two lemmas that are needed to show our main complexity result in Theorem 4. Lemma 3. Suppose Algorithm terminates at iteration K. Then for any k K, we have 6 L t k < L t 0 ɛ α. Proof. We first use the relationship t k = t k + Lt k, x k to show L t k L t k for any k. Defining x k := arg min x X Lt k, x, we proceed as follows: L t k max{fx k t k Lt k, x k ; g i x k, i =,..., m}

13 = max{fx k t k ; g i x k + Lt k, x k, i =,..., m} Lt k, x k L t k Lt k, x k L t k, where the second inequality holds since Lt k, x k α L t k 0 and the third inequality follows from the inequality L t k Lt k, x k. Since Algorithm has not terminated at iteration k for k K, we have 8 L t 0 ɛ αlt 0, x ɛ > α Lt k, x k α L t k, where the first inequality holds by the condition satisfied by oracle A, the second by the stopping criterion of Algorithm not being satisfied, and the third by the definition of L t k. The inequality 6 then follows by combining 7 and 8. Lemma 4. Suppose Algorithm terminates at iteration K. The following hold:. K L t k 4α 3 β L t 0 ɛ ;. K L t k 4 α β.5 L t 0 ɛ ; 3. K L t k 6α 5 3β 3 L t 0 ɛ. Proof. We only prove Item. Items and 3 can be derived following analogous steps. We have K L t k βt k t = β α K k βt K t βl t K α βl t 0 ɛ α βl t 0 ɛ 4α 3 β L t 0 ɛ, β K k α β K k α [ ] β α K+ β α where the first inequality follows by 4; the second by 5; the third by 4; the fourth using 6 with k = K; the first equality by the geometric sum; and the fifth inequality by dropping negative terms.

14 4 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI Table : Choices of the oracle A and γ k in Algorithm Smooth Asms,, & 3 Non-smooth Asms & Case Oracle Choice of γ k Strongly convex µ > 0 Convex µ = 0 & D < + MD Strongly convex µ > 0 N/A Convex µ = 0 & D < + N/A any γ L t 0 if k = 0 αlt k, x k if k We summarize the choice of oracle A and γ k in each iteration of Algorithm in Table asms abbreviates assumptions and characterize the overall iteration complexity measured by the total number of gradient iterations of this algorithm in Theorem 4. The dependence on ɛ in our big-o complexity for solving the constrained convex optimization problem - and the known analogue of this complexity for the unconstrained version of this problem are interestingly the same. However, unlike the unconstrained case, the complexity terms in Theorem 4 depend on the condition measure β. Theorem 4. Suppose the oracle A and γ k in Algorithm are chosen according to Table. Given 0 < ɛ, α >, and a g-strictly feasible solution e, Algorithm returns an ɛ-relative optimal and feasible solution to - in at most O β.5 L t 0 ɛ + log + log β α βɛ L, t 0 ɛ O β L, O t 0 ɛ β L, and O t 0 ɛ β 3 L t 0 ɛ gradient iterations, respectively, when the functions f and g i, i =,..., m are smooth strongly convex, smooth convex, non-smooth strongly convex, and non-smooth convex. Notice that we have only presented the terms ɛ, β and L t 0 in our big-o complexity. The exact numbers of iterations are given explicitly in the proof. Remark. Based on Table, when the functions f, g i, i =,..., m, are smooth and µ > 0, the input γ used in the first call of Oracle should be bounded below by L t 0. We do not specify the choice of such γ 0 in our complexity bound see 3. However, one simple choice of γ 0, in this case, could be µ fe since { L t 0 min max x X min x X µ fe, fe, x e + µ } x e, g i x, i =,..., m { fe, x e + µ x e } 464 where the first inequality follows from 9 and the third holds since x = e { µ fe solves min x R n fe, x e + µ 465 x e }.

15 Proof. [Smooth strongly convex case] We first show that we can guarantee γ k Lt k, x k L t k by choosing γ 0 L t 0 and γ k = αlt k, x k for k. Hence γ k can be used as an input to Oracle. When k = 0, we have Lt 0, x 0 = 0 and hence Lt 0, x 0 L t 0 = L t 0 γ 0. Take k. We show that Lt k, x k = max{fx k t k Lt k, x k ; g i x k, i =,..., m} = max{fx k t k ; g i x k + Lt k, x k, i =,..., m} Lt k, x k Lt k, x k Lt k, x k 0, where the first and the second inequalities hold because Lt k, x k α L t k 0. Furthermore, since L t is a decreasing function in t we have L t k L t k since t k t k. Combining this with 9 yields Lt k, x k L t k L t k αlt k, x k = γ k. We next characterize the iteration complexity of oracle A in each iteration of Algorithm using Theorem 3. By virtue of this theorem, the total number of gradient iterations performed by Algorithm is at most = 0 [ 40S 40S µ + + 3M µ + α γ k log αl t k + + 3M α γ 0 [log αl t α logm + µα L t k, k= ] 60α logm + µ αl t k α α ] Lt k, x k log αl t k where we have replaced γ k by αlt k, x k for k > 0, and used the inequality αl t k αl t k α αlt k, x k. The first sum in 0 can be bounded as follows. k= α α Lt k, x k log αl t k K α α Lt k, x k = log αl t k k= α K α = log Lt 0, x α L t K α K α log Lt 0, x α L t K [ α α = K log α K k= + log Lt0, x L t K Lt k, x k+ L t k ]

16 6 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI α α α log β + log α βɛ α + log α where the first inequality holds because Lt k, x k+ /L t k for all k = 0,,..., K, α and the second inequality follows from setting K equal to log β +, α using 6 for k = K, and the inequality L t 0 Lt 0, x 0. The second sum in 0 can be upper bounded using Item of Lemma 4. Specifically, 60α logm + 3M αµl t k Mα 30α logm + β.5 µ αl t 0 ɛ. Adding and, we obtain the following upper bound on the total number of iterations 0 required by Algorithm : [ 40S α α µ + α α log β + log α βɛ + log α ɛ ] / α γ 0 + log αl t Mα 30α logm + 3 β.5 µ αl. t 0 ɛ [Smooth convex case] Notice that γ k = MD can be used as an input to Oracle at every iteration k since Lt k, x k L t k ζ L,tk x k, x k x k ζ L,tk x k x k x k MD, where the first inequality follows from the convexity of Lt, x in x, the second using the Cauchy-Schwartz inequality, and the last using the definition of M and D. By theorems and 3, the total number of iterations required by Algorithm is equal to [ 0αS D αl t k + 4MDα ] 80 logm + α γ k αl + log t k αl t k +. An upper bound on this quantity is [ 0αS D αl t k + 4MDα 80 logm + αl + t k βɛ MDα αl t k which is obtained by replacing γ k = MD for all k and using the inequality loga a a for a >. Next using Lemma 4, we obtain the required upper bound on the total number of iterations taken by Algorithm as 48Dα β.5 + 8α β.5 0Sα αl t 0 ɛ / + MDα /. αl t 0 ɛ 6MDα 4 β αl 80 logm + / t 0 ɛ ɛ ],,

17 [Non-smooth strongly convex case] By Lemma 4 and Proposition, the total number of iterations required by Algorithm can be bounded above in a straightforward manner: αm µ αl t k 4M α 4 µβ αl t 0 ɛ. [Non-smooth convex case] In a similar fashion, by Lemma 4 and Proposition, the total number of iterations required by Algorithm is upper bounded by 8α M D αl t k 43M D α 7 β 3 αl t 0 ɛ REFERENCES [] Z. Allen-Zhu and E. Hazan, Optimal black-box reductions between optimization objectives, arxiv preprint arxiv: , 06. [] A. Y. Aravkin, J. V. Burke, D. Drusvyatskiy, M. P. Friedlander, and S. Roy, Level-set methods for convex optimization, arxiv preprint arxiv: , 06. [3] A. Y. Aravkin, J. V. Burke, and M. P. Friedlander, Variational properties of value functions, SIAM Journal on Optimization, 3 03, pp [4] A. Bayandina, Adaptive mirror descent for constrained optimization, arxiv preprint arxiv: , 07. [5] A. Beck and M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences, 009, pp [6] A. Beck and M. Teboulle, Smoothing and first order methods: A unified framework, SIAM Journal on Optimization, 0, pp [7] K. M. Bretthauer and B. Shetty, The nonlinear knapsack problem algorithms and applications, European Journal of Operational Research, 38 00, pp [8] M. Chiang, Geometric programming for communication systems, Foundations and Trends in Communications and Information Theory, 005, pp. 54. [9] R. M. Freund and H. Lu, New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure, arxiv preprint arxiv:5.0974, 05. [0] B. Grimmer, Radial subgradient method, arxiv preprint arxiv: v, 07. [] Z. Harchaoui, A. Juditsky, and A. Nemirovski, Conditional gradient algorithms for normregularized smooth convex optimization, Mathematical Programming, 5 05, pp. 75. [] A. Juditsky and A. Nemirovski, First-order methods for nonsmooth convex large-scale optimization, i: General purpose methods, in Optimization for Machine Learning, S. Sra, S. Nowozin, and S. Wright, eds., MIT Press, Cambridge, MA, USA, 0, pp. 48. [3] S. Lacoste-Julien, M. Schmidt, and F. Bach, A simpler approach to obtaining an O/t convergence rate for the projected stochastic subgradient method, arxiv preprint arxiv:.00, 0. [4] G. Lan, Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization, Mathematical Programming, 49 05, pp. 45. [5] G. Lan and Z. Zhou, Algorithms for stochastic optimization with expectation constraints, arxiv preprint arxiv: , 06. [6] C. Lemaréchal, An extension of davidon methods to non differentiable problems, Nondifferentiable Optimization, 975, pp [7] C. Lemaréchal, A. Nemirovskii, and Y. Nesterov, New variants of bundle methods, Mathematical programming, , pp. 47. [8] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 9 009, pp [9] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM Journal on Optimization, 9 009, pp

18 8 QIHANG LIN, SELVAPRABU NADARAJAH, AND NEGAR SOHEILI [0] Y. Nesterov, Introductory lectures on convex optimization: A basic course, Kluwer, Netherlands, 004. [] Y. Nesterov, Smooth minimization of non-smooth functions, Mathematical Programming, , pp [] Y. Nesterov, Primal-dual subgradient methods for convex problems, Mathematical programming, 0 009, pp. 59. [3] M. Patriksson and C. Strömberg, Algorithms for the continuous nonlinear resource allocation problem new implementations and numerical studies, European Journal of Operational Research, 43 05, pp [4] E. Pee and J. O. Royset, On solving large-scale finite minimax problems using exponential smoothing, Journal of optimization theory and applications, 48 0, pp [5] J. Renegar, Efficient subgradient methods for general convex optimization, SIAM Journal on Optimization, 6 06, pp [6] P. Tseng, On accelerated proximal gradient methods for convex-concave optimization, submitted to SIAM Journal on Optimization, 008. [7] E. Van Den Berg and M. P. Friedlander, Probing the pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing, 3 008, pp [8] E. Van den Berg and M. P. Friedlander, Sparse optimization with least-squares constraints, SIAM Journal on Optimization, 0, pp [9] P. Wolfe, A method of conjugate subgradients for minimizing nondifferentiable functions, 975, pp

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the