arxiv: v2 [math.oc] 11 Jun 2018

Size: px

Start display at page:

Download "arxiv: v2 [math.oc] 11 Jun 2018"

Ross Hood
5 years ago
Views:

1 : Tiht Automated Converence Guarantees Adrien Taylor * Bryan Van Scoy * 2 Laurent Lessard * 2 3 arxiv:83.673v2 [math.oc] Jun 28 Abstract We present a novel way of eneratin Lyapunov functions for provin linear converence rates of first-order optimization methods. Our approach provably obtains the fastest linear converence rate that can be verified by a quadratic Lyapunov function (with iven states), and only relies on solvin a small-sized semidefinite proram. Our approach combines the advantaes of performance estimation problems (PEP, due to Drori & Teboulle (24)) and interal quadratic constraints (IQC, due to Lessard et al. (26)), and relies on convex interpolation (due to Taylor et al. (27c;b)).. Introduction In this wor, we study first-order methods for solvin the (unconstrained) minimization problem minimize f(x) (P) x R d where f : R d R. In the sequel, we focus on the case where f is L-smooth and µ-stronly convex, thouh our methodoloy can be adapted to a broader class of problems. To solve (P), we consider methods that iteratively update their estimate of the optimizer usin only radient evaluations. One possibility for provin converence of such methods is by findin Lyapunov functions. A Lyapunov function can be interpreted as definin an enery that decreases eometrically with each iteration of the * Equal contribution IRIA, Département d informatique de l ES, École normale supérieure, CRS, PSL Research University, Paris, France 2 Wisconsin Institute for Discovery, University of Wisconsin Madison, Madison, Wisconsin, USA 3 Department of Elecrical and Computer Enineerin, University of Wisconsin Madison, Madison, Wisconsin, USA. Correspondence to: Adrien Taylor <adrien.taylor@inria.fr>, Bryan Van Scoy <vanscoy@wisc.edu>, Laurent Lessard <laurent.lessard@wisc.edu>. Proceedins of the 35 th International Conference on Machine Learnin, Stocholm, Sweden, PMLR 8, 28. Copyriht 28 by the author(s). method, with an enery of zero correspondin to reachin the optimal solution of (P). The existence of such an enery function thus provides a straihtforward certificate of linear converence for the iterative method. In this paper, we present an automated way of eneratin quadratic Lyapunov functions for certifyin linear converence of first-order iterative methods to solve (P). The procedure relies on solvin a small-sized semidefinite proram (SDP) so it is computationally efficient. Moreover, the procedure is tiht, meanin that if the SDP is infeasible, then no such quadratic Lyapunov function exists. Our results unify recent SDP-based wors for certifyin converence of first-order methods, namely: performance estimation problems (Drori & Teboulle, 24; Taylor et al., 27c) and interal quadratic constraints from robust control (Lessard et al., 26), usin smooth stronly convex interpolation (Taylor et al., 27c). These connections are further discussed in Section Oranization The paper is oranized as follows. We describe the class of methods under consideration and basic properties of Lyapunov functions in Sections 2 and 3 respectively. Our main results are then presented in Section 4, which also features numerical examples and comparisons to other approaches. The correspondin proof is presented in Section 5. Finally, we explore extensions of our approach in Section 6, and conclude in Section Preliminaries A function f : R d R is called L-smooth if its radient is Lipschitz continuous with parameter L, i.e., f(x) f(y) L x y for all x, y R d. () Furthermore, f is called convex if f(x) f(y) + f(y) T (x y) for all x, y R d, (2) and µ-stronly convex if f(x) µ 2 x 2 is convex. The set of L-smooth and µ-stronly convex functions is denoted F µ,l, and we define κ := L µ, the correspondin condition number.

2 When f F µ,l with < µ L, optimization problem (P) has a unique minimizer denoted x := ar min x f(x). The function and radient values at optimality are denoted f := f(x ) and := f(x ) = d, respectively. 2. First-Order Iterative Fixed-Step Methods To solve the optimization problem (P), we consider firstorder iterative fixed-step methods of the form y = x + = γ j x j j= β j x j α f(y ) j= (M) for where α, β j, γ j are the (fixed) step-sizes and x j R d for j =,..., are the initial conditions. We call the constant the deree of the method. Many first-order optimization methods are of the form (M), includin: the Gradient Method, Heavy Ball Method (Polya, 964), Fast Gradient Method for smooth stronly convex minimization (esterov, 24), Triple Momentum Method (Van Scoy et al., 28), and Robust Momentum Method (Cyrus et al., 28). For method (M) to solve (P), it must have a fixed-point at the optimizer x. Hence, we require the step-sizes to satisfy β j = j= and γ j =. j= For simplicity, let us define the concatenated error vectors at iteration as x, R (+)d and f R + with x := [ (x x ) T... (x x ) T] T := [ ( ) T... ( ) T] T f := [ (f f )... (f f ) ] T (3a) (3b) (3c) where x R d are the iterates, f := f(y ) R are the function values, and := f(y ) R d are the radient values. ote that we shifted (x,, f ) so that the optimal solution corresponds to (x,, f ) = (,, ). 3. What is a Lyapunov Function? Lyapunov functions are one of the fundamental tools in control theory that can be used to verify stability of a dynamical system (Kalman & Bertram, 96a;b). Consider applyin method (M) to solve problem (P). Our oal is to find the smallest possible ρ < such that {x } converes linearly to the optimizer x with rate ρ. A Lyapunov function V is a continuous function V : R n R that satisfies the followin properties:. (nonneative) V(ξ) for all ξ, 2. (zero at fixed-point) V(ξ) = if and only if ξ = ξ, 3. (radially unbounded) V(ξ) as ξ, 4. (decreasin) V(ξ + ) ρ 2 V(ξ ) for, where ξ := (x,, f ) is the state of the system at iteration. The state at iteration includes past iterates, function values, and radient values from iterations up to. If we can find such a V, then it can be used to show that the state converes linearly to the fixed-point from any initial condition (the rate of converence depends on both ρ and the structure of V). Lyapunov functions are typically found by searchin over a parameterized family of functions (called Lyapunov function candidates). In the simple case where the state {ξ } is enerated by a linear dynamical system, one can search over quadratic Lyapunov function candidates by solvin a semidefinite proram, as illustrated in Example below. Example (Quadratic Lyapunov function). Consider the linear dynamical system described by ξ + = Aξ, ξ R n with fixed-point ξ R n (i.e., ξ = Aξ ). Suppose that feasible P S n A T P A ρ 2 P, P (4) has solution P. Then a Lyapunov function for the system is V(ξ) = (ξ ξ ) T P (ξ ξ ) which can be used to show that ξ ξ linearly with rate ρ. Specifically, we have the bound ξ ξ P ρ ξ ξ P for. To find the best bound, we can perform a bisection search on ρ to find the smallest ρ such that (4) is feasible. ote that althouh V depends explicitly on the fixed point ξ, we do not need to now ξ to solve the SDP (4). The linear dynamical system of Example converes linearly if and only if a quadratic Lyapunov function exists, which happens if and only if the SDP (4) is feasible (Lyapunov & Fuller, 992; Vidyasaar, 22). 4. Main Results Similar to Example, we now show how to use quadratic Lyapunov functions to prove linear converence of a firstorder iterative fixed-step method applied to the minimization of a smooth stronly convex function. Furthermore, we show that such Lyapunov function exists if and only if a small-sized semidefinite proram is feasible (whose optimal solution produces the Lyapunov function).

3 4.. Quadratic Lyapunov Functions We bein with sufficiency: if we can find a quadratic Lyapunov function, we can use it to prove linear converence. Lemma 2 (Quadratic Lyapunov function). Consider applyin the first-order iterative fixed-step method (M) of deree to a smooth stronly convex function f F µ,l (R d ) with < µ L. Define the state ξ := (x,, f ) as in (3). Consider the quadratic function V(ξ ) = [ ] T [ ] x x (P I d ) + p T f for (5) with parameters P S 2(+) and p R +, and where denotes the Kronecer product. Suppose V is a Lyapunov function for the system with rate ρ. Then, the followin bound is satisfied: V(ξ ) ρ 2( ) V(ξ ) for. (6) Proof. Suppose V is a Lyapunov function for method (M) with f F µ,l. Then V(ξ i+ ) ρ 2 V(ξ i ) for i. Multiplyin this inequality by ρ 2( i ) and summin over i =,..., ives a telescopin sum that yields (6). As a consequence to Lemma 2, we have the relations: x x = O(ρ ) f(y ) = O(ρ ) f(y ) f = O(ρ 2 ) where x R d is the optimizer of (P) and f := f(x ). (7a) (7b) (7c) Remar 3. The Lyapunov function (5) is only defined for since the state ξ is a function of the previous function and radient values. This is why the bound (6) is expressed in terms of V(ξ ). Remar 4. The states used in the Lyapunov function (5) can be modified to include other iterates (such as y ) in the quadratic term as well as the function and radient values evaluated at iterates other than y. We chose the form in (5) because it contains all necessary inredients while also bein straihtforward to eneralize to other cases. In addition, note that the structure of (5) maes it permutation-invariant (i.e., it does not depend on the orderin of the coordinate set). This is larely motivated by the fact that there is no reason to favor any coordinate amon R d. Lemma 2 shows that if we can find a quadratic Lyapunov function, then we can use this to prove linear converence of method (M) when f F µ,l. In the followin section, we construct an SDP whose feasibility is necessary and sufficient for the existence of such a Lyapunov function SDP for Quadratic Lyapunov Functions Given parameters α, β j, and γ j for a method (M) of deree and a rate ρ to be verified, we construct the semidefinite proram as follows. Step : Initialization. First, we initialize the row vectors x (K), R +K+2 (K) and f R K+, correspondin to the initial conditions, radient values, and function values, respectively, as x (K) := e T ++ for {,..., } (8a) := e T ++2 for {,..., K} (8b) f (K) := e T + for {,..., K} (8c) for K {, + } (e i denotes the i th unit vector with appropriate dimension). These form a basis for all iterates, function values, and radient values up to iteration K. Also, define the row vectors correspondin to the fixed-point as ȳ (K) := T +K+2, ḡ (K) := T (K) +K+2, f := T K+. We also introduce the followin SDP variables: P S 2(+), λ R for i, j I, p R +, η R for i, j I +, where I K := {,,..., K, } is an index set. Step 2: Method. ext, we iterate the method for =,..., K usin the row vectors we previously defined. ȳ (K) = x (K) j= + = j= γ j x (K) j (9a) β j x (K) j α ḡ(k). (9b) Step 3: Interpolation conditions. Usin the computed vectors, define m (K) R K+ and M (K) S +K+2 as m (K) M (K) := 2 := (L µ) ( (K) f i ȳ (K) i ȳ (K) j i j T M (K) f j ȳ (K) i ȳ (K) j i j ) T (a) (b) The terms M (K) and m (K) are related to interpolation by smooth stronly convex functions as discussed in Section 5.

4 for i, j I K where µl µl µ L M := µl µl µ L µ µ. () L L Step 4: Lyapunov function. We now construct the linear and quadratic terms in the Lyapunov function, denoted v (K) R K+ and V (K) S +K+2, respectively, as v (K) V (K) := where the matrices x (K) f (K) x (K) := := p T (K) f ] (K) T [ x P ] (K) [ x (2a) (2b), R (+) (+K+2) and R (+) (K+) are defined as x (K). x (K) ḡ(k) :=. (K) f := f (K). f (K) Also, define the decrease in the linear and quadratic terms of the Lyapunov function as v (K) V (K) := v (K) + ρ2 v (K) := V (K) + ρ2 V (K) where ρ is the converence rate to be verified.. (3a) (3b) Step 5: Semidefinite proram. Finally, we compute the quadratic Lyapunov function (if one exists) for a iven rate ρ by solvin the followin semidefinite proram: SDP for quadratic Lyapunov function (ρ-sdp) feasible V () P S 2(+) λ M () p R + i,j I {λ } {η } < v () λ m () i,j I V (+) v (+) λ for i, j I η for i, j I + + η M (+) i,j I + + η m (+) i,j I + Theorem 5 (Main Result). Consider applyin the firstorder iterative fixed-step method (M) of deree to a smooth stronly convex function f F µ,l (R d ) with < µ L. Let the step-sizes α, β j, and γ j be such that α, γ, and β j = γ j =. j= j= Then there exists a quadratic Lyapunov function of the form (5) with rate ρ that is valid for all d if and only if (ρ-sdp) is feasible. From Theorem 5, we can perform bisection on ρ to find the minimum ρ such that (ρ-sdp) is feasible to produce the fastest linear converence rate that is able to be verified usin a quadratic Lyapunov function with states (x,, f) Comparison to PEP and IQC Our results are closely related to several other recent approaches utilizin semidefinite prorams for studyin converence of first-order methods, which we discuss now. Performance Estimation Problem (PEP). The performance estimation approach was introduced by Drori & Teboulle (24) as a systematic way to obtain worst-case performance uarantees of a iven method. In the context of fixed-step first-order methods, performance estimation problems (PEP) can be formulated as semidefinite prorams. The ey idea in PEP is to loo for a tuple (x,..., x, f) such that the iven alorithm behaves in the worst possible way, accordin to a iven performance measure. The dual of PEP correspondin to the performance measure V(ξ + )/V(ξ ) for some fixed P and p is exactly the same as solvin min ρ ρ 2 subject to (ρ-sdp) bein feasible. The difference between PEP and our approach is that in PEP, the optimization is for a fixed performance measure carried out over multiple timesteps. This yields exact worst-case bounds, but at the cost of solvin an SDP whose size is proportional to the number of timesteps (this allows, amon others, dealin with time-varyin methods and sublinear converence rates). In our approach, for a fixed ρ, we optimize the performance measure itself. This yields a Lyapunov function with a uaranteed decrease at every iteration while () maintainin tihtness and (2) solvin a small SDP of fixed size. Both approaches ensure tihtness via (smooth) convex interpolation, developed by Taylor et al. (27c). Interal Quadratic Constraints (IQCs). Interal quadratic constraints are an analysis method for boundin the worst-case performance of dynamical systems in feedbac with nonlinearities (Meretsi & Rantzer, 997). This approach was recently adapted for use in analyzin

5 first-order optimization alorithms (Lessard et al., 26). In the optimization context, the nonlinear component is the radient of the objective function, while the dynamical system is the iterative method bein analyzed. The ey idea with IQCs is to replace the nonlinearity ( f) by quadratic constraints that it must satisfy. This is precisely the idea behind interpolation (discussed in Section 5.), which is a foundational concept in our methodoloy. The difference between IQCs and our approach is that the interpolation conditions are necessary and sufficient to characterize f when f F µ,l. However, the sector IQC and weihted off-by-one IQC used by Lessard et al. (26) are a strict subset of the interpolation conditions; they are only sufficient for describin f when f F µ,l. In particular, the IQC framewor does not use any constraints on f that explicitly involve function values. This amounts to solvin (ρ-sdp) with additional constraints on λ and η such that all the function values cancel out in the SDP umerical Comparisons To illustrate our results, we consider the Gradient Method (GM), Heavy Ball Method (HBM), Fast Gradient Method (FGM), and Triple Momentum Method (TMM). Each of these methods can be parametrized as y = x + γ (x x ) x + = x + β (x x ) α f(y ) (4a) (4b) for where x, x R d are the initial conditions, and the parameters for each method are: Method α β γ GM L ( κ ) 2 4 HBM ( L+ µ) 2 κ+ FGM L TMM 2 L µ L L κ κ+ ( κ ) 2 κ+ κ κ κ+ ( κ ) 2 2κ+ κ We use (ρ-sdp) to find correspondin Lyapunov functions. The correspondin converence rates are provided in Fiure ; the results match those obtained usin IQCs (Lessard et al., 26) for GM, HBM, and FGM, and those for TMM provided in (Van Scoy et al., 28). For more complicated cases, the performance estimation toolbox PESTO (Taylor et al., 27a) can be used to perform numerical validations. For illustrative purposes, we present results obtained usin a restricted class of Lyapunov functions. We fixed λ = in (ρ-sdp) and plotted the best achievable ρ in Fiure 2. We observe that this restricted class is not sufficient to recover the rates obtained in Fiure. Worst-case converence rate ρ GM HBM.2 FGM TMM 2 3 Condition ratio κ Fiure. Worst-case linear converence rates from (ρ-sdp). Evaluations to converence / lo ρ Condition ratio κ FGM FGM TMM TMM Fiure 2. Order of manitude of the worst-case number of iterations, which is O( / lo ρ), to solve problem (P). The bounds are obtained by searchin for Lyapunov functions of the form (5) in two cases: (i) (P, p) found usin (ρ-sdp) (solid), and (ii) restrictin P and p > (dashed). 5. Proof of Theorem Sampled Smooth Stronly Convex Functions To prove Theorem 5, we first need a result on smooth stronly convex functions that are sampled at discrete points. Indeed, inequalities () and (2) completely characterize functions that are smooth and stronly convex. However, these inequalities are defined on an infinite set of points, and it was shown in Section 2.2 of (Taylor et al., 27c) that usin them to prove converence may introduce conservatism. Therefore, in order to completely characterize points which are sampled from smooth stronly convex functions, we need the concept of interpolation.

6 The followin theorem is borrowed from (Taylor et al., 27c) and forms the basic buildin bloc for our analysis. Theorem 6 (F µ,l -interpolation). Let I be an index set, and consider the set of triples S = {(y i, i, f i )} i I where y i, i R d and f i R for all i I. There exists a function 2 f F µ,l such that f(y i ) = f i and f(y i ) = i for all i I if and only if φ for all i, j I where φ := (L µ)(f i f j ) + y j i (M I d ) y j i (5) j j with M S 4 defined in () Positive Definite Quadratics From Samplin Recall from Section 3 that the Lyapunov function must satisfy two conditions: (i) V must be positive definite (i.e, nonneative, zero at the fixed-point, and radially unbounded), and (ii) V := V + ρ 2 V must be neative semidefinite (i.e., V must satisfy the decrease condition). To prove both (i) and (ii), we use the followin theorem, which provides necessary and sufficient conditions for a quadratic form to be positive (semi-)definite when the iterates are enerated by method (M) applied to f F µ,l. Theorem 7 (Sampled positive definite quadratics). Consider applyin the first-order iterative fixed-step method (M) of deree to a smooth stronly convex function f F µ,l (R d ) for K iterations. Suppose the step-sizes α, β j, and γ j are such that α, γ, and j= y i T β j = γ j =. j= Define the vectors x R (+)d, R (K+)d, and f R K+ as x := [ (x x ) T... (x x ) T] T := [ ( ) T... ( K ) T] T y i (6a) (6b) f := [ f f... f K f ] T (6c) and denote the triple ξ := (x,, f). Define m (K) R K+ and M (K) S +K+2 such that [ ] T [ x φ (ξ) = (M (K) x I d ) ] + ( m (K) ) Tf (7) for i, j I K := {,..., K, } where φ is defined in (5). Consider the quadratic function [ ] T [ x x σ(ξ) = (Q I d ) + q ] T f 2 In other words, we say that the set S is F µ,l-interpolable. where Q S +K+2 and q R +. Suppose the dimension d satisfies 3 d + K + 2. Then σ is positive semidefinite (i.e., nonneative) if and only if there exists τ for i, j I K such that Q τ M (K) i,j I K q τ m (K). i,j I K Furthermore, σ is positive definite if and only if there exists τ for i, j I K such that Q τ M (K) i,j I K (9a) < q τ m (K). i,j I K (9b) Proof. We prove the second statement that σ is positive definite if and only if there exists τ for i, j I K such that (9) holds; the proof of the first statement is similar. Here we prove that the conditions are sufficient for σ to be positive definite; necessity is more involved and can be found in the supplementary material. (Sufficiency). Suppose there exists τ for i, j I K such that (9) holds. Clearly, we have σ() =. ow assume that ξ. Sum the followin two inequalities: (i) tae the Kronecer product of (9a) with I d and multiply the result on the left and riht by [ x T T] and its transpose, respectively, and (ii) multiply the tranpose of (9b) on the riht by f. Doin so ives the inequality < σ(ξ) i,j I K τ φ (ξ) (2) which is strict due to the strict inequalities in (9) and since ξ. Since f F µ,l, we have φ from Thm. 6, so i,j I K τ φ (ξ) < σ(ξ). Then σ(ξ), and σ(ξ) = if and only if ξ =. Finally, note that the strict inequalities in (9) imply that the riht side of (2) rows arbitrarily lare as ξ, so σ is radially unbounded. Thus, σ is positive definite. Remar 8. Theorem 7 can be seen as a specialized application of the S-procedure (Boyd et al., 994; Meretsi & Treil, 993) where the points in ξ are enerated by method (M), and the positive semidefinite quadratic terms come from the interpolation conditions in Theorem 6. While 3 This requirement is only used for necessity.

7 the S-procedure is nown to be lossy in certain cases (i.e., the conditions are sufficient but not necessary for σ to be positive (semi)definite), Theorem 7 shows that it is in fact lossless under the lare-scale assumption d + K + 2. We now apply Theorem 7 to obtain necessary and sufficient conditions for both V to be positive definite and V := V(ξ + ) ρ 2 V(ξ ) to be neative semidefinite. To that f (K) in (8) end, note that the basis vectors x (K),, and are such that [ ] x x = ( x (K) x I d ) for {,..., K} [ ] = ( x I d ) for {,..., K} f (K) f f = f for {,..., K} [ ] y x = (ȳ (K) x I d ) for {,..., K}. where x,, and f are defined in (6) and we used the iterations in (9). We can then sum the followin: (i) tae the Kronecer product of M (K) in () with I d and multiply the result on the left and riht by [ x T T] and its transpose, respectively, and (ii) multiply the transpose of m (K) on the riht by f. Addin these two quantities ives (7). Similarly, the Lyapunov function in (5) is iven by V(ξ ) = [ ] T x (V (K) I d ) [ ] x + ( v (K) ) Tf (2) and the decrease in the Lyapunov function is iven by V (ξ ) = [ ] T x ( V (K) I d ) [ ] x + ( v (K) ) Tf (22) usin the definitions in (2) and (3). This leads to the followin results. Corollary 9 (V positive definite). V in (5) is positive definite for all values of d if and only if there exists λ for i, j I such that V () λ M () i,j I < v () λ m () i,j I where M () and m () defined in (). Proof. The result follows from applyin Theorem 7 with K = to show that the quadratic function V in (2) is positive definite. Corollary ( V neative semidefinite). Consider V in (5) and define V := V(ξ + ) ρ 2 V(ξ ). Then V is neative semidefinite for all values of d if and only if there exists η for i, j I + such that V (+) + η M (+) i,j I + v (+) + η m (+) i,j I + where V (+) and v (+) are defined in (3). Proof. The result follows from applyin Theorem 7 with K = + to show that the quadratic function V in (22) is neative semidefinite. Theorem 5 then follows from combinin the results in Corollaries 9 and. In particular, the inequalities in each corollary correspond to the constraints in the semidefinite proram (ρ-sdp). If the problem is feasible, then V is positive definite and V is neative semidefinite at iteration. Since this holds for any initial condition, we can apply the result for each to show that V is a valid Lyapunov function. On the other hand, if the problem is infeasible, then there exists no quadratic function of the form (5) such that V is positive definite and V is neative semidefinite, so no valid quadratic Lyapunov function with state ξ exists for the iven rate ρ. This completes the proof of Thm Extensions Our main result in Theorem 5 applies to methods of the form (M) with fixed step-sizes applied to smooth stronly convex functions. Our framewor, however, can be extended to many other scenarios, with or without tihtness. We now proceed with some examples of how our procedure of searchin for Lyapunov functions can serve as a basis for the analysis of many exotic alorithms. We provide two such examples: (i) the analysis of variants of GM and HBM involvin subspace searches, and (ii) the analysis of a fast radient scheme with scheduled restarts. 6.. Exact Line Searches In this section, we search for quadratic Lyapunov functions when it is possible to perform an exact line search. We illustrate the procedure on steepest descent α = ar min f(x α f(x )) α x + = x α f(x ) and on a variant of HBM: (α, β) = ar min f(x + β (x x ) α f(x )) α,β x + = x + β (x x ) α f(x )

8 The detailed analyses can be found in the supplementary material, whereas the results are presented on Fiure 3. Worst-case converence rate ρ GM (subspace search) HBM (subspace search).2 GM (fixed-step of /L) TMM (fixed-step) 2 3 Condition ratio κ Fiure 3. Converence rates of GM and HBM with subspace searches. ote that the Gradient Method with exact line search matches the worst-case rate κ from (de Kler et al., 27). For κ+ comparison, the rates of the Gradient Method with step-size /L and the Triple Momentum Method are also shown. Evaluations to converence / lo ρ 3 2 = = 5 = = Condition ratio κ Fiure 4. Worst-case number of radient evaluations to converence O( / lo ρ) for different restart schedules (purple) alon with the optimal restart schedule = ar min ρ() (dashed blue). For comparison, we also plot the upper bound ( ρ( ) exp e 8κ ) (dashed blac) from (O Donohue & Candès, 25). Results are not shown for small κ due to numerical limitations of the SDP solvers Scheduled Restarts In this section, we apply the methodoloy to estimate the converence rate of FGM with scheduled restarts; motivations for this ind of techniques can be found in e.., (O Donohue & Candès, 25). We present numerical uarantees obtained when usin a version of FGM tailored for smooth convex minimization, which is restarted every iterations. This settin oes slihtly beyond the fixed-step model presented in (M), as the step-size rules depend on the iteration counter. Define β := and β i+ := + 4β 2 i + 2 ; we use the followin iterative procedure y, z y z i+ = y i L f(yi ) y i+ = z i+ + β i β i+ (z i+ z i ) (23) which does steps of the standard fast radient method (esterov, 983) before restartin. We study the converence of this scheme usin quadratic Lyapunov functions with states (y y, f(y ), f(y ) f(y )). The derivations of the SDP for verifyin V + ρ 2 V (where ρ is the converence rate of the inner loop) is similar to that of (ρ-sdp) (details in supplementary material). umerical results are provided in Fiure Conclusion In this wor, we studied first-order iterative fixed-step methods applied to smooth stronly convex functions. We presented a semidefinite formulation whose feasibility is both necessary and sufficient for the existence of a quadratic Lyapunov function. For smooth stronly convex minimization, restriction to quadratic Lyapunov functions is natural, as nonlinearities are exactly characterized by quadratic interpolation constraints. Usin other tools such as sum-of-squares Lyapunov functions (see e.., Parrilo (2)) could be beneficial for more eneral alorithm and problem classes. This methodoloy unifies two previous approaches to worstcase analyses: performance estimation due to Drori & Teboulle (24) and interal quadratic constraints due to Lessard et al. (26). Moreover, this approach admits a lare number of potential extensions, both in terms of classes of optimization problems and types of alorithms that can be analyzed (see e.., extensions for performance estimation (Taylor et al., 27b)). In particular, Lyapunov functions can be used to study sublinear converence rates (see e.., Hu & Lessard (27)), switched systems (Lin & Antsalis, 29) (e.., for adaptive methods), noisy methods (see e.., Cyrus et al. (28)), or continuous-time settins such as in (Su et al., 26). Code The code used to implement (ρ-sdp) and enerate the fiures in this paper is available at com/qcgroup/quad-lyap-first-order.

9 Acnowledments A. Taylor was supported by the European Research Council (ERC) under the European Union s Horizon 22 research and innovation proram (rant areement 72463). B. Van Scoy and L. Lessard were supported by the ational Science Foundation under Grants o and This wor was launched durin the LCCC Focus Period on Lare-Scale and Distributed Optimization oranized by the Automatic Control Department of Lund University. The authors than the oranizers. References Boyd, S. and Vandenberhe, L. Convex Optimization. Cambride University Press, 24. Boyd, S., El Ghaoui, L., Feron, E., and Balarishnan, V. Linear Matrix Inequalities in System and Control Theory, volume 5 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, June 994. Cyrus, S., Hu, B., Van Scoy, B., and Lessard, L. A robust accelerated optimization alorithm for stronly convex functions. In Proceedins of the 28 American Control Conference (to appear), 28. de Kler, E., Glineur, F., and Taylor, A. B. On the worstcase complexity of the radient method with exact line search for smooth stronly convex functions. Optimization Letters, (7):85 99, Oct 27. Drori, Y. and Teboulle, M. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Prorammin, 45():45 482, 24. Hu, B. and Lessard, L. Dissipativity theory for esterov s accelerated method. In Proceedins of the 34th International Conference on Machine Learnin, volume 7, pp , 27. Kalman, R. E. and Bertram, J. E. Control system analysis and desin via the second method of Lyapunov: I Continuous-time systems. Journal of Basic Enineerin, 82(2):37 393, 96a. Kalman, R. E. and Bertram, J. E. Control system analysis and desin via the second method of Lyapunov: II Discrete-time systems. Journal of Basic Enineerin, 82 (2):394 4, 96b. Lessard, L., Recht, B., and Pacard, A. Analysis and desin of optimization alorithms via interal quadratic constraints. SIAM Journal on Optimization, 26():57 95, 26. Lin, H. and Antsalis, P. J. Stability and stabilizability of switched linear systems: A survey of recent results. IEEE Transactions on Automatic Control, 54(2):38 322, February 29. Lyapunov, A. M. and Fuller, A. T. General Problem of the Stability Of Motion. Control Theory and Applications Series. Taylor & Francis, 992. Oriinal text in Russian, 892. Meretsi, A. and Rantzer, A. System analysis via interal quadratic constraints. IEEE Transactions on Automatic Control, 42(6):89 83, June 997. Meretsi, A. and Treil, S. Power distribution inequalities in optimization and robustness of uncertain systems. Journal of Mathematical Systems, Estimation, and Control, 3 (3):3 39, 993. esterov, Y. A method for unconstrained convex minimization problem with the rate of converence o (/ˆ 2). In Dolady A USSR, volume 269, pp , 983. esterov, Y. Introductory lectures on convex optimization: A basic course. Spriner, 24. O Donohue, B. and Candès, E. Adaptive restart for accelerated radient schemes. Foundations of computational mathematics, 5(3):75 732, 25. Parrilo, P. A. Structured semidefinite prorams and semialebraic eometry methods in robustness and optimization. PhD thesis, California Institute of Technoloy, 2. Polya, B. Some methods of speedin up the converence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5): 7, 964. Su, W., Boyd, S., and Candes, E. J. A differential equation for modelin nesterovs accelerated radient method: theory and insihts. Journal of Machine Learnin Research, 7(53): 43, 26. Taylor, A. B., Hendricx, J. M., and Glineur, F. Performance estimation toolbox (PESTO): automated worst-case analysis of first-order optimization methods. In IEEE Conference on Decision and Control, 27a. Taylor, A. B., Hendricx, J. M., and Glineur, F. Exact worstcase performance of first-order methods for composite convex optimization. SIAM Journal on Optimization, 27 (3):283 33, 27b. Taylor, A. B., Hendricx, J. M., and Glineur, F. Smooth stronly convex interpolation and exact worst-case performance of first-order methods. Mathematical Prorammin, 6():37 345, Jan 27c.

10 Van Scoy, B., Freeman, R. A., and Lynch, K. M. The fastest nown lobally converent first-order method for minimizin stronly convex functions. IEEE Control Systems Letters, 2():49 54, January 28. Vidyasaar, M. onlinear systems analysis, volume 42. SIAM, 22. A. Proof of Theorem 7 (Sampled Positive Definite Quadratics) We bein by notin that σ is positive definite if and only if the optimal value of the followin problem is positive for any value of ε >, inf ξ s.t. σ(ξ) (x,, f) as in (6) enerated by method (M) applied to f F µ,l, ξ ε. ext, we discretize the problem by replacin f F µ,l with the equivalent condition that the discrete set of points {(y i, i, f i )} i IK is F µ,l -interpolable (recall that I K = {,..., K, }). Choosin a specific notion of distance for ξ, we can reformulate the previous statement as verifyin that p (d) (ε) > for all ε > where p (d) (ε) := min s.t. x,,f σ(ξ) {(y i, i, f i )} i I is F µ,l -interpolable, (x,, f) as in (6) enerated by (M), x T f = ε. ote that the last condition can equivalently be replaced by others (e.., x 2 = ε or 2 = ε or T f = ε), and that the optimal value of this problem is attained (usin a short homoeneity arument with respect to ε). Usin the necessary and sufficient conditions for the set of points {(y i, i, f i )} i I to be F µ,l -interpolable from Theorem 6, we have p (d) (ε) = min s.t. x,,f σ(ξ) φ for all i, j I K, (x,, f) as in (6) enerated by (M), x T f = ε. ext, we define the Gram matrix G S +K+2 as G := B T B where B := [ x x... x x... K ], (24) hence G is a standard Gram matrix containin all inner products between x i x for i {,..., } and i for i {,..., K}. ote that the quadratic σ can be written as a function of the Gram matrix as σ(g, f) = tr ( QG ) + q T f. Similarly, the interpolation conditions can also be reformulated with the Gram matrix as φ (G, f) = tr ( M G ) + m T f where M and m are such that [ ] T [ ] x x φ = (M I d ) + m T f, for all i, j I K (hence also is in the index set). Therefore, we can reformulate the previous problem as the followin ran-constrained semidefinite proram: p (d) (ε) = min tr ( QG ) + q T f G S +K+2,f R + tr ( M G ) + m T f for i, j I, tr(g) + T f = ε, s.t. f, G, Ran(G) d, where we remind the reader that d is the dimension of the optimization problem (P). Therefore, as discussed in (Taylor et al., 27c;b), if we want a result that does not depend on the dimension (i.e, a σ that is positive definite whatever the value of d), we have to verify that p ( ) (ε) > (which corresponds to assumin that d +K+2 since this is the dimension of G). We then have the followin semidefinite proram: p ( ) (ε) = min tr ( QG ) + q T f G S +K+2,f R + tr ( M G ) + m T f for i, j I, tr(g) + s.t. T f = ε, f, G. A Slater point for this problem (i.e., a feasible point such that G ) is obtained in the followin section, so the optimal value of the primal problem is equal to the optimal value of the dual, which is iven by d ( ) (ε) := max ν ε {λ },ν λ for all i, j I, s.t. Q i,j I λ M νi +K+2, q i,j I λ m ν +. The theorem is then proved by notin the equivalence p ( ) (ε) >, ε > d ( ) (ε) >, ε >, where the last statement amounts to verifyin that Q λ M, q λ m >. i,j I i,j I

11 B. Slater Point for Proof of Theorem 7 In this section, we show how to construct a Slater point (Boyd & Vandenberhe, 24) for the primal semidefinite proram in the proof of Theorem 7. The construction is similar to Section 2..2 of (esterov, 24) and the proof of Theorem 6 in (Taylor et al., 27c). Consider applyin the first-order iterative fixed-step method (M) with α and γ for K iterations to the function f(x) = 2 xt Hx where H S d with d + K + 2 is the positive definite tridiaonal matrix defined by 2 if i = j [H] = if i j = otherwise which has maximum eienvalue L = cos ( π/(d + ) ). Define the matrix B in (24). Usin the initial condition x i = e ++i for i =,...,, we will show that B is upper trianular with nonzero diaonal elements, and hence full ran. Since γ, y has a nonzero element correspondin to e +. Then = Hy has a nonzero element correspondin to e +2 due to the tridiaonal structure of H. Furthermore, y may only have nonzero elements correspondin to e j for j =,..., +, so must have zero components correspondin to e j for all j > + 2. We now continue by induction. Assume that has a nonzero element correspondin to e +2+ and zero elements correspondin to e i for all i > +2+. Since α, x + has a nonzero element correspondin to e +2+ and zero elements correspondin to e i for all i > Then since γ, y + also has the same structure. Due to the tridiaonal structure of H, + then has a nonzero element correspondin to e +3+ and zero elements correspondin to e i for all i > +3+. Therefore, by induction we have shown that for all, has a nonzero element correspondin to e +2+ and zero elements correspondin to e i for all i > For P to be upper trianular, we need K to have dimension at least K. Thus, if d K where K is the number of iterations, then B is upper trianular with nonzero entries on the diaonal, and therefore has full ran. In order to mae the statement hold for eneral µ < L, observe that the tridiaonal structure of H is preserved under the operation H = ( H λ min (H) I ) L µ λ max (H) λ min (H) + µi where µi H LI. Since B has full ran, the Gram matrix G = B T B is positive definite. Therefore, the primal semidefinite proram satisfies Slater s condition. C. Steepest Descent In this section, we show a similar formulation as (ρ-sdp) for steepest descent. In this case, the analysis was not a priori uaranteed to be tiht, due to the line search conditions. In order to encode the line search, we use the correspondin optimality conditions, as in (de Kler et al., 27): x + x, + =,, + =, (25) with = f(x ). For the Lyapunov function structure, we choose the followin V (ξ ) = [ ] [ ] T x x x x (P I d ) + p(f f ). In order to develop a SDP formulation for this problem we follow the same steps as for (ρ-sdp), startin with Step (see Section 4.2): we define the followin row vectors in R 2 ȳ () := e T, x () := e T, ḡ () and ȳ () = x () = ḡ () := T, alon with the scalars f () () := and f :=. In addition, we use the followin vectors in R 4 and ȳ () = x () = ḡ () R 2 () such that f := e T, x () := e T, x () ȳ () ȳ () ḡ () = e T 3, ḡ () := e T 4, := T, alon with () f := e T 2 and () f () f () (), f, f := T. Because of the alorithm, Step 2 is slihtly different as before; we encode the line search constraints (25) usin A = A 2 = [ x () x () ḡ () ḡ () ḡ () ] [ ] ḡ () T ] T [ ḡ (). x () x () ḡ (), The subsequent steps (Step 3 and Step 4) are exactly the same as in Section 4.2. We finally obtain a slihtly modified

12 version of the feasibility problem (ρ-sdp): feasible P S 2 p R ν,ν 2 R {λ } {η } V () λ M () i,j I < v () λ m () i,j I V () + η M () i,j I v () + η m () i,j I λ for i, j I η for i, j I with I := {, } and I := {,, }. + 2 ν i A i i= D. SDP for HBM with Subspace Searches We follow the steps of the previous section for steepest descent; we only mae the followin adaptations: (i) we loo for a quadratic Lyapunov function with the states x x x x V (ξ ) = x x (P I d) x x [ ] + p T f f, f f (ii) we adapt the initialization (Step ), (iii) adapt the line search conditions (Step 2) and (iv) obtain a slihtly modified version of the SDP. For (ii), we adapt the initialization procedure (Step ) as follows. We define the followin row vectors of R 4 : x () := e T, x () ȳ () := e T, ȳ () ḡ () := e T 3, ḡ () := e T 4, alon with ȳ () = x () = ḡ () := T, and the followin in R 2 () : f := e T (), f := e T () 2 and f := T. We also define the followin row vectors in R 7 : := et, := e T 3, 2 := e T 4, ȳ (2) :=, ȳ(2) :=, ȳ(2) 2 := 2, := e T 5, := e T 6, 2 := e T 7, alon with y (2) = = f (2) := e T, f (2) := T and the vectors of R 3 : (2) f 2 := e T (2) 3 and f := T. T ow for (iii) (or Step 2), optimality of the search conditions can be x + x, + =, x x, + =, ; + =, which we can formulate in matrix form for {, }: A + = A 3+ = A 5+ = [ T T ] T [ ] [ ] The subsequent steps (Step 3 and Step 4) are exactly the same as in Section 4.2. We finally obtain a slihtly modified version of the feasibility problem (ρ-sdp): feasible P S 2 p R ν,...,ν 6 R {λ } {η } V () λ M () i,j I < v () λ m () i,j I V (2) + η M (2) i,j I 2 v (2) + η m (2) i,j I 2 λ for i, j I η for i, j I ν i A i i= with I := {,, } and I 2 := {,, 2, }. The correspondin results are presented on Fiure 3. E. SDP for FGM with Scheduled Restarts This settin oes slihtly beyond the fixed-step model presented in (M), as the step sizes depend on the iteration. We study the alorithm described by (23), which does steps of the standard fast radient method (esterov, 983) before restartin. We study the converence of this scheme usin quadratic Lyapunov functions of the form [ ] y T x f(y ) (P I d ) [ ] y x f(y ) + p [f(y ) f(x )]. (26) Let us perform similar steps as for (ρ-sdp) for constructin the correspondin SDP. We start with the initialization

13 procedure Step. Let us define the followin row vectors in R 2 : ȳ () = x () := e T, ḡ () and x () = ḡ () := T, alon with the scalars f () := and f () :=. In addition, we define the row vectors of R +2 : x (+) := e T, ḡ (+) := e T 2+, for =,...,, alon with x (+) and the row vectors e T + f (+) (+) and f := T. R + defined as Lyapunov Functions for First-Order Methods = ḡ (+) := T (+) f := Step 2 Apply one complete loop of the alorithm as follows: for =,..., define the sequence of row vectors: z (+) + = ȳ (+) Lḡ(+), ȳ (+) + = z (+) + β ( z (+) z (+) ), β + with β := and β + := + 4β For complyin with the notations of the paper, we define the sequence x (K) := y (K) for =,..., and K {, +}. Then, usin the sets I := {, } and I + := {,...,, }, the other staes follow from the same lines as Step 3, and Step 4 with the sliht modification of the expression for the rate v (+) V (+) := v (+) + ρ 2 v (+), := V (+) + ρ 2 V (+), and Step 5 follows as in Section 4.2: feasible P S 2(+) p R + {λ } {η } V () λ M () i,j I < v () λ m () i,j I V (+) v (+) λ for i, j I η for i, j I + + η M (+) i,j I + + η m (+) i,j I + (note that the sets I and I + should use the definitions of this section). umerical results are available in Fiure 4.

Dissipativity Theory for Nesterov s Accelerated Method

Dissipativity Theory for Nesterov s Accelerated Method Bin Hu Laurent Lessard Abstract In this paper, we adapt the control theoretic concept of dissipativity theory to provide a natural understanding of Nesterov s accelerated method. Our theory ties rigorous