Dissipativity Theory for Nesterov's Accelerated Method


Bin Hu    Laurent Lessard

University of Wisconsin–Madison, Madison, WI 53706, USA. Correspondence to: Bin Hu <bhu38@wisc.edu>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

In this paper, we adapt the control-theoretic concept of dissipativity theory to provide a natural understanding of Nesterov's accelerated method. Our theory ties rigorous convergence rate analysis to the physically intuitive notion of energy dissipation. Moreover, dissipativity allows one to efficiently construct Lyapunov functions (either numerically or analytically) by solving a small semidefinite program. Using novel supply rate functions, we show how to recover known rate bounds for Nesterov's method, and we generalize the approach to certify both linear and sublinear rates in a variety of settings. Finally, we link the continuous-time version of dissipativity to recent works on algorithm analysis that use discretizations of ordinary differential equations.

1. Introduction

Nesterov's accelerated method (Nesterov, 2003) has garnered attention and interest in the machine learning community because of its fast global convergence rate guarantees. The original convergence rate proofs of Nesterov's accelerated method are derived using the method of estimate sequences, which has proven difficult to interpret. This observation motivated a sequence of recent works on new analysis and interpretations of Nesterov's accelerated method (Bubeck et al., 2015; Lessard et al., 2016; Su et al., 2016; Drusvyatskiy et al., 2016; Flammarion & Bach, 2015; Wibisono et al., 2016; Wilson et al., 2016).

Many of these recent papers rely on Lyapunov-based stability arguments. Lyapunov theory is an analogue to the principle of minimum energy and brings a physical intuition to convergence behaviors. When applying such proof techniques, one must construct a Lyapunov function, which is a nonnegative function of the algorithm's state (an "internal energy") that decreases along all admissible trajectories. Once a Lyapunov function is found, one can relate the rate of decrease of this internal energy to the rate of convergence of the algorithm.

The main challenge in applying Lyapunov's method is finding a suitable Lyapunov function. There are two main approaches for Lyapunov function constructions. The first approach adopts the integral quadratic constraint (IQC) framework (Megretski & Rantzer, 1997) from control theory and formulates a linear matrix inequality (LMI) whose feasibility implies the linear convergence of the algorithm (Lessard et al., 2016). Despite the generality of the IQC approach and the small size of the associated LMI, one must typically resort to numerical simulations to solve the LMI. The second approach seeks an ordinary differential equation (ODE) that can be appropriately discretized to yield the algorithm of interest. One can then gain intuition about the trajectories of the algorithm by examining trajectories of the continuous-time ODE (Su et al., 2016; Wibisono et al., 2016; Wilson et al., 2016). The work of Wilson et al. (2016) also establishes a general equivalence between Lyapunov functions and estimate sequence proofs.

In this paper, we bridge the IQC approach (Lessard et al., 2016) and the discretization approach (Wilson et al., 2016) by using dissipativity theory (Willems, 1972a;b).
The term "dissipativity" is borrowed from the notion of energy dissipation in physics, and the theory provides a general approach for the intuitive understanding and construction of Lyapunov functions. Dissipativity for quadratic Lyapunov functions in particular (Willems, 1972b) has seen widespread use in controls. In the sequel, we tailor dissipativity theory to the automated construction of Lyapunov functions, which are not necessarily quadratic, for the analysis of optimization algorithms. Our dissipation inequality leads to an LMI condition that is simpler than the one in Lessard et al. (2016) and hence more amenable to being solved analytically. When specialized to Nesterov's accelerated method, our LMI recovers the Lyapunov function proposed in Wilson et al. (2016). Finally, we extend our LMI-based approach to the sublinear convergence analysis of Nesterov's accelerated method in both discrete and continuous time domains. This complements the original LMI-based approach in Lessard et al. (2016), which mainly handles linear convergence rate analyses.

An LMI-based approach for sublinear rate analysis similar to ours was independently and simultaneously proposed by Fazlyab et al. (2017). While that work and the present work both draw connections to the continuous-time results mentioned above, different algorithms and function classes are emphasized. For example, Fazlyab et al. (2017) develops LMIs for gradient descent and proximal/projection-based variants with convex/quasi-convex objective functions. In contrast, the present work develops LMIs tailored to the analysis of discrete-time accelerated methods, and Nesterov's method in particular.

2. Preliminaries

2.1. Notation

Let $\mathbb{R}$ and $\mathbb{R}_+$ denote the real and nonnegative real numbers, respectively. Let $I_p$ and $0_p$ denote the $p \times p$ identity and zero matrices, respectively. The Kronecker product of two matrices is denoted $A \otimes B$ and satisfies the properties $(A \otimes B)^T = A^T \otimes B^T$ and $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ when the matrices have compatible dimensions. Matrix inequalities hold in the semidefinite sense unless otherwise indicated. A differentiable function $f : \mathbb{R}^p \to \mathbb{R}$ is $m$-strongly convex if $f(x) \ge f(y) + \nabla f(y)^T (x - y) + \frac{m}{2}\|x - y\|^2$ for all $x, y \in \mathbb{R}^p$, and is $L$-smooth if $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$ for all $x, y \in \mathbb{R}^p$. Note that $f$ is convex if $f$ is 0-strongly convex. We use $x_\star$ to denote a point satisfying $\nabla f(x_\star) = 0$. When $f$ is $L$-smooth and $m$-strongly convex, $x_\star$ is unique.

2.2. Classical Dissipativity Theory

Consider a linear dynamical system governed by the state-space model
$$\xi_{k+1} = A \xi_k + B w_k. \tag{1}$$
Here, $\xi_k \in \mathbb{R}^{n_\xi}$ is the state, $w_k \in \mathbb{R}^{n_w}$ is the input, $A \in \mathbb{R}^{n_\xi \times n_\xi}$ is the state transition matrix, and $B \in \mathbb{R}^{n_\xi \times n_w}$ is the input matrix. The input $w_k$ can be physically interpreted as a driving force. Classical dissipativity theory describes how the internal energy stored in the state $\xi_k$ evolves with time as one applies the input $w_k$ to drive the system. A key concept in dissipativity theory is the supply rate, which characterizes the energy change in $\xi_k$ due to the driving force $w_k$. The supply rate is a function $S : \mathbb{R}^{n_\xi} \times \mathbb{R}^{n_w} \to \mathbb{R}$ that maps any state/input pair $(\xi, w)$ to a scalar measuring the amount of energy delivered from $w$ to the state $\xi$. We now introduce the notion of dissipativity.

Definition 1. The dynamical system (1) is dissipative with respect to the supply rate $S$ if there exists a function $V : \mathbb{R}^{n_\xi} \to \mathbb{R}_+$ such that $V(\xi) \ge 0$ for all $\xi \in \mathbb{R}^{n_\xi}$ and
$$V(\xi_{k+1}) - V(\xi_k) \le S(\xi_k, w_k) \tag{2}$$
for all $k$. The function $V$ is called a storage function, which quantifies the energy stored in the state $\xi$. In addition, (2) is called the dissipation inequality.

The dissipation inequality (2) states that the change of the internal energy stored in $\xi_k$ is equal to the difference between the supplied energy and the dissipated energy. Since there will always be some energy dissipating from the system, the change in the stored energy (which is exactly $V(\xi_{k+1}) - V(\xi_k)$) is always bounded above by the energy supplied to the system (which is exactly $S(\xi_k, w_k)$). A variant of (2), known as the exponential dissipation inequality, states that for some $0 \le \rho < 1$ we have
$$V(\xi_{k+1}) - \rho^2 V(\xi_k) \le S(\xi_k, w_k), \tag{3}$$
which states that at least a fraction $(1 - \rho^2)$ of the internal energy will dissipate at every step. The dissipation inequality (3) provides a direct way to construct a Lyapunov function based on the storage function. It is often the case that we have prior knowledge about how the driving force $w_k$ is related to the state $\xi_k$, and thus we may know additional information about the supply rate function $S(\xi_k, w_k)$. For example, if $S(\xi_k, w_k) \le 0$ for all $k$, then (3) directly implies that $V(\xi_{k+1}) \le \rho^2 V(\xi_k)$, and the storage function $V$ can serve as a Lyapunov function.
The condition $S(\xi_k, w_k) \le 0$ means that the driving force $w_k$ does not inject any energy into the system and may even extract energy out of the system. Then the internal energy will decrease no slower than the linear rate $\rho^2$ and approach a minimum value at equilibrium.

An advantage of dissipativity theory is that for any quadratic supply rate, one can automatically construct the dissipation inequality using semidefinite programming. We now state a standard result from the controls literature.

Theorem 2. Consider the following quadratic supply rate with $X \in \mathbb{R}^{(n_\xi + n_w) \times (n_\xi + n_w)}$ and $X = X^T$:
$$S(\xi, w) := \begin{bmatrix} \xi \\ w \end{bmatrix}^T X \begin{bmatrix} \xi \\ w \end{bmatrix}. \tag{4}$$
If there exists a matrix $P \in \mathbb{R}^{n_\xi \times n_\xi}$ with $P \succeq 0$ such that
$$\begin{bmatrix} A^T P A - \rho^2 P & A^T P B \\ B^T P A & B^T P B \end{bmatrix} - X \preceq 0, \tag{5}$$
then the dissipation inequality (3) holds for all trajectories of (1) with $V(\xi) := \xi^T P \xi$.

Proof. Based on the state-space model (1), we have
$$V(\xi_{k+1}) = \xi_{k+1}^T P \xi_{k+1} = (A\xi_k + Bw_k)^T P (A\xi_k + Bw_k) = \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} A^T P A & A^T P B \\ B^T P A & B^T P B \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}.$$
Hence we can left- and right-multiply (5) by $[\xi_k^T \; w_k^T]$ and $[\xi_k^T \; w_k^T]^T$ and directly obtain the desired conclusion.

The left side of (5) is linear in $P$, so (5) is a linear matrix inequality (LMI) for any fixed $A$, $B$, $X$, $\rho$. The set of $P$ such that (5) holds is therefore a convex set and can be efficiently searched using interior-point methods, for example. To apply dissipativity theory for linear convergence rate analysis, one typically follows two steps.

1. Choose a proper quadratic supply rate function $S$ satisfying certain desired properties, e.g. $S(\xi_k, w_k) \le 0$.
2. Solve the LMI (5) to obtain a storage function $V$, which is then used to construct a Lyapunov function.

In step 2, the LMI obtained is typically very small, e.g. $2 \times 2$ or $3 \times 3$, so we can often solve the LMI analytically. For illustrative purposes, we rephrase the existing LMI analysis of the gradient descent method (Lessard et al., 2016, §4.4) using the notion of dissipativity.

2.3. Example: Dissipativity for Gradient Descent

There is an intrinsic connection between dissipativity theory and the IQC approach (Megretski et al., 2010; Seiler, 2015). The IQC analysis of the gradient descent method in Lessard et al. (2016) may be reframed using dissipativity theory. Then the pointwise IQC (Lessard et al., 2016, Lemma 6) amounts to using a quadratic supply rate $S$ with $S \le 0$. Specifically, assume $f$ is $L$-smooth and $m$-strongly convex, and consider the gradient descent method
$$x_{k+1} = x_k - \alpha \nabla f(x_k). \tag{6}$$
We have $x_{k+1} - x_\star = x_k - x_\star - \alpha \nabla f(x_k)$, where $x_\star$ is the unique point satisfying $\nabla f(x_\star) = 0$. Define $\xi_k := x_k - x_\star$ and $w_k := \nabla f(x_k)$. Then the gradient descent method is modeled by (1) with $A := I_p$ and $B := -\alpha I_p$. Since $w_k = \nabla f(\xi_k + x_\star)$, we can define the quadratic supply rate
$$S(\xi_k, w_k) := \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} 2mL I_p & -(m+L) I_p \\ -(m+L) I_p & 2 I_p \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}. \tag{7}$$
By co-coercivity, we have $S(\xi_k, w_k) \le 0$ for all $k$. This just restates Lessard et al. (2016, Lemma 6). Then we can directly apply Theorem 2 to construct the dissipation inequality. We can parameterize $P = p I_p$ and define the storage function as $V(\xi_k) = p\, \xi_k^T \xi_k = p \|x_k - x_\star\|^2$. The LMI (5) becomes
$$\begin{bmatrix} (1-\rho^2) p - 2mL & -\alpha p + (m+L) \\ -\alpha p + (m+L) & \alpha^2 p - 2 \end{bmatrix} \otimes I_p \preceq 0.$$
Hence for any $0 \le \rho < 1$, we have $p\|x_{k+1} - x_\star\|^2 \le \rho^2 p \|x_k - x_\star\|^2$ if there exists $p \ge 0$ such that
$$\begin{bmatrix} (1-\rho^2) p - 2mL & -\alpha p + (m+L) \\ -\alpha p + (m+L) & \alpha^2 p - 2 \end{bmatrix} \preceq 0. \tag{8}$$
The LMI (8) is simple and can be solved analytically to recover the existing rate results for the gradient descent method. For example, we can choose $(\alpha, \rho, p)$ to be $(1/L,\; 1 - m/L,\; L^2)$ or $(2/(L+m),\; (L-m)/(L+m),\; (L+m)^2/2)$ to immediately recover the standard rate results in Polyak (1987).
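To make the two-step procedure concrete, here is a minimal numerical sketch of solving the LMI (8); it is our illustration rather than code from the paper, and it assumes the numpy and cvxpy packages (the helper names lmi_8_feasible and best_rate are ours). For a fixed step size it bisects on the rate ρ, checking at each step whether some p ≥ 0 renders (8) negative semidefinite.

```python
import numpy as np
import cvxpy as cp

def lmi_8_feasible(m, L, alpha, rho):
    """Return True if some p >= 0 makes the 2x2 LMI (8) hold."""
    p = cp.Variable(nonneg=True)
    M = cp.bmat([[(1 - rho**2) * p - 2 * m * L, -alpha * p + (m + L)],
                 [-alpha * p + (m + L), alpha**2 * p - 2]])
    prob = cp.Problem(cp.Minimize(0), [M << 0])  # feasibility problem
    prob.solve()
    return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE)

def best_rate(m, L, alpha, tol=1e-5):
    """Bisect on rho to estimate the smallest rate certified by (8)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if lmi_8_feasible(m, L, alpha, mid) else (mid, hi)
    return hi

m, L = 1.0, 10.0
print(best_rate(m, L, 1 / L))        # approx 1 - m/L = 0.9
print(best_rate(m, L, 2 / (L + m)))  # approx (L-m)/(L+m) = 0.818...
```

The printed rates should match the analytic certificates quoted above, up to solver tolerance.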

Based on the example above, it is evident that choosing a proper supply rate is critical for the construction of a Lyapunov function. The supply rate (7) turns out to be inadequate for the analysis of Nesterov's accelerated method. For Nesterov's accelerated method, the dependence between the internal energy and the driving force is more complicated due to the presence of momentum terms. We will next develop a new supply rate that captures this complicated dependence. We will also make use of this new supply rate to recover the standard linear rate results for Nesterov's accelerated method.

3. Dissipativity for Accelerated Linear Rates

3.1. Dissipativity for Nesterov's Method

Suppose $f$ is $L$-smooth and $m$-strongly convex with $m > 0$. Let $x_\star$ be the unique point satisfying $\nabla f(x_\star) = 0$. Now we consider Nesterov's accelerated method, which uses the following iteration rule to find $x_\star$:
$$x_{k+1} = y_k - \alpha \nabla f(y_k), \tag{9a}$$
$$y_k = (1+\beta) x_k - \beta x_{k-1}. \tag{9b}$$
We can rewrite (9) as
$$\begin{bmatrix} x_{k+1} - x_\star \\ x_k - x_\star \end{bmatrix} = A \begin{bmatrix} x_k - x_\star \\ x_{k-1} - x_\star \end{bmatrix} + B w_k, \tag{10}$$
where $w_k := \nabla f(y_k) = \nabla f((1+\beta)x_k - \beta x_{k-1})$. Also, $A := \tilde{A} \otimes I_p$, $B := \tilde{B} \otimes I_p$, and $\tilde{A}$, $\tilde{B}$ are defined by
$$\tilde{A} := \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix}, \qquad \tilde{B} := \begin{bmatrix} -\alpha \\ 0 \end{bmatrix}. \tag{11}$$
Hence, Nesterov's accelerated method (9) is in the form of (1) with $\xi_k = [(x_k - x_\star)^T \;\; (x_{k-1} - x_\star)^T]^T$.

Nesterov's accelerated method can improve the convergence rate because the input $w_k$ depends on both $x_k$ and $x_{k-1}$ and drives the state in a specific direction, i.e. along $(1+\beta)x_k - \beta x_{k-1}$. This leads to a supply rate that extracts energy out of the system significantly faster than with gradient descent. This is formally stated in the next lemma.

Lemma 3. Let $f$ be $L$-smooth and $m$-strongly convex with $m > 0$. Let $x_\star$ be the unique point satisfying $\nabla f(x_\star) = 0$. Consider Nesterov's method (9), or equivalently (10). The following inequalities hold for all trajectories:
$$\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T X_1 \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f(x_k) - f(x_{k+1}), \qquad \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T X_2 \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f(x_\star) - f(x_{k+1}),$$
where $X_i := \tilde{X}_i \otimes I_p$ for $i = 1, 2$, and $\tilde{X}_i$ are defined by
$$\tilde{X}_1 := \frac{1}{2} \begin{bmatrix} \beta^2 m & -\beta^2 m & -\beta \\ -\beta^2 m & \beta^2 m & \beta \\ -\beta & \beta & \alpha(2 - L\alpha) \end{bmatrix}, \tag{12}$$
$$\tilde{X}_2 := \frac{1}{2} \begin{bmatrix} (1+\beta)^2 m & -\beta(1+\beta) m & -(1+\beta) \\ -\beta(1+\beta) m & \beta^2 m & \beta \\ -(1+\beta) & \beta & \alpha(2 - L\alpha) \end{bmatrix}. \tag{13}$$
Given any $0 \le \rho \le 1$, one can define the supply rate as (4) with the particular choice $X := \rho^2 X_1 + (1 - \rho^2) X_2$. Then this supply rate satisfies the condition
$$S(\xi_k, w_k) \le \rho^2 \big(f(x_k) - f(x_\star)\big) - \big(f(x_{k+1}) - f(x_\star)\big). \tag{14}$$

Proof. The proof is similar to the proof of (3.3)–(3.4) in Bubeck (2015), but Bubeck (2015, Lemma 3.6) must be modified to account for the strong convexity of $f$. See the supplementary material for a detailed proof.

The supply rate (14) captures how the driving force $w_k$ impacts the future state $x_{k+1}$. The physical interpretation is that there is some amount of hidden energy in the system that takes the form of $f(x_k) - f(x_\star)$. The supply rate condition (14) describes how the driving force is coupled with the hidden energy in the future: it says the delivered energy is bounded by a weighted decrease of the hidden energy. Based on this supply rate, one can search for a Lyapunov function using the following theorem.

Theorem 4. Let $f$ be $L$-smooth and $m$-strongly convex with $m > 0$. Let $x_\star$ be the unique point satisfying $\nabla f(x_\star) = 0$. Consider Nesterov's accelerated method (9). For any rate $0 \le \rho < 1$, set $\tilde{X} := \rho^2 \tilde{X}_1 + (1-\rho^2) \tilde{X}_2$, where $\tilde{X}_1$ and $\tilde{X}_2$ are defined in (12)–(13). In addition, let $\tilde{A}$, $\tilde{B}$ be defined by (11). If there exists a matrix $\tilde{P} \in \mathbb{R}^{2 \times 2}$ with $\tilde{P} \succeq 0$ such that
$$\begin{bmatrix} \tilde{A}^T \tilde{P} \tilde{A} - \rho^2 \tilde{P} & \tilde{A}^T \tilde{P} \tilde{B} \\ \tilde{B}^T \tilde{P} \tilde{A} & \tilde{B}^T \tilde{P} \tilde{B} \end{bmatrix} - \tilde{X} \preceq 0, \tag{15}$$
then set $P := \tilde{P} \otimes I_p$ and define the Lyapunov function
$$V_k := \begin{bmatrix} x_k - x_\star \\ x_{k-1} - x_\star \end{bmatrix}^T P \begin{bmatrix} x_k - x_\star \\ x_{k-1} - x_\star \end{bmatrix} + f(x_k) - f(x_\star), \tag{16}$$
which satisfies $V_{k+1} \le \rho^2 V_k$ for all $k$. Moreover, we have $f(x_k) - f(x_\star) \le \rho^{2k} V_0$ for Nesterov's method.

Proof. Take the Kronecker product of (15) and $I_p$; hence (5) holds with $A := \tilde{A} \otimes I_p$, $B := \tilde{B} \otimes I_p$, and $X := \tilde{X} \otimes I_p$. Let the supply rate $S$ be defined by (4). Then define the quadratic storage function $V(\xi_k) := \xi_k^T P \xi_k$ and apply Theorem 2 to show $V(\xi_{k+1}) - \rho^2 V(\xi_k) \le S(\xi_k, w_k)$. Based on the supply rate condition (14), we can define the Lyapunov function $V_k := V(\xi_k) + f(x_k) - f(x_\star)$ and show $V_{k+1} \le \rho^2 V_k$. Finally, since $P \succeq 0$, we have $f(x_k) - f(x_\star) \le V_k \le \rho^{2k} V_0$.

We can immediately recover the Lyapunov function proposed in Wilson et al. (2016, Theorem 6) by setting $\tilde{P}$ to
$$\tilde{P} = \frac{1}{2} \begin{bmatrix} L & \sqrt{mL} - L \\ \sqrt{mL} - L & (\sqrt{L} - \sqrt{m})^2 \end{bmatrix}. \tag{17}$$
Clearly $\tilde{P} \succeq 0$. Now define $\kappa := L/m$. Given $\alpha = 1/L$, $\beta = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$, and $\rho^2 = 1 - \sqrt{m/L}$, it is straightforward to verify that the left side of the LMI (15) is equal to
$$-\frac{m(\sqrt{\kappa} - 1)^3}{2(\kappa + \sqrt{\kappa})} \begin{bmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix},$$
which is clearly negative semidefinite. Hence we can immediately construct a Lyapunov function using (16) to prove the linear rate $\rho^2 = 1 - \sqrt{m/L}$. Searching for analytic certificates such as (17) can either be carried out by directly analyzing the LMI, or by using numerical solutions to guide the search. For example, numerically solving (15) for any fixed $L$ and $m$ directly yields (17), which makes finding the analytical expression easy.
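As a sanity check on (15)–(17), the following sketch (ours, not from the paper; it assumes numpy) builds the 3x3 matrix on the left side of (15) for the certificate above and confirms that its largest eigenvalue is nonpositive.

```python
import numpy as np

def lmi15_lhs(m, L):
    """Left side of (15) with alpha = 1/L, beta = (sqrt(k)-1)/(sqrt(k)+1),
    rho^2 = 1 - sqrt(m/L), and P given by (17)."""
    kappa = L / m
    alpha = 1 / L
    beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    rho2 = 1 - np.sqrt(m / L)
    A = np.array([[1 + beta, -beta], [1, 0]])
    B = np.array([[-alpha], [0]])
    P = 0.5 * np.array([[L, np.sqrt(m * L) - L],
                        [np.sqrt(m * L) - L, (np.sqrt(L) - np.sqrt(m))**2]])
    X1 = 0.5 * np.array([[beta**2 * m, -beta**2 * m, -beta],
                         [-beta**2 * m, beta**2 * m, beta],
                         [-beta, beta, alpha * (2 - L * alpha)]])
    X2 = 0.5 * np.array([[(1 + beta)**2 * m, -beta * (1 + beta) * m, -(1 + beta)],
                         [-beta * (1 + beta) * m, beta**2 * m, beta],
                         [-(1 + beta), beta, alpha * (2 - L * alpha)]])
    X = rho2 * X1 + (1 - rho2) * X2
    top = np.hstack([A.T @ P @ A - rho2 * P, A.T @ P @ B])
    bot = np.hstack([B.T @ P @ A, B.T @ P @ B])
    return np.vstack([top, bot]) - X

print(np.linalg.eigvalsh(lmi15_lhs(1.0, 10.0)).max())  # <= 0 up to round-off
```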

3.2. Dissipativity Theory for More General Methods

We demonstrate the generality of the dissipativity theory on a more general variant of Nesterov's method. Consider the modified accelerated method
$$x_{k+1} = (1+\beta) x_k - \beta x_{k-1} - \alpha \nabla f(y_k), \tag{18a}$$
$$y_k = (1+\eta) x_k - \eta x_{k-1}. \tag{18b}$$
When $\beta = \eta$, we recover Nesterov's accelerated method. When $\eta = 0$, we recover the Heavy-ball method of Polyak (1987). We can rewrite (18) in the state-space form (10), where $w_k := \nabla f(y_k) = \nabla f((1+\eta)x_k - \eta x_{k-1})$, $A := \tilde{A} \otimes I_p$, $B := \tilde{B} \otimes I_p$, and $\tilde{A}$, $\tilde{B}$ are defined by
$$\tilde{A} := \begin{bmatrix} 1+\beta & -\beta \\ 1 & 0 \end{bmatrix}, \qquad \tilde{B} := \begin{bmatrix} -\alpha \\ 0 \end{bmatrix}.$$

Lemma 5. Let $f$ be $L$-smooth and $m$-strongly convex with $m > 0$. Let $x_\star$ be the unique point satisfying $\nabla f(x_\star) = 0$. Consider the general accelerated method (18). Define the state $\xi_k := [(x_k - x_\star)^T \;\; (x_{k-1} - x_\star)^T]^T$ and the input $w_k := \nabla f((1+\eta)x_k - \eta x_{k-1})$. Then the following inequalities hold for all trajectories:
$$\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T (X_1 + X_2) \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f(x_k) - f(x_{k+1}), \tag{19}$$
$$\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T (X_1 + X_3) \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f(x_\star) - f(x_{k+1}), \tag{20}$$
with $X_i := \tilde{X}_i \otimes I_p$ for $i = 1, 2, 3$, where
$$\tilde{X}_1 := \frac{1}{2} \begin{bmatrix} -L\delta^2 & L\delta^2 & -(1 - L\alpha)\delta \\ L\delta^2 & -L\delta^2 & (1 - L\alpha)\delta \\ -(1 - L\alpha)\delta & (1 - L\alpha)\delta & \alpha(2 - L\alpha) \end{bmatrix},$$
$$\tilde{X}_2 := \frac{1}{2} \begin{bmatrix} \eta^2 m & -\eta^2 m & -\eta \\ -\eta^2 m & \eta^2 m & \eta \\ -\eta & \eta & 0 \end{bmatrix}, \qquad \tilde{X}_3 := \frac{1}{2} \begin{bmatrix} (1+\eta)^2 m & -\eta(1+\eta) m & -(1+\eta) \\ -\eta(1+\eta) m & \eta^2 m & \eta \\ -(1+\eta) & \eta & 0 \end{bmatrix},$$
with $\delta := \beta - \eta$. In addition, one can define the supply rate as (4) with $X := X_1 + \rho^2 X_2 + (1 - \rho^2) X_3$. Then, for all trajectories $(\xi_k, w_k)$ of the general accelerated method (18), this supply rate satisfies the inequality
$$S(\xi_k, w_k) \le \rho^2 \big(f(x_k) - f(x_\star)\big) - \big(f(x_{k+1}) - f(x_\star)\big). \tag{21}$$

Proof. A detailed proof is presented in the supplementary material. One mainly needs to modify the proof of Lemma 3 by taking the difference between $\beta$ and $\eta$ into account.

Based on the supply rate (21), we can immediately modify Theorem 4 to handle the more general algorithm (18). Although we do not have general analytical formulas for the convergence rate of (18), preliminary numerical results suggest that there is a family of $(\alpha, \beta, \eta)$ leading to the rate $\rho^2 = 1 - \sqrt{m/L}$, and the required value of $\tilde{P}$ is quite different from (17). This indicates that our proposed LMI approach could go beyond the Lyapunov function (17).

Remark 6. It is noted in Lessard et al. (2016, §3.1) that searching over combinations of multiple IQCs may yield improved rate bounds. The same is true of supply rates. For example, we could include $\lambda_1, \lambda_2 \ge 0$ as decision variables and search for a dissipation inequality with supply rate $\lambda_1 S_1 + \lambda_2 S_2$, where e.g. $S_1$ is (7) and $S_2$ is (21).

4. Dissipativity for Sublinear Rates

The LMI approach in Lessard et al. (2016) is tailored to the analysis of linear convergence rates for algorithms that are time-invariant (the $A$ and $B$ matrices in (1) do not change with $k$). We now show that dissipativity theory can be used to analyze the sublinear rates $O(1/T)$ and $O(1/T^2)$ via slight modifications of the dissipation inequality.

4.1. Dissipativity for O(1/T) Rates

The $O(1/T)$ modification, which we present first, is very similar to the linear rate result.

Theorem 7. Suppose $f$ has a finite minimum $f_\star$. Consider the LTI system (1) with a supply rate satisfying
$$S(\xi_k, w_k) \le -\big(f(z_k) - f_\star\big) \tag{22}$$
for some sequence $\{z_k\}$. If there exists a nonnegative storage function $V$ such that the dissipation inequality (2) holds over all trajectories of $(\xi_k, w_k)$, then the following inequality holds over all trajectories as well:
$$\sum_{k=0}^{T} \big(f(z_k) - f_\star\big) \le V(\xi_0). \tag{23}$$
In addition, we have the sublinear convergence rate
$$\min_{0 \le k \le T} \big(f(z_k) - f_\star\big) \le \frac{V(\xi_0)}{T+1}. \tag{24}$$
If $f(z_{k+1}) \le f(z_k)$ for all $k$, then (24) implies that $f(z_T) - f_\star \le V(\xi_0)/(T+1)$ for all $T$.

Proof. By the supply rate condition (22) and the dissipation inequality (2), we immediately get $V(\xi_{k+1}) - V(\xi_k) + f(z_k) - f_\star \le 0$. Summing this inequality from $k = 0$ to $T$ and using $V \ge 0$ yields the desired result.

To address the sublinear rate analysis, the critical step is to choose an appropriate supply rate. If $f$ is $L$-smooth and convex, this is easily done. Consider the gradient method (6) and define the quantities $\xi_k := x_k - x_\star$, $A := I_p$, and $B := -\alpha I_p$ as in Section 2.3. Since $f$ is $L$-smooth and convex, define the quadratic supply rate
$$S(\xi_k, w_k) := \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T \begin{bmatrix} 0_p & -\tfrac{1}{2} I_p \\ -\tfrac{1}{2} I_p & \tfrac{1}{2L} I_p \end{bmatrix} \begin{bmatrix} \xi_k \\ w_k \end{bmatrix},$$

which satisfies $S(\xi_k, w_k) \le f_\star - f(x_k)$ for all $k$ (co-coercivity). Then we can directly apply the LMI (5) with $\rho = 1$ to construct the dissipation inequality. Setting $P = p I_p$ and defining the storage function as $V(\xi_k) := p\, \xi_k^T \xi_k = p \|x_k - x_\star\|^2$, the LMI (5) becomes
$$\begin{bmatrix} 0 & -\alpha p + \tfrac{1}{2} \\ -\alpha p + \tfrac{1}{2} & \alpha^2 p - \tfrac{1}{2L} \end{bmatrix} \otimes I_p \preceq 0,$$
which is equivalent to
$$\begin{bmatrix} 0 & -\alpha p + \tfrac{1}{2} \\ -\alpha p + \tfrac{1}{2} & \alpha^2 p - \tfrac{1}{2L} \end{bmatrix} \preceq 0. \tag{25}$$
Due to the $(1,1)$ entry being zero, (25) holds if and only if
$$-\alpha p + \tfrac{1}{2} = 0 \;\Longrightarrow\; p = \tfrac{1}{2\alpha}, \qquad \alpha^2 p - \tfrac{1}{2L} \le 0 \;\Longrightarrow\; \alpha \le \tfrac{1}{L}.$$
We can choose $\alpha = 1/L$, and the bound (24) becomes
$$\min_{0 \le k \le T} \big(f(x_k) - f_\star\big) \le \frac{L \|x_0 - x_\star\|^2}{2(T+1)}.$$
Since gradient descent has monotonically nonincreasing iterates, that is, $f(x_{k+1}) \le f(x_k)$ for all $k$, we immediately recover the standard $O(1/T)$ rate result.
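This bound is easy to test empirically. The snippet below (our illustration, not from the paper; it assumes numpy) runs gradient descent with α = 1/L on a convex quadratic, where the minimum and the smoothness constant are known exactly, and checks every iterate against the certified bound.

```python
import numpy as np

# Convex quadratic test problem: f(x) = 0.5 * x^T diag(q) x, so f_star = 0 and
# x_star = 0; f is L-smooth with L = max(q), and one q_i = 0 makes m = 0.
rng = np.random.default_rng(0)
q = np.sort(rng.uniform(0.0, 1.0, 50))
q[0] = 0.0
L = q.max()
f = lambda x: 0.5 * (q * x**2).sum()

x = rng.standard_normal(50)
half_L_dist0 = 0.5 * L * (x**2).sum()     # L * ||x0 - x_star||^2 / 2
for T in range(200):
    assert f(x) <= half_L_dist0 / (T + 1)  # the certified O(1/T) bound
    x = x - (1.0 / L) * (q * x)            # gradient step with alpha = 1/L
```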

4.2. Dissipativity for O(1/T^2) Rates

Certifying an $O(1/T)$ rate for the gradient method required solving a single LMI (5). However, this is not the case for the $O(1/T^2)$ rate analysis of Nesterov's accelerated method: Nesterov's algorithm has parameters that depend on $k$, so the analysis is more involved. We begin with the general case and then specialize to Nesterov's algorithm.

Consider the dynamical system
$$\xi_{k+1} = A_k \xi_k + B_k w_k. \tag{26}$$
The state matrix $A_k$ and input matrix $B_k$ change with the time step $k$, and hence (26) is referred to as a linear time-varying (LTV) system. The analysis of LTV systems typically requires a time-dependent supply rate such as
$$S_k(\xi, w) := \begin{bmatrix} \xi \\ w \end{bmatrix}^T X_k \begin{bmatrix} \xi \\ w \end{bmatrix}. \tag{27}$$
If there exists a sequence $\{P_k\}$ with $P_k \succeq 0$ such that
$$\begin{bmatrix} A_k^T P_{k+1} A_k - P_k & A_k^T P_{k+1} B_k \\ B_k^T P_{k+1} A_k & B_k^T P_{k+1} B_k \end{bmatrix} - X_k \preceq 0 \tag{28}$$
for all $k$, then we have $V_{k+1}(\xi_{k+1}) - V_k(\xi_k) \le S_k(\xi_k, w_k)$ with the time-dependent storage function defined as $V_k(\xi) := \xi^T P_k \xi$. This is a standard approach to dissipation inequality constructions for LTV systems and can be proved using the same proof technique as in Theorem 2. Note that we need (28) to hold simultaneously for all $k$, which leads to an infinite number of LMIs in general.

Now we consider Nesterov's accelerated method for a convex $L$-smooth objective function $f$ (Nesterov, 2003):
$$x_{k+1} = y_k - \alpha \nabla f(y_k), \tag{29a}$$
$$y_k = (1+\beta_k) x_k - \beta_k x_{k-1}. \tag{29b}$$
It is known that (29) achieves a rate of $O(1/T^2)$ when $\alpha := 1/L$ and $\beta_k$ is defined recursively as follows:
$$\zeta_0 = 0, \qquad \zeta_{k+1} = \frac{1 + \sqrt{1 + 4\zeta_k^2}}{2}, \qquad \beta_k = \frac{\zeta_k - 1}{\zeta_{k+1}}.$$
The sequence $\{\zeta_k\}$ satisfies $\zeta_{k+1}^2 - \zeta_{k+1} = \zeta_k^2$. We now present a dissipativity theory for the sublinear rate analysis of Nesterov's accelerated method. Rewrite (29) as
$$\begin{bmatrix} x_{k+1} - x_\star \\ x_k - x_\star \end{bmatrix} = A_k \begin{bmatrix} x_k - x_\star \\ x_{k-1} - x_\star \end{bmatrix} + B_k w_k, \tag{30}$$
where $w_k := \nabla f(y_k) = \nabla f((1+\beta_k)x_k - \beta_k x_{k-1})$, $A_k := \tilde{A}_k \otimes I_p$, $B_k := \tilde{B}_k \otimes I_p$, and $\tilde{A}_k$, $\tilde{B}_k$ are given by
$$\tilde{A}_k := \begin{bmatrix} 1+\beta_k & -\beta_k \\ 1 & 0 \end{bmatrix}, \qquad \tilde{B}_k := \begin{bmatrix} -\alpha \\ 0 \end{bmatrix}.$$
Hence, Nesterov's accelerated method (29) is in the form of (26) with $\xi_k := [(x_k - x_\star)^T \;\; (x_{k-1} - x_\star)^T]^T$. The $O(1/T^2)$ rate analysis of Nesterov's method (29) requires the following time-dependent supply rate.

Lemma 8. Let $f$ be $L$-smooth and convex. Let $x_\star$ be a point satisfying $\nabla f(x_\star) = 0$, and set $f_\star := f(x_\star)$. Consider Nesterov's method (29), or equivalently (30). The following inequalities hold for all trajectories and for all $k$:
$$\begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T M_k \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f(x_k) - f(x_{k+1}), \qquad \begin{bmatrix} \xi_k \\ w_k \end{bmatrix}^T N_k \begin{bmatrix} \xi_k \\ w_k \end{bmatrix} \le f_\star - f(x_{k+1}),$$
where $M_k := \tilde{M}_k \otimes I_p$, $N_k := \tilde{N}_k \otimes I_p$, and $\tilde{M}_k$, $\tilde{N}_k$ are defined by
$$\tilde{M}_k := \frac{1}{2} \begin{bmatrix} 0 & 0 & -\beta_k \\ 0 & 0 & \beta_k \\ -\beta_k & \beta_k & \tfrac{1}{L} \end{bmatrix}, \tag{31}$$
$$\tilde{N}_k := \frac{1}{2} \begin{bmatrix} 0 & 0 & -(1+\beta_k) \\ 0 & 0 & \beta_k \\ -(1+\beta_k) & \beta_k & \tfrac{1}{L} \end{bmatrix}. \tag{32}$$
Given any nondecreasing sequence $\{\mu_k\}$, one can define the supply rate as (27) with the particular choice $X_k := \mu_k M_k + (\mu_{k+1} - \mu_k) N_k$ for all $k$. Then this supply rate satisfies the condition
$$S_k(\xi_k, w_k) \le \mu_k \big(f(x_k) - f_\star\big) - \mu_{k+1} \big(f(x_{k+1}) - f_\star\big). \tag{33}$$

Proof. The proof is very similar to the proof of Lemma 3, with the extra condition $m = 0$. A detailed proof is presented in the supplementary material.

Theorem 9. Consider the LTV dynamical system (26). If there exist matrices $\{P_k\}$ with $P_k \succeq 0$ and a nondecreasing sequence of nonnegative scalars $\{\mu_k\}$ such that
$$\begin{bmatrix} A_k^T P_{k+1} A_k - P_k & A_k^T P_{k+1} B_k \\ B_k^T P_{k+1} A_k & B_k^T P_{k+1} B_k \end{bmatrix} - \mu_k M_k - (\mu_{k+1} - \mu_k) N_k \preceq 0, \tag{34}$$
then we have $V_{k+1}(\xi_{k+1}) - V_k(\xi_k) \le S_k(\xi_k, w_k)$ with the storage function $V_k(\xi_k) := \xi_k^T P_k \xi_k$ and the supply rate (27) using $X_k := \mu_k M_k + (\mu_{k+1} - \mu_k) N_k$ for all $k$. In addition, if this supply rate satisfies (33), we have
$$f(x_T) - f_\star \le \frac{\mu_0 \big(f(x_0) - f_\star\big) + V_0(\xi_0)}{\mu_T}. \tag{35}$$

Proof. Based on the state-space model (26), we can left- and right-multiply (34) by $[\xi_k^T \; w_k^T]$ and $[\xi_k^T \; w_k^T]^T$ and directly obtain the dissipation inequality. Combining this dissipation inequality with (33), we can show
$$V_{k+1}(\xi_{k+1}) + \mu_{k+1}\big(f(x_{k+1}) - f_\star\big) \le V_k(\xi_k) + \mu_k\big(f(x_k) - f_\star\big).$$
Summing the above inequality as in the proof of Theorem 7 and using the fact that $P_k \succeq 0$ for all $k$ yields the result.

We are now ready to show the $O(1/T^2)$ rate result for Nesterov's accelerated method. Set $\mu_k := \zeta_k^2$ and
$$\tilde{P}_k := \frac{L}{2} \begin{bmatrix} \zeta_k \\ 1 - \zeta_k \end{bmatrix} \begin{bmatrix} \zeta_k & 1 - \zeta_k \end{bmatrix}.$$
Note that $\tilde{P}_k \succeq 0$ and $\mu_{k+1} - \mu_k = \zeta_{k+1}$. It is straightforward to verify that this choice of $\{P_k, \mu_k\}$ makes the left side of (34) the zero matrix, and hence (35) holds. Using the fact that $\zeta_k \ge k/2$ (easily proved by induction), we have $\mu_k \ge k^2/4$, and the $O(1/T^2)$ rate for Nesterov's method follows.
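The claim that (34) holds with equality for this choice of {P_k, µ_k} can be checked numerically. The sketch below is ours rather than code from the paper; it assumes numpy and works with the scalar case p = 1, evaluating the left side of (34) along the ζ-sequence and reporting the largest entry in absolute value.

```python
import numpy as np

L = 3.0
alpha = 1.0 / L
zeta = [0.0]
for _ in range(25):   # zeta_{k+1} = (1 + sqrt(1 + 4 zeta_k^2)) / 2
    zeta.append(0.5 * (1 + np.sqrt(1 + 4 * zeta[-1]**2)))

def P(k):             # P_k = (L/2) v v^T with v = [zeta_k, 1 - zeta_k]
    v = np.array([[zeta[k]], [1 - zeta[k]]])
    return 0.5 * L * (v @ v.T)

worst = 0.0
for k in range(24):
    beta = (zeta[k] - 1) / zeta[k + 1]
    A = np.array([[1 + beta, -beta], [1, 0]])
    B = np.array([[-alpha], [0]])
    M = 0.5 * np.array([[0, 0, -beta], [0, 0, beta], [-beta, beta, 1 / L]])
    N = 0.5 * np.array([[0, 0, -(1 + beta)], [0, 0, beta],
                        [-(1 + beta), beta, 1 / L]])
    mu, dmu = zeta[k]**2, zeta[k + 1]     # mu_k and mu_{k+1} - mu_k
    top = np.hstack([A.T @ P(k + 1) @ A - P(k), A.T @ P(k + 1) @ B])
    bot = np.hstack([B.T @ P(k + 1) @ A, B.T @ P(k + 1) @ B])
    lhs = np.vstack([top, bot]) - mu * M - dmu * N
    worst = max(worst, np.abs(lhs).max())
print(worst)  # numerically zero: (34) holds with equality at every k
```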
Remark 10. Theorem 9 is quite general. The infinite family of LMIs (34) can also be applied to linear rate analysis, and it collapses to the single LMI (5) in that case. To apply (34) to linear rate analysis, one needs to slightly modify (31)–(32) so that the strong convexity parameter $m$ is incorporated into the formulas for $M_k$ and $N_k$. By setting $\mu_k := \rho^{-2k}$ and $P_k := \rho^{-2k} P$, the LMI (34) becomes the same for all $k$ and we recover (5). This illustrates how the infinite number of LMIs in (34) can collapse to a single LMI under special circumstances.

5. Continuous-time Dissipation Inequality

Finally, we briefly discuss dissipativity theory for the continuous-time ODEs used in optimization research. Note that dissipativity theory was first introduced in Willems (1972a;b) in the context of continuous-time systems. We denote continuous-time variables in upper case. Consider a continuous-time state-space model
$$\dot{\Lambda}(t) = A(t) \Lambda(t) + B(t) W(t), \tag{36}$$
where $\Lambda(t)$ is the state, $W(t)$ is the input, and $\dot{\Lambda}(t)$ denotes the time derivative of $\Lambda(t)$. In continuous time, the supply rate is a function $S : \mathbb{R}^{n_\Lambda} \times \mathbb{R}^{n_W} \times \mathbb{R}_+ \to \mathbb{R}$ that assigns a scalar to each possible state/input pair. Here, we allow $S$ to also depend on time $t \in \mathbb{R}_+$. To simplify our exposition, we omit the explicit time dependence $(t)$ from our notation.

Definition 11. The dynamical system (36) is dissipative with respect to the supply rate $S$ if there exists a function $V : \mathbb{R}^{n_\Lambda} \times \mathbb{R}_+ \to \mathbb{R}_+$ such that $V(\Lambda, t) \ge 0$ for all $\Lambda \in \mathbb{R}^{n_\Lambda}$ and $t \ge 0$, and
$$\dot{V}(\Lambda, t) \le S(\Lambda, W, t) \tag{37}$$
for every trajectory of (36). Here, $\dot{V}$ denotes the Lie derivative (or total derivative); it accounts for $\Lambda$'s dependence on $t$. The function $V$ is called a storage function, and (37) is a (continuous-time) dissipation inequality.

For any given quadratic supply rate, one can automatically construct the continuous-time dissipation inequality using semidefinite programming. The following result is standard in the controls literature.

Theorem 12. Suppose $X(t) \in \mathbb{R}^{(n_\Lambda + n_W) \times (n_\Lambda + n_W)}$ and $X(t) = X(t)^T$ for all $t$. Consider the quadratic supply rate
$$S(\Lambda, W, t) := \begin{bmatrix} \Lambda \\ W \end{bmatrix}^T X(t) \begin{bmatrix} \Lambda \\ W \end{bmatrix} \quad \text{for all } t. \tag{38}$$
If there exists a family of matrices $P(t) \in \mathbb{R}^{n_\Lambda \times n_\Lambda}$ with $P(t) \succeq 0$ such that
$$\begin{bmatrix} A(t)^T P(t) + P(t) A(t) + \dot{P}(t) & P(t) B(t) \\ B(t)^T P(t) & 0 \end{bmatrix} - X(t) \preceq 0 \quad \text{for all } t, \tag{39}$$
then we have $\dot{V}(\Lambda, t) \le S(\Lambda, W, t)$ with the storage function defined as $V(\Lambda, t) := \Lambda^T P(t) \Lambda$.

Proof. Based on the state-space model (36), we can apply the product rule for total derivatives and obtain
$$\dot{V}(\Lambda, t) = \dot{\Lambda}^T P \Lambda + \Lambda^T P \dot{\Lambda} + \Lambda^T \dot{P} \Lambda = \begin{bmatrix} \Lambda \\ W \end{bmatrix}^T \begin{bmatrix} A^T P + P A + \dot{P} & P B \\ B^T P & 0 \end{bmatrix} \begin{bmatrix} \Lambda \\ W \end{bmatrix}.$$
Hence we can left- and right-multiply (39) by $[\Lambda^T \; W^T]$ and $[\Lambda^T \; W^T]^T$ and obtain the desired conclusion.

The algebraic structure of the LMI (39) is simpler than that of its discrete-time counterpart (5) because, for given $P$, the continuous-time LMI is linear in $A$ and $B$ rather than quadratic. This may explain why continuous-time ODEs are sometimes more amenable to analytic approaches than their discretized counterparts.

We demonstrate the utility of (39) on the continuous-time limit of Nesterov's accelerated method derived in Su et al. (2016):
$$\ddot{Y} + \frac{3}{t} \dot{Y} + \nabla f(Y) = 0, \tag{40}$$
which we rewrite as (36) with $\Lambda := [\dot{Y}^T \;\; (Y - x_\star)^T]^T$, $W := \nabla f(Y)$, where $x_\star$ is a point satisfying $\nabla f(x_\star) = 0$, and $A$, $B$ defined by
$$A(t) := \begin{bmatrix} -\frac{3}{t} I_p & 0_p \\ I_p & 0_p \end{bmatrix}, \qquad B(t) := \begin{bmatrix} -I_p \\ 0_p \end{bmatrix}.$$
Suppose $f$ is convex, and set $f_\star := f(x_\star)$. Su et al. (2016, Theorem 3) constructs the Lyapunov function
$$\mathcal{V}(Y, t) := t^2 \big(f(Y) - f_\star\big) + 2 \left\| Y + \tfrac{t}{2} \dot{Y} - x_\star \right\|^2$$
to show that $\dot{\mathcal{V}} \le 0$, and then directly demonstrates an $O(1/t^2)$ rate for the ODE (40). To illustrate the power of the dissipation inequality, we use the LMI (39) to recover this Lyapunov function. Denote $G(Y, t) := t^2 (f(Y) - f_\star)$. Note that convexity implies $f(Y) - f_\star \le \nabla f(Y)^T (Y - x_\star)$, which we rewrite as
$$t \big(f(Y) - f_\star\big) \le \begin{bmatrix} \dot{Y} \\ Y - x_\star \\ W \end{bmatrix}^T \begin{bmatrix} 0_p & 0_p & 0_p \\ 0_p & 0_p & \tfrac{t}{2} I_p \\ 0_p & \tfrac{t}{2} I_p & 0_p \end{bmatrix} \begin{bmatrix} \dot{Y} \\ Y - x_\star \\ W \end{bmatrix}.$$
Since $\dot{G}(Y, t) = 2t (f(Y) - f_\star) + t^2 \nabla f(Y)^T \dot{Y}$, we have
$$\dot{G} \le \begin{bmatrix} \dot{Y} \\ Y - x_\star \\ W \end{bmatrix}^T \begin{bmatrix} 0_p & 0_p & \tfrac{t^2}{2} I_p \\ 0_p & 0_p & t I_p \\ \tfrac{t^2}{2} I_p & t I_p & 0_p \end{bmatrix} \begin{bmatrix} \dot{Y} \\ Y - x_\star \\ W \end{bmatrix}.$$
Now choose the supply rate $S$ as in (38) with $X(t)$ given by
$$X(t) := -\begin{bmatrix} 0_p & 0_p & \tfrac{t^2}{2} I_p \\ 0_p & 0_p & t I_p \\ \tfrac{t^2}{2} I_p & t I_p & 0_p \end{bmatrix}.$$
Clearly $S(\Lambda, W, t) \le -\dot{G}(Y, t)$. Now we can choose
$$P(t) := \begin{bmatrix} \tfrac{t^2}{2} I_p & t I_p \\ t I_p & 2 I_p \end{bmatrix}.$$
Substituting this $P$ and $X$ into (39), the left side of (39) becomes identically zero. Therefore, $\dot{V}(\Lambda, t) \le S(\Lambda, W, t) \le -\dot{G}(Y, t)$ with the storage function $V(\Lambda, t) := \Lambda^T P(t) \Lambda$. By defining the Lyapunov function $\mathcal{V}(Y, t) := V(\Lambda, t) + G(Y, t)$, we immediately obtain $\dot{\mathcal{V}} \le 0$ and also recover the same Lyapunov function used in Su et al. (2016).

Remark 13. As in the discrete-time case, the infinite family of LMIs in (39) can also be reduced to a single LMI for the linear rate analysis of continuous-time ODEs. For further discussion on the topic of continuous-time exponential dissipation inequalities, see Hu & Seiler (2016).
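The statement that the left side of (39) vanishes for this choice of P(t) and X(t) can be verified symbolically. A minimal sketch (ours, not from the paper; it assumes sympy and works with the scalar case p = 1):

```python
import sympy as sp

t = sp.symbols('t', positive=True)
A = sp.Matrix([[-3 / t, 0], [1, 0]])   # scalar case p = 1
B = sp.Matrix([[-1], [0]])
P = sp.Matrix([[t**2 / 2, t], [t, 2]])
X = -sp.Matrix([[0, 0, t**2 / 2],
                [0, 0, t],
                [t**2 / 2, t, 0]])

# Left side of (39): [[A'P + PA + P', PB], [B'P, 0]] - X
top = (A.T * P + P * A + P.diff(t)).row_join(P * B)
bot = (B.T * P).row_join(sp.zeros(1, 1))
lhs = sp.simplify(top.col_join(bot) - X)
print(lhs)   # prints the 3x3 zero matrix
```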
6. Conclusion and Future Work

In this paper, we developed new notions of dissipativity theory for the understanding of Nesterov's accelerated method. Our approach enjoys the advantages of both the IQC framework (Lessard et al., 2016) and the discretization approach (Wilson et al., 2016), in the sense that our proposed LMI condition is simple enough for analytical rate analysis of Nesterov's method and can also be easily generalized to more complicated algorithms. Our approach also gives an intuitive interpretation of the convergence behavior of Nesterov's method from an energy dissipation perspective.

One potential application of our dissipativity theory is the design of accelerated methods that are robust to gradient noise. This is similar to the algorithm design work in Lessard et al. (2016, §6). However, compared with the IQC approach in Lessard et al. (2016), our dissipativity theory leads to smaller LMIs. This can be beneficial, since smaller LMIs are generally easier to solve analytically. In addition, the IQC approach in Lessard et al. (2016) is only applicable to strongly convex objective functions, while our dissipativity theory may facilitate the design of robust algorithms for weakly convex objective functions. The dissipativity framework may also lead to the design of adaptive or time-varying algorithms.

Acknowledgements

Both authors would like to thank the anonymous reviewers for helpful suggestions that improved the clarity and quality of the final manuscript. This material is based upon work supported by the National Science Foundation under Grant No. 1656951. Both authors also acknowledge support from the Wisconsin Institute for Discovery, the College of Engineering, and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison.

References

Bubeck, S. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.

Bubeck, S., Lee, Y. T., and Singh, M. A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.

Drusvyatskiy, D., Fazel, M., and Roy, S. An optimal first order method based on optimal quadratic averaging. arXiv preprint arXiv:1604.06543, 2016.

Fazlyab, M., Ribeiro, A., Morari, M., and Preciado, V. M. Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. arXiv preprint arXiv:1705.03615, 2017.

Flammarion, N. and Bach, F. From averaging to acceleration, there is only a step-size. In COLT, pp. 658–695, 2015.

Hu, B. and Seiler, P. Exponential decay rate conditions for uncertain linear systems using integral quadratic constraints. IEEE Transactions on Automatic Control, 61(11):3561–3567, 2016.

Lessard, L., Recht, B., and Packard, A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.

Megretski, A. and Rantzer, A. System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control, 42:819–830, 1997.

Megretski, A., Jönsson, U., Kao, C. Y., and Rantzer, A. Control Systems Handbook, chapter 41: Integral Quadratic Constraints. CRC Press, 2010.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2003.

Polyak, B. T. Introduction to Optimization. Optimization Software, 1987.

Seiler, P. Stability analysis with dissipation inequalities and integral quadratic constraints. IEEE Transactions on Automatic Control, 60(6):1704–1709, 2015.

Su, W., Boyd, S., and Candès, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.

Wibisono, A., Wilson, A., and Jordan, M. A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, pp. 201614734, 2016.

Willems, J. C. Dissipative dynamical systems Part I: General theory. Archive for Rational Mechanics and Analysis, 45(5):321–351, 1972a.

Willems, J. C. Dissipative dynamical systems Part II: Linear systems with quadratic supply rates. Archive for Rational Mechanics and Analysis, 45(5):352–393, 1972b.

Wilson, A., Recht, B., and Jordan, M. A Lyapunov analysis of momentum methods in optimization. arXiv preprint arXiv:1611.02635, 2016.