arxiv: v4 [math.oc] 1 Apr 2018

Size: px

Start display at page:

Download "arxiv: v4 [math.oc] 1 Apr 2018"

Shon Rice
5 years ago
Views:

1 Generalizing the optimized gradient method for smooth convex minimization Donghwan Kim Jeffrey A. Fessler arxiv: v4 [math.oc] Apr 08 Date of current version: April 3, 08 Abstract This paper generalizes the optimized gradient method OGM [9, 5, 6] that achieves the optimal worst-case cost function bound of first-order methods for smooth convex minimization [7]. Specifically, this paper studies a generalized formulation of OGM and analyzes its worst-case rates in terms of both the function value and the norm of the function gradient. This paper also develops a new algorithm called OGM-OG that is in the generalized family of OGM and that has the best known analytical worst-case bound with rate O/N.5 on the decrease of the gradient norm among fixed-step first-order methods. This paper also proves that Nesterov s fast gradient method [4,6] has an O/N.5 worstcase gradient norm rate but with constant larger than OGM-OG. The proof is based on the worst-case analysis called Performance Estimation Problem in [9]. Introduction First-order methods are favorable for solving large-scale problems because their computational complexity per iteration depends mildly on the problem dimension. In particular, Nesterov s fast gradient method FGM[4,6]achievestheoptimalworst-caserateO/N fordecreasingsmoothconvexfunctionsafter N iterations [5], and thus has been widely used in large-scale applications. Recently, the optimized gradient method OGM [9, 5, 6] was found to achieve the optimal worst-case cost function bound of first-order methods with either fixed-step or adaptive-step approaches for smooth convex minimization in [7], whereas FGM achieves that bound only up to constant. Building upon [9, 5, 6], this paper presents two different ways of generalizing OGM and its development. First, this paper specifies a parameterized family of algorithms that generalizes OGM, and provides worst-case bounds on the function and gradient norm values for this family. Like the generalized forms of FGM [3, 6] being widely used and studied e.g., [, 3, 9], we believe introducing the generalized OGM here can be potentially useful. Second, this paper optimizes the step coefficients of fixed-step first-order methods with respect to the rate of decrease of the cost function s gradient norm, leading to a new algorithm called OGM-OG OG for optimized over a gradient. This development expands the choice of worst-case rate metrics for optimizing first-order methods in [9, 5, 6] that focused on the cost function decrease leading to OGM. We next briefly review the Performance Estimation Problem PEP [9] that was used in [9,5,6] to develop OGM and that we extensively use throughout the paper. This research was supported in part by NIH grant U0 EB Donghwan Kim Jeffrey A. Fessler Dept. of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 4809 USA kimdongh@umich.edu, fessler@umich.edu There is a backtracking line-search version of FGM [5] that also achieves the optimal worst-case function bound up to constant, which is sometimes more useful than the fixed-step FGM in practice. However, such backtracking line-search version of OGM with a fast worst-case bound is yet unknown unlike the fixed-step OGM [9,5,6], while recently an exact line-search version of OGM is developed in [8].

2 Drori and Teboulle [9] cast a worst-case analysis into an optimization problem called PEP [9] that examines the maximal absolute cost function inaccuracy over all possible inputs cost functions to the optimization algorithm. See e.g., [5,8,0,5,6,7,8,9,30,3] for its extensions. Moreover, Drori and Teboulle [9] optimized numerically the step coefficients of first-order methods using PEP for smooth convex minimization, and found an algorithm whose worst-case bound is lower than that of FGM, but it required too much computation and memory to be appealing for large-scale problems. Building on their work, the authors [5, 6] found computationally and memory-wise efficient version, called OGM, and showed analytically that OGM satisfies an analytical worst-case bound that is twice smaller than that of FGM. Drori [7] showed that the OGM is optimal for large-dimensional smooth convex minimization over a general class of first-order methods with either fixed or adaptive step sizes [7]. This OGM has been numerically extended for nonsmooth composite convex problems in [30]. In addition, this OGM-type algorithm was already studied in the context of a proximal point method [3, Appendix]. Using the PEP approach[9], this paper proposes a generalized version of OGMGOGM and analyzes its worst-case rates in terms of both the decrease of the cost function and the decrease of the norm of the gradient of the cost function. The results complement the worst-case analysis of the OGM [5, 6], and expands our understanding of OGM-type first-order methods. This paper analyzes the worst-case rate of the gradient norm in addition to that of the cost function because it is important when dealing with dual problems, considering that the dual gradient norm corresponds to the primal distance to feasibility see e.g., [6,, 7]. While FGM has not been shown previously to satisfy a rate O/N.5 for decreasing the gradient norm, modified versions of FGM with such rate were studied in [,,7]. This paper proves that FGM in fact does have that rate, building upon [3] that numerically conjectured such rate for FGM using the gradient norm version of PEP. For further acceleration of the worst-case gradient norm rate, we optimize the step coefficients of first-order methods with respect to the gradient norm using PEP and propose an algorithm named OGM-OG that belongs to the GOGM family and has the best known analytical worst-case bound on rate of decrease of the gradient norm among fixed-step first-order methods. One can extend some aspects of the approaches for generalizing OGM described in this paper to other optimization algorithms and problems. One direction we have already taken in [7] aims to improve the fast iterative shrinkage/thresholding algorithm FISTA [] that reduces to FGM for smooth convex problems for nonsmooth composite convex problems. Naturally, this paper and [7] use some similar approaches, but they are different; the methods in [7] when simplified to the smooth case correspond to a generalization of FGM that differs from the GOGM. Another direction we have recently taken in [8] focuses on optimizing the step coefficients of first-order methods with respect to the gradient norm under the initial bounded function condition that is different from the initial bounded distance condition used in this paper. Sec. defines the smooth convex problem and the first-order methods. Sec. 3 reviews and discusses worst-case analyses of a gradient method GM, FGM, and OGM for both the function value and the gradient norm. Sec. 3 also reviews first-order methods that guarantee an O/N.5 rate for the gradient decrease. Sec. 4 reviews the cost function form of PEP [9] and reviews how the OGM [5] is derived using such PEP. Sec. 4 then proposes a generalized version of OGM GOGM using the cost function form of PEP, and Sec. 5 provides a worst-case gradient norm bound for the GOGM using the gradient form of PEP. Then, Sec. 5 optimizes the step coefficients using the gradient form of PEP and proposes the OGM-OG that belongs to the GOGM family. Sec. 5 also proves that FGM decreases the gradient norm with a rate O/N.5. Sec. 6 and Sec. 7 provide discussion and conclusion. Smooth convex problem and first-order methods. Smooth convex problem We focus on the following smooth convex minimization problem min fx, x R d M The original PEP was intractable to solve, so a series of relaxation on the PEP was introduced in [9] to make it possibly solvable, which we review in Sec. 4..

3 where the following additional conditions are assumed: f : R d R is a convex function of the type C, L Rd, i.e., continuously differentiable with Lipschitz continuous gradient: fx fy L x y, x,y R d,. where L > 0 is the Lipschitz constant. The optimal set X f = argmin x R d fx is nonempty, i.e., the problem M is solvable. We use F L R d to denote the class of functions that satisfy the above conditions. We also assume that the distance between an initial point x 0 and an optimal solution x Xf is bounded by some R > 0, i.e., x 0 x R... First-order methods To solve M, we consider the following class of fixed-step or non-apdative-step first-order methods FSFOM, where the update step at i + th iteration is a weighted sum of the previous and current gradients{ fx k } i scaledby L withfixedconstantstepcoefficients{h i+,k} i thatarenotadaptive tothegivenf andx 0 andthuslandr.thisclassfsfomincludesgm, FGM,OGM,andthemethods proposed in this paper, but excludes line-search-type methods. Algorithm Class FSFOM Input: f F L R d, x 0 R d. For i = 0,...,N x i+ = x i i h i+,k fx k. L 3 Review of the worst-case analysis of FSFOM This section reviews the worst-case analysis of existing FSFOMs and simple variants thereof in terms of bounds on the cost function and gradient norm. Sec. 3. reviews the worst-case cost function decrease of GM, FGM, and OGM. Sec. 3. presents both FSFOMs including GM, FGM, OGM and some variants that have either O/N or O/N.5 rate for the worst-case gradient decrease; it also reviews an O/N lower bound of the worst-case rates of first-order methods for decreasing the gradient norm. 3. Function value worst-case analysis of FSFOM The simplest example of a FSFOM is the following GM that uses only the current gradient and the Lipschitz constant L for the update. Algorithm GM Input: f F L R d, x 0 R d. For i = 0,,... x i+ = x i L fx i 3

4 This GM monotonically decreases the cost function [5] and satisfies the following tight 3 worst-case bound [9, Thm. ], for any i 0, fx i fx LR 4i+. 3. Among the class FSFOM, the following two equivalent forms of FGM [4,6] have been used widely because they decrease the cost function with the optimal rate O/N. Algorithm FGM Input: f F L R d, x 0 = y 0 R d, t 0 =. For i = 0,,... y i+ = x i L fx i + +4t i t i+ =, x i+ = y i+ + t i y i+ y i t i+ Algorithm FGM Input: f F L R d, x 0 = y 0 R d, t 0 =. For i = 0,,... y i+ = x i L fx i z i+ = x 0 i t k fx k L + +4t i t i+ =, x i+ = y i+ + z i+ t i+ t i+ Specifically, the FGM and FGM iterates satisfy the following worst-case cost function bounds [5, 4, 6] for any i : fy i fx LR t LR i i+, and fx i fx LR t i where the parameter t i satisfies t i = i l=0 t l and t i i+ LR i+, 3. for all i. 3.3 A generalized form of FGM in [3] uses parameters t i satisfying t 0 = and t i t i + t i, including the choice t i = i+a a for any a. There is another generalized form of FGM in [6], and these generalized forms of FGM have been widely used and studied e.g., [, 3, 9]. Similarly this paper studies generalizations of the OGM. Building upon [9] that optimized numerically the step coefficients over the cost function form of PEP, the authors [5] developed the following two equivalent forms of OGM, as reviewed in Sec. 4. Algorithm OGM Input: f F L R d, x 0 = y 0 R d, θ 0 =. For i = 0,...,N y i+ = x i L fx i + +4θ i, i N θ i+ = + +8θ i, i = N x i+ = y i+ + θ i θ i+ y i+ y i + θ i θ i+ y i+ x i Algorithm OGM Input: f F L R d, x 0 = y 0 R d, θ 0 =. For i = 0,...,N y i+ = x i L fx i z i+ = x 0 i θ k fx k L + +4θ i, i N θ i+ = + +8θ i, i = N x i+ = y i+ + z i+ θ i+ θ i+ 3 A tight worst-case bound denotes an inequality where the equality holds for some function f. For example, [9, Thm. ] shows the bound 3. is tight. 4

5 The OGM iterates satisfy the following worst-case cost function bounds [5, 6]: for any i N, and The parameter sequence θ i satisfies fy i fx LR 4θi LR i+, 3.4 fx N fx LR θ N θ i = { i+ l=0 θ l, i N, N l=0 θ l +θ N, i = N, LR N +N and θ i { i+, i N, i+, i = N, which is equivalent to t i 3.3 except at the final iteration. The bounds 3.4 and 3.5 of OGM are about twice smaller than the bounds 3. of FGM, so OGM decreases the cost function faster than FGM in the worst case and often in practice [4]. In addition, the bound 3.5 on the final iterate x N is tight and satisfies the optimal worst-case bound of general first-order methods including both FSFOM and adaptive-step first-order methods, when the condition d N + holds [7]. The additional term θi θ i+ y i+ x i of OGM and the additional constant for the update of z i of OGM, compared to FGM and FGM respectively, along with the parameter θ N, are what make OGM optimal for d N +. One of the main goals of this paper is to generalize the form of OGM and analyze the worst-case rate of such generalized OGM in terms of both the function value and the gradient norm, complementing the bounds 3.4 and 3.5 on the function value of OGM. The next section studies the worst-case rate of the gradient of FSFOM Gradient norm worst-case analysis of FSFOM When tackling dual problems, it is known that the gradient norm worst-case rate is important in addition to the function value worst-case rate because the dual gradient norm is related to the primal distance to feasibility see e.g., [6,,7]. One simple way to find a loose worst-casebound for the gradient norm is to use the well-known convex inequality for convex functions with L-Lipschitz continuous gradients [5]: L fx fx f x L fx fx fx, x R d, 3.7 as discussed in [5,3]. Combining the bounds 3., 3. and the inequality 3.7, for any i, the GM iterates satisfy and the iterates of FGM satisfy fx i Lfx i fx LR i+, 3.8 fy i LR t i LR i+, and fx i LR t i LR i Similarly for any i N, the OGM iterates with the bounds 3.4 and 3.5 satisfy fy i LR LR θi i+, and fx N LR LR θ N N Unfortunately, using the inequality 3.7 provides at best an O/N bound due to the optimal rate O/N of the function decrease. Furthermore, in general using 3.7 need not lead to tight worst-case bounds on the gradient norm. Using a different approach, a smaller O/N worst-case bound for the gradient norm of GM was derived in [7], as reviewed in next section. While the bounds 3.8, 3.9, and 3.0 are not guaranteed to be tight, the next section shows that the worst-case gradient bound 3.0 on the final iterate x N of OGM is in fact tight and thus has the same disappointingly slow O/N worst-case bound on the gradient norm as GM. 5

6 3.. FSFOM with rate O/N for decreasing the gradient norm This section uses the following lemma stating that GM monotonically decreases the gradient. Lemma [, Lemma.4] The GM monotonically decreases the gradient norm, i.e., x f L fx fx. 3. The following theorem reviews a simple proof in [7] that provides a worst-case gradient norm bound forgmwithrateo/nthat issmallerthan3.8, where[,thm. 6.]additionallyconsidersLemma. Theorem [, Thm. 6.], [7] Let f : R d R be F L R d and let x 0,,x N R d be generated by GM. Then for any N, LR min fx i = fx N. 3. i {0,...,N} NN + Proof Lemma implies the first equality in 3.. Using 3., 3.7, 3. yields: LR 3. fx m fx 3.7 fx N+ fx + 4m+ L 3. N m+ L fx N, N fx i which is equivalent to 3. using m = N/ for which m N and N m N. Inspired by the conjecture in [3, Sec. 4..3], the following theorem shows that the O/N rate of the worst-case gradient norm bound 3. of GM is tight up to a constant. Theorem Let x 0,,x N R d be generated by GM. Then for any N, LR N + max f F LR d, x Xf, x 0 x R min fx i = i {0,...,N} where the inequality in 3.3 is achieved by the following function in F L R d : ψx = { LR N+ x LR N+, x R N+, i=m max f F LR d, x Xf, x 0 x R fx N, 3.3 L x, x < R N Proof Lemma implies the equality in 3.3. Starting from x 0 = Rν, where ν is a unit vector, the GM iterates are x i = i Rν, ψx i = LR ν, i = 0,...,N, N + N + which implies the inequality 3.3. We next show that the bound 3.0 for the gradient norm at the final iterate x N of OGM is tight and and that its worst-case function is a simple quadratic function. Note that OGM was derived by optimizing a worst-case bound on the cost function decrease and its behavior in terms of gradient norms was not investigated previously. Theorem 3 Let x 0,,x N R d be generated by OGM. Then for any N, max f F LR d, x Xf, x 0 x R min fx i = i {0,...,N} max fx N = LR f F LR d, θ N x Xf, x 0 x R LR, 3.5 N + where the worst-case function in F L R d for OGM in terms of the gradient norm is the quadratic function φx = L x. 6

7 Proof See Appendix A. Comparing 3. and 3.5, we see that GM and OGM have essentially similar worst-case gradient norm bounds. This is a dilemma because OGM is the fastest FSFOM in terms of the worst-case cost function bound, but is as slow as GM in terms of the worst-case gradient norm bound. Therefore, one of the main goals of this paper is to study optimizing the step coefficients of FSFOM using PEP with respect to the gradient norm in Sec We next discuss the specific FSFOM in [7] that decreases the gradient norm with a faster O/N.5 rate. 3.. FSFOM with rate O/N.5 for decreasing the gradient norm Searching for a FSFOM that decreases the gradient norm faster than the O/N rate of GM and OGM, Nesterov [7] among other variants of FGM [, ] considered performing FGM for the first m iterations, and GM for the remaining iterations. He showed that this method, which we denote FGM-m, satisfies a fast rate O/N.5 for decreasing the gradient norm. In [6,,7], FGM-m for m = N/ was used to solve dual problems. To pursue a faster worst-case rate in terms of the constant factor, we consider here another variant that performs OGM for the first m iterations and GM for the remaining iterations, which we denote OGM-m. Algorithm OGM-m Input: f F L R d, x 0 = y 0 R d, ϑ 0 =, m {,...,N }. For i = 0,...,m y i+ = x i L fx i + +4ϑ i, i m ϑ i+ = + +8ϑ i, i = m x i+ = y i+ + ϑ i ϑ i+ y i+ y i + ϑ i ϑ i+ y i+ x i For i = m,...,n x i+ = x i L fx i The following theorem bounds the gradient norm of the OGM-m iterates, inspired by the proof in [, 7] for the worst-case gradient norm bound of the FGM-m iterates. The worst-case bound of FGM-m in [,7] is asymptotically -times larger than the following new bound 3.6 for OGM-m. Theorem 4 Let f : R d R be F L R d and let x 0, x N R d be generated by OGM-m for m N. Then for any N, LR min fx i fx N i {0,...,N} m+ N m Proof Using 3.5, 3.7, 3. yields: LR ϑ m 3.5 fx m fx 3.7 fx N+ fx + L 3. N m+ L fx N, which is equivalent to 3.6 using ϑ m m+ that is implied by 3.6. N fx i The bound 3.6 is minimized at a point close to m = N/3, leading to its approximately smallest constant 3 6 with the rate O/N.5. Other variants of FGM having O/N.5 worst-case gradient bounds were derived in [,]. Such variations of FGM including FGM-m were derived since, prior to this paper, it was unknown whether or not FGM decreases the gradient norm with the rate O/N.5 ; this rate for the gradient norm of i=m 7

8 FGM was conjectured numerically in [3]. Sec. 5. below uses the PEP to show for the first time the rate O/N.5 for the gradient decrease of the FGM. The bound 3.6 of OGM-m for decreasing the gradient is smaller than the bounds for the FGM variants in [,], and Sec. 5.4 below shows that our proposed methods have worst-case bounds even lower than 3.6. The preceding sections have focused on tight or upper worst-case bounds of the gradient norm decrease of first-order methods, whereas the next section reviews a lower bound for the worst-case gradient norm decrease in [3], illustrating the best achievable worst-case rate of the gradient norm decrease for any first-order method with either fixed-step or adaptive-step approaches A lower bound of the worst-case rates of first-order methods for decreasing the gradient norm For completeness, this section reviews a lower bound on the worst-case rate of any first-order method in terms of the gradient norm values for smooth convex quadratic functions [3]. Lower bounds on the function value were studied for convex quadratic functions in [3], and for smooth convex functions in [7, 5]. When the condition d N +3 holds, a worst-case gradient norm bound of any first-order method generating x N after N iterations has rate O/N at best, for convex quadratic f, i.e., has the following lower bound [3, Sec..3.B]: LR 4e N + max fx N, 3.7 f Q LR d, x Xf, x 0 x R where Q L R d := { f : fx x Qx+p x+r for x Ê d, Q 0, Q L }. Since Q L R d F L R d, the lower bound 3.7 for convex quadratic functions also applies to smooth convex functions. A regularization technique in [7] achieves the rate O/N up to a logarithmic factor. However, its adaptive step coefficients require knowing R in advance which is undesirable in practice. To our knowledge, whether there exists any FSFOM satisfying such rate is an open question. Instead, this paper discusses a way to develop FSFOM that achieves an O/N.5 gradient norm bound with the smallest constant among known FSFOM. 4 Relaxation and optimization of the cost function form of PEP Thissectionreviewsarelaxationofthecostfunction formofpep[9]and reviewshow[9,5,6]optimized the step coefficients of the FSFOM class over the cost function form of PEP, leading to OGM. Then, we propose a parameterized family of algorithms that generalizes OGM, and analyze the worst-case cost function decrease of the generalized OGM family. 4. Review: Relaxation for the cost function form of PEP The worst-case bound on the cost function for a FSFOM having given step coefficients h := {h i+,k } corresponds to a solution of the following PEP problem [9, Prob. P]: B P h,n,d,l,r := max fx N fx f F LR d, x 0,,x N R d, x X f, x 0 x R s.t. x i+ = x i i h i+,k fx k, i = 0,...,N. L P Since problem P is impractical to solve due to its functional constraint f F L R d, [9] relaxed it by the following finite set of inequalities satisfied by f [5, Thm...5]: L fx i fx j fx i fx j fx j, x i x j 4. 8

9 for i,j = 0,,...,N,. Then, a matrix G = [g 0,,g N ] R N+ d and a vector δ = [δ 0,,δ N ] R N+ with g i := L x 0 x fx i, and δ i := L x 0 x fx i fx are introduced to represent gradient vectors and function values respectively in the set of 4.. This leads to a finite-dimensional relaxation of problem P [9, Prob. Q]: B P h,n,d,l,r := max G R N+ d, δ R N+ LR δ N P s.t. Tr { G A i,j hg } δ i δ j, i < j = 0,...,N, Tr { G B i,j hg } δ i δ j, j < i = 0,...,N, Tr { G C i G } δ i, i = 0,...,N, Tr { G D i hg+νu i G } δ i, i = 0,...,N, for any given unit vector ν R d, where u i = e i+ R N+ is the i+th standard basis vector. Note that Tr { G u i u j G} = g i, g j by definition. The matrices A i,j h,b i,j h,c i,d i h are defined as A i,j h := u i u j u i u j + j l=i+ B i,j h := u i u j u i u j i l=j+ C i := u iu i, D i h := u iu i + i j= l j h j,ku i u k +u ku i. h l,ku j u k +u ku j, l h l,ku j u k +u ku j, In [9, Prob. Q ], problem P is further relaxed by discarding some constraints to yield B P h,n,d,l,r := max G R N+ d, δ R N+ 4. LR δ N P s.t. Tr { G A i,i hg } δ i δ i, i =,...,N, Tr { G D i hg+νu i G} δ i, i = 0,...,N, for any given unit vector ν R d. We explicitly illustrate the relaxation from P to P because Sec. 5 uses a similar but different relaxation. Taylor et al. [3] avoided this step to analyze a tight worst-case bound of P under a large-scale condition d N + [3, Thm. 5]; however, this relaxation P facilitates the analysis in [9,5,6] and in this paper. Replacing max G,δ LR δ N by min G,δ { δ N } for convenience in P, the Lagrangian of the corresponding constrained minimization problem with dual variables λ = λ,,λ N R N and τ = τ 0,,τ N R N+ for the first and second constraint inequalities of P respectively becomes where N N LG,δ,λ,τ;h = δ N + λ i δ i δ i + τ i δ i 4.3 i= i= i=0 +Tr { G Sh,λ,τG+ντ G }, N N Sh,λ,τ := λ i A i,i h+ τ i D i h. 4.4 Then, we have the following dual problem of P that one could use to compute a valid upper bound of P using a semidefinite program SDP for given h [9, Prob. DQ ]: { Sh,λ,τ B D h,n,l,r := min λ,τ Λ, LR γ : τ } τ γ 0, D γ R i=0 9

10 where Λ = { λ,τ R N+ + : } τ 0 = λ, λ N +τ N =,. 4.5 λ i λ i+ +τ i = 0, i =,...,N, The next section reviews the analytical solution to this upper bound D for OGM, instead of using a numerical SDP solver. 4. Review: Optimizing step coefficients for the cost function form of PEP Drori and Teboulle [9] optimized numerically the step coefficients h over the simple SDP problem D as follows [9, Prob. BIL]: ĥ D := argmin h R NN+/ B D h,n,l,r. HD The problem HD is bilinear, and [9, Thm. 3] 4 used a convex relaxation technique to make it solvable by numerical methods. In [5, Lemma 4], we solved HD analytically yielding the optimized step coefficients θ h i+,k = i+ θ k i j=k+ h j,k, k = 0,...,i, θi θ i+, k = i, for θ i in 3.6. Fortuitously, the optimized coefficients 4.6 lead to equivalent computationally efficient OGM and OGM forms [5, Prop. 3, 4 and 5], and the bound 3.5 for the final secondary iterate x N of OGM is implied by [5, Lemma 4]. Recently, Drori [7] showed that the OGM is optimal for d N+, implying that optimizing over the relaxed bound D in HD for simplicity is equivalent to optimizing over the exact worst-case cost function bound P when d N +. One could use a SDP solver to compute a numerical bound from D for any FSFOM; however, deriving an analytical bound using D is difficult for the primary sequence {y i } of OGM. Therefore, we devised a new relaxed bound in [6] similar to D, which we review next. 4.3 Review: Another cost function form of relaxed PEP for the primary sequence of OGM An upper bound of the worst-case bound on fy N+ fx for FSFOM with step coefficients h and y N+ = x N L fx N could be computed using D by a SDP solver. However, we found it difficult to find its analytical worst-case bound for the primary sequence {y i } of OGM, so [6, Prob. D ] provided the following alternate upper bound on fy N+ fx : B D h,n,l,r := min λ,τ Λ, γ R { LR γ : Sh,λ,τ+ u Nu N τ τ γ } 0, D which led to the bound 3.4 for the primary sequence {y i } of OGM in [6]. Similar to [5, Lemma 4], we found a feasible point of D in [6, Lemma 3.], along with feasible step coefficients h of a FSFOM: t h i+,k = i+ t k i j=k+ h j,k, k = 0,...,i, ti t i+, k = i, for t i in 3.3. Then, [6, Thm. 3.] showed the bound 3.4 using [6, Lemma 3.]. The step coefficients 4.6 and 4.7 are identical except the final iteration, since t i 3.3 and θ i 3.6 are equivalent for i < N. We are now done reviewing the portions of papers [5,6] that are the ingredients for specifying a parameterized family of algorithms that generalizes OGM in the next two sections. 4 [9, Thm. 3] has typos that are fixed in [5, Eq. 6.3]. 0

11 4.4 Feasible points of D and D for the generalized OGM This section specifies feasible points of D and D that lead to a generalized version of OGM. Specifically, the following lemma presents additional feasible points of D; this lemma reduces to [5, Lemma 4] and the step coefficients 4.6 of OGM when θ i = Ω i for all i. Lemma For the following step coefficients: h i+,k = the choice of variables: θ i+ Ω i+ θ k i j=k+ h j,k + θi θi+, i = 0,...,N, k = 0,...,i, Ω i+, i = 0,...,N, k = i, γ = Ω τ N, i = 0, 0, λ i = Ω i τ 0, i =,...,N, τ i = θ i τ 0 i =,...,N, θ N τ 0, i = N, is a feasible point of D for any choice of θ i such that θ 0 =, θ i > 0, and θ i Ω i := { i l=0 θ l, i = 0,...,N, N l=0 θ l +θ N, i = N. 4.0 Proof See Appendix B. The followinglemma alsospecifies somefeasible points of D ; this lemma reducesto [6, Lemma3.] and 4.7 when t i = T i for all i. Lemma 3 For the following step coefficients: h i+,k = t i+ T i+ t k i j=k+ h j,k + ti ti+, i = 0,...,N, k = 0,...,i, T i+, i = 0,...,N, k = i, 4. the choice of variables: γ = τ 0, λ i = T i τ 0, i =,...,N, τ i = { T N, i = 0, t i τ 0 i =,...,N, 4. is a feasible point of D for any choice of t i such that Proof See Appendix C. t 0 =, t i > 0, and t i T i := i t l. 4.3 Similar to the relationship between the step coefficients 4.6 and 4.7, the step coefficients 4.8 and 4. are identical when θ i = t i for i < N except for the final iteration, implying that the iterates {x i } N i=0 of the two FSFOMs with 4.8 and 4. are equivalent; only the final iterate x N is different. The step coefficients 4.6 and 4.7 lead to computationally efficient equivalent OGM forms; similarly the next section provides computationally efficient generalized forms of OGM that each correspond to a FSFOM with either 4.8 or 4., and we analyze their cost function worst-case bounds. l=0

12 4.5 Generalized OGM This section proposes a generalized OGM using lemmas and 3. The FSFOM with the step coefficients 4.8 has the following two equivalent efficient generalized forms of OGM, named GOGM and GOGM, that reduce to the standard OGM when θ i = Ω i for all i. Algorithm GOGM Input: f F L R d, x 0 = y 0 R d, θ 0 = Ω 0 =. For i = 0,...,N y i+ = x i L fx i Choose θ i+ > 0 s.t. θi+ Ω i+, where { i+ Ω i+ = l=0 θ l, i N N l=0 θ l +θ N, i = N x i+ = y i+ + Ω i θ i θ i+ θ i Ω i+ y i+ y i + θ i Ω iθ i+ θ i Ω i+ y i+ x i Algorithm GOGM Input: f F L R d, x 0 = y 0 R d, θ 0 = Ω 0 =. For i = 0,...,N y i+ = x i L fx i z i+ = x 0 i θ k fx k L Choose θ i+ > 0 s.t. θi+ Ω i+, where { i+ Ω i+ = l=0 θ l, i N N l=0 θ l +θ N, i = N x i+ = θ i+ y i+ + θ i+ z i+ Ω i+ Ω i+ Proposition The sequence {x 0,,x N } generated by the FSFOM with 4.8 is identical to the corresponding sequence generated by GOGM and GOGM. Proof See Appendix D. Note that this proof is independent of the choice of θ i and Ω i. Because the proof of Prop. for the FSFOM with step coefficients 4.8 is independent of the choice of θ i and Ω i, it is straightforward to show that the FSFOM with step coefficients 4. has the following two efficient equivalent forms, named GOGM and GOGM, that reduce to [6, Alg. OGM and OGM ] when t i = T i for all i. Algorithm GOGM Input: f F L R d, x 0 = y 0 R d, t 0 = T 0 =. For i = 0,...,N y i+ = x i L fx i Choose t i+ > 0 s.t. t i+ T i+ i+ = t l l=0 x i+ = y i+ + T i t i t i+ t i T i+ y i+ y i + t i T it i+ t i T i+ y i+ x i Algorithm GOGM Input: f F L R d, x 0 = y 0 R d, t 0 = T 0 =. For i = 0,...,N y i+ = x i L fx i z i+ = x 0 i t k fx k L Choose t i+ > 0 s.t. t i+ T i+ i+ = t l l=0 x i+ = t i+ y i+ + t i+ z i+ T i+ T i+ Clearly when θ i = t i for i < N, the primary iterates {y i } N i=0 and the intermediate secondary iterates {x i } N i=0 of GOGM and GOGM are equivalent. Although illustrating two similar algorithms GOGM and GOGM might seem redundant, presenting both formulations with lemmas and 3 completes the story of generalized OGM here and in Sec. 5. Using lemmas and 3, the following theorem bounds the cost function decrease of the GOGM and GOGM iterates.

13 Theorem 5 Let f : R d R be F L R d and let y 0,,y N,x N R d be generated by GOGM and GOGM. Then for any i N, fy i fx LR 4Ω i, 4.4 fx N fx LR Ω N. 4.5 The iterates y 0,,y N R d generated by GOGM and GOGM also satisfy the bound 4.4 when θ i = t i for i < N. Proof Using Lemma 3, the FSFOM with h in 4. of GOGM satisfies fy N+ fx B D h,n,l,r = LR γ = LR 4T N, 4.6 where y N+ = x N L fx N. Since the coefficients h in 4. are recursive and do not depend on a given N, we can extend 4.6 for all iterations. By letting θ i = t i for i < N, the bound 4.6 also satisfies for the iterates {y i } of GOGM, as in 4.4. Using Lemma, the FSFOM with the step h in 4.8 of GOGM satisfies which is equivalent to 4.5. fx N fx B D h,n,l,r = LR γ = LR Ω N, 4.7 GOGM and Thm. 5 reduce to OGM and its bounds 3.4 and 3.5, when{ θi = Ω i for all i. Similar i+a a to general forms of FGM in [3,6], the GOGM family includes the choice θ i =, i < N, N+a a, i = N for any a, because such parameter θ i satisfies the following conditions for GOGM: Ω i θ i = i+i+a a i+a a = a i +aa 3 a 0, 4.8 for i < N, and Ω N θ N = Ω N + θ N θ N θ N + θ N θ N = θ N 0. Similarly, the GOGM family includes the choice t i = i+a a for any a, which we denote as OGM-a. Corollary Let f : R d R be F L R d and let y 0,,y N R d be generated by GOGM with t i = i+a a OGM-a for any a. Then for any i N, fy i fx alr ii+a. 4.9 Proof Thm. 5 implies 4.9, since T i = i+i+a a any a. and the condition T i t i 0 satisfies in 4.8 for 5 Relaxation and optimization of the gradient form of PEP This sectionanalyzesaworst-casebound forthe gradientofanygogm and GOGM usingthe gradient form of PEP. We use relaxations on the gradient form of PEP that are similar but slightly different from those of PEP for the cost function in the previous section. Using this relaxed PEP, we prove that FGM has an O/N.5 rate for the worst-case gradient decrease, and analyze the worst-case gradient bound for the GOGM. Then, we optimize the step coefficients with respect to the gradient form of PEP and propose an algorithm named OGM-OG that lies in the GOGM family and that has the best known analytical worst-case bound for decreasing the gradient norm among the class FSFOM. 3

14 5. Relaxation for the gradient form of PEP To analyze a worst-case bound on the gradient for a FSFOM with a given h, we consider the following gradient-form version of PEP that is similar to P: B P h,n,d,l,r := max f F LR d, x 0,,x N R d, x X f, x 0 x R s.t. x i+ = x i L min fx i P i {0,...,N} i h i+,k fx k, i = 0,...,N. Here, we use the smallest gradient norm squared among all iterates min i {0,...,N} fx i as a criteria,asconsideredin[3,sec.4.3].wecouldinsteadconsiderthefinalgradientnormsquared fx N asacriteria,but ourproposedrelaxationonp inthis sectionforsuchcriteriaprovidedonlyano/n worst-case bound at best even for the corresponding optimized step coefficients results not shown; we leave studying the gradient form of the tight PEP as future work. As in [3], we replace min i {0,...,N} fx i in P by L x 0 x α with the condition α L x 0 x fx i = Tr { G u i u i G} for all i. Then, we relax this reformulated P similar to the relaxation from P to P with the additional constraint Tr { G C N G } δ N in P as follows: B P h,n,d,l,r := max G R N+d, δ R N+, α R L R α P s.t. Tr { G A i,i hg } δ i δ i, i =,...,N, Tr { G C N G } δ N, Tr { G D i hg+νu i G } δ i, i = 0,...,N, Tr { G u i u i G} α, i = 0,...,N. Replacing max G,δ L R α by min G,δ { α} for convenience, the Lagrangian of the corresponding constrainedminimizationproblemwithdualvariablesλ R N +,η R +,τ R N+ +,andβ = β 0,,β N R N+ + for the first, second, third, and fourth set of constraint inequalities of P respectively becomes where N N N L G,δ,λ,η,τ,β;h = α+ λ i δ i δ i ηδ N + τ i δ i + β i α 5. i= i=0 i=0 +Tr { G S h,λ,η,τ,βg+ντ G }, S h,λ,η,τ,β := Sh,λ,τ+ N ηu Nu N β i u i u i. 5. Then similar to D, we have the following dual problem of P that one could use to compute an upper bound of the PEP P of the smallest gradient norm squared among all iterates by a numerical SDP solver: S h,λ,η,τ,β τ where B D h,n,l,r := Λ := { min λ,η,τ,β Λ, L R γ : γ R { λ,η,τ,β R 3N+3 + : i=0 τ γ } 0, D N τ 0 = λ, λ N +τ N = η, i=0 β } i =,. 5.3 λ i λ i+ +τ i = 0, i =,...,N The next two sections use a valid upper bound D of P for given step coefficients h, providing an analytical solution to D for the step coefficients h of FGM and GOGM, superseding the use of a SDP solver. 4

15 5. A worst-case bound for the gradient norm of FGM FGM is equivalent to a FSFOM with the step coefficients [5, Prop. ]: t h i+,k = i+ t k i j=k+ h j,k, k = 0,...,i, + ti t i+, k = i, 5.4 for t i in 3.3. The following lemma provides a feasible point of D associated with the step coefficients 5.4 of FGM to provide a worst-case bound for the gradient of FGM. Lemma 4 For the step coefficients 5.4, the following choice of variables: γ = τ 0, λ i = t N, i τ 0, i =,...,N, τ i = k t i = 0, t i τ 0, i =,...,N, 5.5 η = t N τ 0, β i = t i τ 0, i = 0,...,N, 5.6 is a feasible point of D for t i in 3.3. Proof See Appendix E. Using Lemma 4, the following theorem bounds the gradient norm of the FGM iterates, proving for the first time an O/N.5 rate of decrease. Theorem 6 Let f : R d R be F L R d and let y 0,,y N,x 0,,x N R d be generated by FGM. Then for any N, where y N+ = x N L fx N. min fy i min fx i 5.7 i {0,...,N+} i {0,...,N} LR N t k 3LR N +N +6N +, Proof Lemma implies the first inequality in 5.7. Using Lemma 4, the FSFOM with the step coefficients h 5.4 of FGM satisfies min fx i B D h,n,l,r = i {0,...,N} L R γ = L R N, 5.8 t k which is equivalent to 5.7 using N t k N k+ 4 = N+N +3N A worst-case bound for the gradient norm of GOGM Having established the gradient bound 5.7 for FGM, this section and the next seek to improve on it by studying GOGM. To bound the gradient decrease of GOGM and GOGM, the following lemma illustrates one possible set of feasible points of D. Lemma 5 For the step coefficients 4., the following choice of variables: γ = N τ 0, λ i = T i τ 0, i =,...,N, τ i = Tk tk, i = 0, t i τ 0, i =,...,N, 5.9 η = T N τ 0, β i = T i t i τ0, i = 0,...,N 5.0 is a feasible point of D for any choice of t i and T i that satisfies 4.3 and for which there exists some i such that t i < T i. 5

16 Proof See Appendix F. Using Lemma 5, the following theorem bounds the worst-case gradient norm for the iterates of GOGM and GOGM. Theorem 7 Let f : R d R be F L R d and let y 0,,y N,x 0,,x N R d be generated by GOGM. Then for any N, min fy i min fx i i {0,...,N+} i {0,...,N} LR N, 5. T k t k where y N+ = x N L fx N. The bound 5. can be generalized to the intermediate iterates {x i } N i=0 and {y i } N i=0 of both GOGM and GOGM when θ i = t i for i < N. Proof Lemma implies the first inequality in 5.. Using Lemma 5, FSFOM with the step coefficients h 4. of GOGM satisfies min fx i B D h,n,l,r = i {0,...,N} L R L R γ = 4 N T k t k, which implies 5.. Since the iterates of GOGM are recursive and do not depend on a given N, the bound 5. easily generalizes to the intermediate iterates of GOGM and GOGM when θ i = t i. 5.4 Optimizing step coefficients over the gradient form of PEP In search of a FSFOM that decreases the gradient norm the fastest, we optimize the step coefficients in terms of the gradient form of the relaxed D by solving the following problem: ĥ D := argmin h R NN+/ B D h,n,l,r. HD The problem HD is bilinear, similar to HD, and a convex relaxation technique [9, Thm. 3] makes this problem solvable using numerical methods. We solved HD for many choices of N using a numerical SDP solver [4,], and observed that the following choice of t i :, i = 0, t i = + +4t i, i =,..., N/, N i+, i = N/,...,N, 5. makes the feasible point in Lemma 5 optimal for the problem HD. Based on that numerical evidence, we conjecture that ĥd in HD corresponds to the step coefficients 4. with the parameter t i 5.. The t i factors in 5. start decreasing after i = N/, whereas the usual t i in 3.3 and t i = i+a a for any a increase with i indefinitely. In addition, we found numerically that minimizing the gradient bound 5. of GOGM, i.e., solving the following constrained quadratic problem: k max {t i} N l=0 t l t k s.t. t i satisfies 4.3 for all i, 5.3 is equivalent to solving the problem HD. In other words, the solution of 5.3 numerically appears equivalent to 5., the conjectured solution of HD. The unconstrained maximizer of the cost function of 5.3 is t i = N i+, and this term partially appears in the constrained maximizer 5. for N/ i N. We denote the resulting GOGM with 5. as OGM-OG OG for optimized over gradient. The following theorem bounds the cost function and gradient norm of the OGM-OG iterates. 6

17 Theorem 8 Let f : R d R be F L R d and let y 0,,y N,x 0,,x N R d be generated by OGM-OG. Then, where y N+ = x N L fx N. fy N+ fx LR N + min fy i min fx i i {0,...,N+} i {0,...,N}, 5.4 6LR N N +, 5.5 Proof OGM-OG is an instance of GOGM and thus Thm. 5 implies that OGM-OG satisfies which is equivalent to 5.4, since T N = T m + fy N+ fx LR 4T N, N l=m+ t l = t m N +3 +NN N mn m+ 4 = N +8N +9 6 for m = N/, using m N, N m N, and t m m+ N+3 4 in 3.3. Thm. 7 implies 5.5, using the above inequalities for m, the equality t i = T i for i m, and N k=m+ =N mt m + Tk t k = N N k=m+ N m =N mt m + k = k=m+ k k l=m+ l = t m + k l=m+ N l+ N l m+ =N mt m N m N m+k kk k= =N mt m + N m k= k t l t k N k + N k m+ N m+ N m+k +k 4 N m+ +N m+3/4k 4 =N mt N mn m+/n m+ m 6 N mn m+3/4n m+ N mn m+ + 4 N mm+ + N m N m+ N mn m N m N m+ 3 4 N N +. The gradient bound 5.5 of OGM-OG is asymptotically -times smaller than that of FGM in Thm. 6 and.5-times smaller than that of OGM-m = N/3 in Thm. 4. Regarding the cost function decrease, the bound 5.4 of OGM-OG is asymptotically the same as the bound 3. of FGM, and both are twice larger than the bounds 3.4 and 3.5 of OGM. 7

18 5.5 Decreasing the gradient norm with rate O/N.5 using GOGM without selecting N in advance Although OGM-OG satisfies a small worst-case gradient bound with a rate O/N.5, OGM-OG and FGM-m and OGM-m must select N in advance, unlike FGM. Using Thm. 7, the following corollary shows that OGM-a with a > can decrease the gradient with a rate O/N.5 without selecting N in advance. Cor. showed that OGM-a algorithm with a can decrease the cost function with an optimal rate O/N. Corollary Let f : R d R be F L R d and let y 0,,y N,x 0,,x N R d be generated by GOGM with t i = i+a a OGM-a for any a. Then for N, where y N+ = x N L fx N. min fy i min fx i 5.6 i {0,...,N+} i {0,...,N} a 6LR NN +a N +3a 4a, Proof Using T i = i+i+a a and 4.8, Thm. 7 implies 5.6 since N N T k t k = a k +aa 3k a = NN + a N +3a 4a 6a. a OGM-a forany a > hasagradientbound 5.6 that is about -times largerthan the bound 5.5 a of OGM-OG. This constant factor minimizes to when a = 4, and this OGM-a=4 has a worst-case gradient bound that is asymptotically equivalent to the bound 5.7 of FGM. Therefore, when one does not want to select N in advance, both FGM and OGM-a=4 and OGM-a for any a > will be useful for decreasing the gradient with a rate O/N.5. 6 Discussion This section summarizes analytical worst-case bounds of FSFOM discussed in the previous sections. This section also reports tight numerical worst-case bounds for exact comparison of algorithms because many of the analytical bounds are not guaranteed to be tight. 6. Summary of analytical worst-case bounds on the cost function and gradient norm Table summarizes the asymptotic rate of analytical worst-case bounds of all algorithms described in this paper. As discussed, OGM and OGM-OG have the best known worst-case bounds for the cost function and gradient decrease respectively in Table. However, since OGM has a slow worst-case rate for the gradient decrease, other algorithms such as FGM, OGM-m, OGM-OG, and OGM-a that satisfy both the optimal rate O/N for the function decrease and a fast rate O/N.5 for the gradient decrease could be preferableoverogm when one is interested in both the gradientdecrease as well as the function decrease, particularly when solving dual problems. In addition, when one does not want to choose N in advance, FGM and OGM-a could be preferable. 6. Tight worst-case bounds on the cost function and the gradient norm Since many worst-case bounds presented in Table are not guaranteed to be tight, we used the code in Taylor et al. [3] with SDP solvers [0,8] to compare tight numerical worst-case bounds for N =,,4,0,0,30,40,47,50. These numerical worst-case bounds are guaranteed to be tight, i.e., equivalent to the bounds of either P or P, when the large-scale condition d N + is satisfied [3, Thm. 5], and we assume this condition hereafter. Tables and 3 provide tight worst-case bounds for 8

19 Asymptotic worst-case bound Require selecting Algorithm Cost function Gradient norm N in advance GM 4 N N No FGM N 3N.5 No OGM N N No 9 OGM-m= N/3 4 N 3 6 N.5 Yes OGM-OG N 6N.5 Yes a OGM-a a > N a 6 a N.5 No OGM-a=4 N 3N.5 Table Asymptotic worst-case bounds on the cost function LR fx N fx and the gradient norm min i {0,...,N} LR fx i of GM, FGM, OGM, OGM-m, OGM-OG, and OGM-a. The worst-case cost function bound for OGM-m in the table corresponds to the bound for OGM after m iterations, because we do not have an analytical bound for the final iterate. the decrease of the cost function fx N fx and the gradient norm decrease min i {0,...,N} fx i respectively. Although most of the bounds in Table are not guaranteed to be tight, the worst-case rate formulas in Table are similar to the tight numerical results in Tables and 3, except that the gradient bounds of OGM-m in Table are relatively looser than those of OGM-m in Table 3. In particular, the tight numerical gradient bound of OGM-m is smaller than that of FGM in Table 3, which was not expected from their known possibly loose analytical bounds in Table. N GM FGM OGM OGM-m OGM-OG OGM-a Empi. O N.0 N.9 N.9 N.8 N.8 N.8 Known O N N N N N N LR Table Tight worst-case bounds on, the reciprocal of the cost function, of GM, FGM, OGM, OGMm= N/3, OGM-OG, and OGM-a=4. We computed empirical rates by assuming that the bounds follow the form bn c fx N fx with constants b and c, and then by estimating c from points N = 47,50. Note that the corresponding empirical rates are underestimated due to its simple modeling of bounds. N GM FGM OGM OGM-m OGM-OG OGM-a Empi. O N.0 N.4 N 0.9 N.4 N.4 N.4 Known O N N.5 N N.5 N.5 N.5 LR Table 3 Tight worst-case bounds on, the reciprocal of the gradient norm, of GM, FGM, OGM, min i {0,...,N} fx i OGM-m= N/3, OGM-OG, and OGM-a=4. Empirical rates were computed as described in Table. 9

20 6.3 Tight worst-case bounds on the gradient norm at the final iterate To be clear, Sec. 5 and Tables, 3 have focused on analyzing the smallest gradient norm among all iterates using the gradient form of PEP, whereas the gradient analysis in Sec. 3. considers the final gradient in addition to the smallest gradient among all iterates. As mentioned before, we have not yet found a relaxation on the final gradient form of the PEP that provides as comparable results as for the relaxation on the smallest gradient form of the PEP P in Sec. 5. To complete comparisons on the worst-case gradient bounds, Table 4 uses the code provided by Taylor et al. [3] with SDP solvers [0, 8] to compare tight numerical worst-case bounds on the final gradient of the FSFOMs presented in this paper. 5 The worst-case smallest gradient norm bounds 3. and 3.6 of GM and OGM-m respectively among algorithms considered extend to the final gradient bounds. In Table 4, FGM and OGM-a=4 have slow O/N tight worst-case bounds on the final gradient, unlikeogm-m= N/3 andogm-ogroughlyhavingo/n.5 boundsforboth the smallestand final gradients. Thm. 4 has shown that the final gradient of OGM-m satisfies a worst-case rate O/N.5, but this is unknown yet for OGM-OG, which we leave as future work. We also leave as future work the challenge of developing a FSFOM that has O/N.5 or even faster worst-case rates for the final gradient decrease that are lower than those of OGM-m and OGM-OG, possibly without requiring to choose N in advance. N GM FGM OGM OGM OGM OGM-a y N x N y N x N -m -OG y N x N Empi. O N.0 N 0.9 N 0.9 N.4 N.3 N 0.9 Known O N N N N.5 N N LR LR and fx N fy N Table 4 Tight worst-case bounds on, the reciprocal of the final gradient norm, of GM, FGM, OGM, OGM-m= N/3, OGM-OG, and OGM-a=4. Empirical rates were computed as described in Table. The known bounds of OGM-OG and OGM-a=4 are derived based on Sec. 3., where the empirical bounds of OGM-OG are comparable to the bounds of OGM-m= N/3 with known rate O/N Non-optimality of OGM-OG in terms of the worst-case gradient bound Because OGM is optimal in terms of the function decrease when d N + [7], one might hope that the OGM-OG would achieve the optimal worst-case bound in terms of the gradient decrease, since OGM-OG is also derived by optimizing the step coefficients over the gradient form of relaxed PEP. However, the OGM-OG is apparently not optimal as explained next. Taylor et al. [3] numerically studied an optimal fixed-step GM using their tight PEP in terms of both the cost function and gradient decrease. In other words, they searched for an optimal step h of GM: x i+ = x i h L fx i for i = 0,...,N and a given N with respect to either fx N fx or fx N. In the special case of N =, Taylor et al. [3] numerically conjectured that the step size h =.5 is optimal in terms of the 5 Table 4 reports tight worst-case gradient bounds for both the final primary iterate y N and the final secondary iterate x N if necessary, unlike Tables and 3. We observed that numerical tight worst-case cost function bounds on both final iterates y N and x N have similar values for the algorithms in Table unlike Table 4, so we did not report the bounds on y N for simplicity. We also did not report numerical tight smallest worst-case gradient norm bounds min i {0,...,N} fy i of the primary iterates {y i } because the code provided by Taylor et al. [3] does not support computing their values, unlike that of the secondary iterates {x i } in Table 3. 0

21 cost function decrease. The corresponding GM is equivalent to OGM for N =, and this numerically confirms the optimality of OGM [7] for N =. They also numerically conjectured that the optimal step size of GM for N = in terms of the gradient decrease is h = with a worst-case bound fx LR + LR However, OGM-OG for N = reduces to GM with h = 4 LR 3.3 with a bound.3 in Tables 3 and 4, implying that OGM-OG is not optimal even for N = based on the numerical evidence in [3]. This analysis for N = illustrates that there is still room for improvement in accelerating the worstcase rate of first-order methods in terms of gradients, which we leave as future work possibly with a tighter relaxation on the gradient form of PEP. In addition, we leave as future work studying the optimal worst-case bound for the gradient decrease of first-order methods building upon [7, 3], and developing a FSFOM that achieves such optimal bound. Nevertheless, the OGM-OG is the best known FSFOM for decreasing the gradient norm among the class FSFOM, and will be useful when decreasing the gradient is key. 7 Conclusion We generalized the formulation of OGM and analyzed its worst-case bounds on the function value and gradient, using the cost function form and the gradient form of relaxed PEP. We then proposed OGM-OG by optimizing the step coefficients of FSFOM using a relaxed PEP with respect to the gradient, similar to the development of the optimal OGM. To the best of our knowledge, the worst-case bound on the gradient of the OGM-OG is the best known analytical worst-case bound for decreasing the gradient norm among the class FSFOM. However, this OGM-OG is not optimal for decreasing the gradient norm, and further accelerating the worst-case rate of FSFOM in terms of the gradient possibly with a tight relaxation on the gradient form of PEP is a possible research direction. On the other hand, deriving an optimal worst-case bound for the gradient norm of first-order methods, similar to that for the function decrease [7] will be useful. Nonetheless, the proposed OGM-OG and OGM-a may be useful when one finds minimizing gradients important, particularly in dual problems. In addition, we used the proposed gradient form of PEP to show that FGM decreases the smallest gradient with a rate O/N.5, implying that FGM is comparable in a big-o sense to OGM-m, OGM-OG and OGM-a for the gradient decrease. Our analysis considers unconstrained smooth convex minimization; extending such gradient norm worst-case analysis to constrained problems or nonsmooth composite convex problems is a natural direction to pursue, which is studied for FGM or FISTA [] by the authors [7]. In addition, extending the analyses on the general form of FGM in [,3,9] to GOGM is a possible research direction. Lastly, investigating a new relaxation of the PEP approach that allows adaptive step size such as backtracking line-search or exact line-search [5, 8] is of interest. Software has Matlab codes for the algorithms considered and the SDP approaches in Sec. 5.4 and Sec. 6. A Proof of Thm. 3 Due to 3.0, we have max f F L R d, x Xf, x 0 x R min fx i i {0,...,N} max f F L R d, x Xf, x 0 x R fx N LR θ N,

arxiv: v5 [math.oc] 23 Jan 2018

arxiv: v5 [math.oc] 23 Jan 2018 Another look at the fast iterative shrinkage/thresholding algorithm FISTA Donghwan Kim Jeffrey A. Fessler arxiv:608.086v5 [math.oc] Jan 08 Date of current version: January 4, 08 Abstract This paper provides