Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization


Szilárd Csaba László*

November 2018

* Technical University of Cluj-Napoca, Department of Mathematics, Str. Memorandumului nr. 8, 4004 Cluj-Napoca, Romania, laszlosziszi@yahoo.com. This work was supported by a grant of the Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P-.-TE, and by a grant of the Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P4-ID-PCE, within PNCDI III.

Abstract. We investigate an inertial algorithm of gradient type in connection with the minimization of a nonconvex differentiable function. The algorithm is formulated in the spirit of Nesterov's accelerated convex gradient method. We show that the generated sequences converge to a critical point of the objective function, if a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property. Further, we provide convergence rates for the generated sequences and the objective function values, formulated in terms of the Łojasiewicz exponent.

Key Words. inertial algorithm, nonconvex optimization, Kurdyka-Łojasiewicz inequality, convergence rate

AMS subject classification. 90C26, 90C30, 65K10

1 Introduction

Let g : R^m → R be a (not necessarily convex) Fréchet differentiable function with L_g-Lipschitz continuous gradient, i.e. there exists L_g ≥ 0 such that ‖∇g(x) − ∇g(y)‖ ≤ L_g ‖x − y‖ for all x, y ∈ R^m. We deal with the optimization problem

(P)   inf_{x ∈ R^m} g(x).

We associate to (P) the following inertial algorithm of gradient type. Consider the starting points x_0 = y_0 ∈ R^m and, for all n ∈ N,

x_{n+1} = y_n − s ∇g(y_n),                                    (1)
y_n = x_n + (βn/(n + α)) (x_n − x_{n−1}),                      (2)

where α > 0, β ∈ (0, 1) and 0 < s < 2(1 − β)/L_g.
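For concreteness, the scheme (1)-(2) can be implemented in a few lines. The following Python sketch is only an illustration: the test function, its local Lipschitz estimate and all parameter values are assumptions made for this example, they are not taken from the paper.

```python
import numpy as np

def inertial_gradient(grad_g, x0, alpha=3.0, beta=0.5, L_g=1.0, s=None, n_iter=500):
    """Minimal sketch of scheme (1)-(2):
       y_n     = x_n + (beta*n/(n + alpha)) * (x_n - x_{n-1})
       x_{n+1} = y_n - s * grad_g(y_n),     with 0 < s < 2*(1 - beta)/L_g."""
    if s is None:
        s = 0.99 * 2.0 * (1.0 - beta) / L_g     # just below the admissible upper bound
    x_prev = np.asarray(x0, dtype=float)        # plays the role of x_{n-1}
    x = x_prev.copy()                           # x_0 (= y_0)
    for n in range(n_iter):
        y = x + beta * n / (n + alpha) * (x - x_prev)
        x_prev, x = x, y - s * grad_g(y)
    return x

# Illustrative nonconvex (but semi-algebraic, hence KL) objective:
# g(x) = (x1^2 - 1)^2 + x2^2, whose critical points are (+-1, 0) and (0, 0).
def grad_g(x):
    return np.array([4.0 * x[0] * (x[0] ** 2 - 1.0), 2.0 * x[1]])

# L_g below is a rough Lipschitz estimate for grad g on the region visited by the iterates.
print(inertial_gradient(grad_g, x0=[0.5, 1.0], beta=0.5, L_g=12.0))   # approaches (1, 0)
```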

Note that (1)-(2) is a nonconvex descendant of the methods of Polyak [23] and Nesterov [21]. Indeed, in [23], Polyak introduced a modified gradient method for minimizing a smooth convex function g. His two-step iterative method, the so-called heavy ball method, takes the following form:

x_{n+1} = y_n − λ_n ∇g(x_n),
y_n = x_n + α_n (x_n − x_{n−1}),                               (3)

where α_n ∈ [0, 1) and λ_n > 0 is a step-size parameter. In a seminal paper [21], Nesterov proposed a modification of the heavy ball method in order to obtain optimal convergence rates for smooth convex functions. More precisely, Nesterov used α_n = (t_n − 1)/t_{n+1}, where t_n satisfies the recursion t_{n+1} = (1 + √(1 + 4 t_n²))/2, t_1 = 1, and put y_n also for evaluating the gradient. Additionally, λ_n is chosen in such a way that λ_n ≤ 1/L_g. His scheme in its simplest form is given by:

x_{n+1} = y_n − s ∇g(y_n),
y_n = x_n + ((t_n − 1)/t_{n+1}) (x_n − x_{n−1}),               (4)

where s ≤ 1/L_g. This scheme leads to the convergence rate g(x_n) − g(x*) = O(1/n²), where x* is a minimizer of the convex function g, and this rate is optimal among all methods having only information about the gradient of g and consecutive iterates [22]. By taking t_n = (n + a − 1)/a, a ≥ 2, in (4) we obtain an algorithm that is asymptotically equivalent to the original Nesterov method and leads to the same rate of convergence O(1/n²), see [15, 25]. This case has been considered by Chambolle and Dossal [15] in order to prove the convergence of the iterates of the modified FISTA algorithm (see [7]). We emphasize that Algorithm (1)-(2) has a similar form as the algorithm studied by Chambolle and Dossal (see [15] and also [8]), but we allow the function g to be nonconvex. Unfortunately, our analysis does not cover the case β = 1.

Su, Boyd and Candès [25] showed that in case t_n = (n + 1)/2 the algorithm (4) has as exact limit the second order differential equation

ẍ(t) + (α/t) ẋ(t) + ∇g(x(t)) = 0,                              (5)

with α = 3. Recently, Attouch and his co-authors [4, 6] proved that, if α > 3 in (5), then the generated trajectory x(t) converges to a minimizer of g as t → +∞, while the convergence rate of the objective function along the trajectory is o(1/t²). Further, in [5], some results concerning the convergence rate of the objective function g along the trajectory generated by (5) have been obtained in the subcritical case α ≤ 3. However, the convergence of the trajectories generated by (5) in case g is nonconvex is still an open question. Some important steps in this direction have been made in [14] (see also [12]), where convergence of the trajectories of a system that can be viewed as a perturbation of (5) has been obtained in a nonconvex setting. More precisely, in [14] the system

ẍ(t) + (γ + α/t) ẋ(t) + ∇g(x(t)) = 0                           (6)

is considered, and it is shown that the generated trajectory converges to a critical point of g, if a regularization of g satisfies the Kurdyka-Łojasiewicz property.
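The relation between the inertial coefficients mentioned above can be checked numerically. The short script below, with illustrative parameter values, compares Nesterov's coefficient (t_n − 1)/t_{n+1}, the Chambolle-Dossal choice t_n = (n + a − 1)/a, and the coefficient βn/(n + α) of (1)-(2), which stays bounded away from 1 since β < 1 (the case β = 1 is not covered by our analysis).

```python
import numpy as np

# Numerical comparison of the inertial coefficients (illustrative values only):
#   Nesterov:          (t_n - 1)/t_{n+1} with t_{n+1} = (1 + sqrt(1 + 4 t_n^2))/2, t_1 = 1
#   Chambolle-Dossal:  t_n = (n + a - 1)/a, which gives (t_n - 1)/t_{n+1} = (n - 1)/(n + a)
#   scheme (1)-(2):    beta*n/(n + alpha), bounded away from 1 because beta < 1
N = 100000
nesterov = np.empty(N)
t = 1.0
for n in range(N):
    t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
    nesterov[n] = (t - 1.0) / t_next
    t = t_next

n_arr = np.arange(1, N + 1)
a = 3.0                                      # Chambolle-Dossal parameter, a >= 2
chambolle_dossal = (n_arr - 1.0) / (n_arr + a)
beta, alpha = 0.9, 3.0                       # illustrative parameters of (1)-(2)
coeff_12 = beta * n_arr / (n_arr + alpha)

print(nesterov[-1], chambolle_dossal[-1], coeff_12[-1])   # ~1, ~1 and ~beta, respectively
```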

In what follows we show that, by choosing appropriate values of β, the numerical scheme (1)-(2) has as exact limit the continuous second order dynamical system (5) studied in [4-6, 25], and also the continuous dynamical system (6) studied in [14]. To this end we take small step sizes in (1)-(2) and follow the same approach as Su, Boyd and Candès in [25], see also [14]. For this purpose we rewrite (1)-(2) in the form

(x_{n+1} − x_n)/√s = (βn/(n + α)) · (x_n − x_{n−1})/√s − √s ∇g(y_n),        (7)

and introduce the Ansatz x_n ≈ x(n√s) for some twice continuously differentiable function x : [0, +∞) → R^m. We let n = t/√s and get x(t) ≈ x_n, x(t + √s) ≈ x_{n+1}, x(t − √s) ≈ x_{n−1}. Then, as the step size s goes to zero, from the Taylor expansion of x we obtain

(x_{n+1} − x_n)/√s = ẋ(t) + (1/2) ẍ(t) √s + o(√s)   and   (x_n − x_{n−1})/√s = ẋ(t) − (1/2) ẍ(t) √s + o(√s).

Further, since

√s ‖∇g(y_n) − ∇g(x_n)‖ ≤ √s L_g ‖y_n − x_n‖ = √s L_g (βn/(n + α)) ‖x_n − x_{n−1}‖ = o(√s),

it follows that √s ∇g(y_n) = √s ∇g(x_n) + o(√s). Consequently, (7) can be written as

ẋ(t) + (1/2) ẍ(t) √s + o(√s) = (βt/(t + α√s)) (ẋ(t) − (1/2) ẍ(t) √s + o(√s)) − √s ∇g(x(t)) + o(√s),

or, equivalently,

(t + α√s) (ẋ(t) + (1/2) ẍ(t) √s + o(√s)) = βt (ẋ(t) − (1/2) ẍ(t) √s + o(√s)) − √s (t + α√s) ∇g(x(t)) + o(√s).

Hence,

((α√s + (1 + β)t)/2) ẍ(t) √s + ((1 − β)t + α√s) ẋ(t) + √s (t + α√s) ∇g(x(t)) = o(√s).        (8)

Now, if we take β = 1 − γs < 1 in (8), for some 1/s > γ > 0, we obtain

((α√s + (2 − γs)t)/2) ẍ(t) √s + (γst + α√s) ẋ(t) + √s (t + α√s) ∇g(x(t)) = o(√s).

After dividing by √s and letting s → 0, we obtain

t ẍ(t) + α ẋ(t) + t ∇g(x(t)) = 0,

which, after division by t, gives (5), that is,

ẍ(t) + (α/t) ẋ(t) + ∇g(x(t)) = 0.

Similarly, by taking β = 1 − γ√s < 1 in (8), for some 1/√s > γ > 0, we obtain

((α√s + (2 − γ√s)t)/2) ẍ(t) √s + (γ√s t + α√s) ẋ(t) + √s (t + α√s) ∇g(x(t)) = o(√s).

After dividing by √s and letting s → 0, we get

t ẍ(t) + (γt + α) ẋ(t) + t ∇g(x(t)) = 0,

which, after division by t, gives (6), that is,

ẍ(t) + (γ + α/t) ẋ(t) + ∇g(x(t)) = 0.
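The formal derivation above can be probed numerically: for a small step size s and β = 1 − γ√s, the iterates x_n of (1)-(2) are expected to roughly track the solution x(t) of (6) at the times t = n√s. The one-dimensional test function, all parameter values and the crude Euler integrator in the sketch below are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dg(x):                                   # g(x) = (x^2 - 1)^2 / 4, a toy double well
    return x * (x * x - 1.0)

s, gamma, alpha, N = 1e-4, 1.0, 4.0, 5000
h = np.sqrt(s)
beta = 1.0 - gamma * h                       # the scaling that leads to (6)

# discrete scheme (1)-(2)
x_prev = x = 0.5                             # x_0 = y_0
traj = [x]
for n in range(N):
    y = x + beta * n / (n + alpha) * (x - x_prev)
    x_prev, x = x, y - s * dg(y)
    traj.append(x)

# ODE (6): x'' + (gamma + alpha/t) x' + dg(x) = 0, integrated by small explicit Euler steps
dt, t, z, v = 1e-3, h, 0.5, 0.0              # start at t = sqrt(s) to avoid the singularity at 0
ode_at_n = [z]
for k in range(1, N + 1):
    while t < k * h:
        acc = -(gamma + alpha / t) * v - dg(z)
        z, v, t = z + dt * v, v + dt * acc, t + dt
    ode_at_n.append(z)

# discrepancy between x_n and x(n*sqrt(s)); it shrinks as s is taken smaller
print(np.max(np.abs(np.array(traj) - np.array(ode_at_n))))
```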

4 Consequently, our numerical scheme can be seen as the discrete counterpart of the continuous dynamical systems 5 and 6, in a full nonconvex setting. The techniques for proving the convergence of use the same main ingredients as other algorithms for nonconvex optimization problems involving KL functions. More precisely, in the next section, we show a sufficient decrease property for the iterates, which also ensures that the iterates gap belongs to l, further we show that the set of cluster points of the iterates is included in the set of critical points of the objective function, and, finally, we use the KL property of an appropriate regularization of the objective function in order to obtain that the iterates gap belongs to l, which implies the convergence of the iterates, see also [3, 8, 3]. Moreover, in section 3, we obtain several convergence rates both for the sequences x n n N, y n n N generated by the numerical scheme, as well as for the function values gx n, gy n in the terms of the Lojasiewicz exponent of g and a regularization of g, respectively for some general results see [6]. The convergence of the generated sequences In this section we investigate the convergence of the proposed algorithm. We show that the sequences generated by the numerical scheme converge to a critical point of the objective function g, provided the regularization of g, Hx, y = gx + y x, is a KL function. The main tool in our forthcoming analysis is the so called descent lemma, see []. Lemma Let g : R m R be Frèchet differentiable with L g Lipschitz continuous gradient. Then gy gx + gx, y x + L g y x, x, y R m. Now we are able to obtain a decrease property for the iterates generated by. Theorem In the settings of problem, for some starting points x 0 = y 0 R m let x n n N, y n n N be the sequences generated by the numerical scheme. Consider the sequences and A n = sl g s C n = sl g s for all n N, n. Then, there exists N N such that + βn + α n + α β + βn + α n + α n + α δ n = A n C n + βn + α sn + α, β s n + α n + α i The sequence gy n + δ n x n x n n N is decreasing and δ n > 0 for all n N. Assume that g is bounded from below. Then, the following statements hold. ii The sequence gy n + δ n x n x n n N is convergent; iii n x n x n < +. 4

5 Proof. From we have gy n = s y n x n+, hence Now, from Lemma we obtain gy n, y n+ y n = s y n x n+, y n+ y n. consequently we have gy n+ gy n + gy n, y n+ y n + L g y n+ y n, and hence Further, Since we have, and gy n+ L g y n+ y n gy n + s y n x n+, y n+ y n. 9 y n x n+, y n+ y n = y n+ y n + y n+ x n+, y n+ y n, y n+ x n+ = βn + n + α + x n+ x n, gy n+ + s L g y n+ y n gy n + y n+ y n = βn+ n+α+ s x n+ x n, y n+ y n. 0 + βn + α + β + x n+ x n n + α + n + α x n x n, y n+ y n = + βn + α + β + x n+ x n n + α + n + α x n x n + βn + α + β + x n+ x n + x n x n n + α + n + α x n+ x n, y n+ y n = + βn + α + β + n + α + x n+ x n, n + α x n+ x n, x n x n, + βn + α + β + x n+ x n n + α + n + α x n x n = + βn + α + β + x n+ x n n + α + n + α x n+ x n, x n x n. Replacing the above equalities in 0, we obtain sl g +βn+α+β+ n+α+ βn++βn+α+β+ n+α+ gy n+ + x n+ x n s sl g n+α gy n x n x n + s sl g +βn+α+β+ n+α n+α+ n+α s 5 βn+ n+α+ x n+ x n, x n x n. =

6 For simplicity let for all n N. Hence we have B n = sl g n+α s gy n+ + A n x n+ x n C n x n+ x n, x n x n gy n B n x n x n. By using the equality we obtain x n+ x n, x n x n = x n+ + x n x n x n+ x n x n x n gy n+ + A n C n x n+ x n + C n x n+ + x n x n gy n + C n B n x n x n. Note that A n C n = δ n+ and let us denote n = B n +A n C n C n. Consequently the following inequality holds. C n x n+ +x n x n + n x n x n gy n +δ n x n x n gy n+ +δ n+ x n+ x n. and Since 0 < β < and s < β L g, we have lim A n = sl gβ + β β s lim B n = sl gβ s > 0, lim C n = sl gβ + β β s lim n = sl g β > 0, s > 0, > 0, lim δ n = β sl g β + > 0. s Hence, there exists N N and C > 0, D > 0 such that for all n N one has C n C, n D and δ n > 0 which, in the view of, shows i, that is, the sequence gy n +δ n x n x n is decreasing for n N. Assume now that g is bounded from below. By using again, we obtain 0 C x n+ + x n x n + D x n x n gy n + δ n x n x n gy n+ + δ n+ x n+ x n, for all n N, or more convenient, that 0 D x n x n gy n + δ n x n x n gy n+ + δ n+ x n+ x n, 3 for all n N. Let r > N. By summing up the latter relation we have D r x n x n gy N + δ N x N x N gy r+ + δ r+ x r+ x r n=n 6

7 which leads to gy r+ + D r x n x n gy N + δ N x N x N. 4 n=n Now, taking into account that g is bounded from below, by letting r + we obtain which proves iii. The latter relation also shows that hence x n x n + n=n lim x n x n = 0, lim δ n x n x n = 0. But then, from the fact that g is bounded from below we obtain that the sequence gy n +δ n x n x n is bounded from below. On the other hand, from i we have that the sequence gy n + δ n x n x n is decreasing for n N, hence there exists lim gy n + δ n x n x n R. Remark 3 Observe that conclusion iii in the hypotheses of Theorem assures that the sequence x n x n n N l, in particular that lim x n x n = 0. 5 Let us denote by ωx n n N the set of cluster points of the sequence x n n N, and denote by critg = {x R m : gx = 0} the set of critical points of g. In the following result we use the distance function to a set, defined for A R n as distx, A = inf y A x y for all x R n. Lemma 4 In the settings of problem, for some starting points x 0 = y 0 R m consider the sequences x n n N, y n n N generated by Algorithm. Assume that g is bounded from below and consider the function Consider further the sequence H : R m R m R, Hx, y = gx + y x. u n = δ n x n x n + y n, for all n N, where δ n was defined in Theorem. Then, the following statements hold true. i ωu n n N = ωy n n N = ωx n n N crit g; ii There exists and is finite the limit lim Hy n, u n ; iii ωy n, u n n N crit H = {x, x R m R m : x crit g}; iv Hy n, u n s x n+ x n + sn+α + δ n x n x n for all n N; 7

8 v Hy n, u n x s n+ x n + sn+α δ n + δn x n x n for all n N; vi H is finite and constant on ωy n, u n n N. Assume that x n n N is bounded. Then, vii ωy n, u n n N is nonempty and compact; viii lim disty n, u n, ωy n, u n n N = 0. Proof. i Let x ωx n n N. Then, there exists a subsequence x nk k N of x n n N such that lim x n k = x. k + Since by 5 lim x n x n = 0 and the sequences δ n n N, that which shows that lim y n k = k + lim u n k = k + lim x n k = x, k + n+α n N ωx n n N ωu n n N and ωx n n N ωy n n N. Further from, the continuity of g and 5, we obtain that gx = lim gy n k = k + s s lim k + lim y n k x nk + = k + [ x nk x nk + + k n k + α x n k x nk ] = 0. converge, we obtain Hence ωx n n N crit g. Conversely, if u ωu n n N then, from 5 results that u ωy n n N and u ωx n n N. Hence, ωu n n N = ωy n n N = ωx n n N crit g. ii is nothing else than ii in Theorem. For iii observe that Hx, y = gx + x y, y x, hence, Hx, y = 0 leads to x = y and gx = 0. Consequently crit H = {x, x R m R m : x crit g}. Further, consider y, u ωy n, u n n N. Then, there exists y nk, u nk k N y n, u n n N such that y, u = lim y n k, u nk = lim x n k, x nk = x, x. k + k + Hence, u = y = x ωx n n N crit g and x, x crit H. iv By using the -norm of R m R m and, for every n N we have Hy n, u n Hy n, u n = gy n + y n u n, u n y n = gy n + y n u n + u n y n gy n + δ n x n x n = s x n + n + α x n x n x n+ + δ n x n x n s x n+ x n + sn + α + δ n x n x n. 8

9 v We use the euclidian norm of R m R m, that is x, y = x + y for all x, y R m R m. We have: Hy n, u n = gy n + y n u n, u n y n = gy n + y n u n + u n y n = s x n x n+ + sn + α δ n x n x n + δ n x n x n s x n+ x n + sn + α δ n + δ n x n x n for all n N. vi follows directly from ii. Assume now that x n n N is bounded and let us prove vii, see also [4]. Obviously y n, u n n N is bounded, hence according to Weierstrass Theorem ωy n, u n n N, and also ωx n n N, is nonempty. It remains to show that ωy n, u n n N is closed. From i and the proof of iii we have ωy n, u n n N = {x, x R m R m : x ωx n n N }. 6 Hence, it is enough to show that ωx n n N is closed. Let be x p p N ωx n n N and assume that lim p + x p = x. We show that x ωx n n N. Obviously, for every p N there exists a sequence of natural numbers n p k +, k +, such that lim k + x n p k = x p. Let be ϵ > 0. Since lim p + x p = x, there exists P ϵ N such that for every p P ϵ it holds x p x < ϵ. Let p N be fixed. Since lim k + x n p = x p, there exists kp, ϵ N such that for every k kp, ϵ it k holds x n p x p < ϵ k. Let be k p kp, ε such that n p k p > p. Obviously n p k p as p + and for every p P ϵ x n p x < ϵ. kp Hence thus x ωx n n N. viii By using 6 we have lim p + x n p kp = x, lim disty n, u n, ωy n, u n n N = lim inf y n, u n x, x. x ωx n n N Since there exists the subsequences y nk k N and u nk k N such that lim k y nk = lim k u nk = x 0 ωx n n N it is straightforward that lim disty n, u n, ωy n, u n n N = 0. 9

Remark 5 We emphasize that if g is coercive, that is, lim_{‖x‖→+∞} g(x) = +∞, then g is bounded from below and the sequences (x_n)_{n∈N}, (y_n)_{n∈N} generated by (1)-(2) are bounded. Indeed, notice that g is bounded from below, being a continuous and coercive function (see [24]). Note that according to Theorem 2 the sequence (D Σ_{n=N}^{r} ‖x_n − x_{n−1}‖²)_{r>N} is convergent, hence bounded. Consequently, from (14) it follows that, for every r > N (where N is defined in the hypothesis of Theorem 2), y_r is contained in a lower level set of g, which is bounded. Since (y_n)_{n∈N} is bounded, taking into account (15), it follows that (x_n)_{n∈N} is also bounded.

In order to continue our analysis we need the concept of a KL function. For η ∈ (0, +∞], we denote by Θ_η the class of concave and continuous functions φ : [0, η) → [0, +∞) such that φ(0) = 0, φ is continuously differentiable on (0, η), continuous at 0, and φ'(s) > 0 for all s ∈ (0, η).

Definition (Kurdyka-Łojasiewicz property) Let f : R^n → R be a differentiable function. We say that f satisfies the Kurdyka-Łojasiewicz (KL) property at x̄ ∈ R^n if there exist η ∈ (0, +∞], a neighborhood U of x̄ and a function φ ∈ Θ_η such that for all x in the intersection

U ∩ {x ∈ R^n : f(x̄) < f(x) < f(x̄) + η}

the following inequality holds:

φ'(f(x) − f(x̄)) ‖∇f(x)‖ ≥ 1.

If f satisfies the KL property at each point in R^n, then f is called a KL function.

The origins of this notion go back to the pioneering work of Łojasiewicz [19], where it is proved that for a real-analytic function f : R^n → R and a critical point x̄ ∈ R^n (that is, ∇f(x̄) = 0) there exists θ ∈ [1/2, 1) such that the function |f − f(x̄)|^θ ‖∇f‖^{−1} is bounded around x̄. This corresponds to the situation when φ(s) = C s^{1−θ}. The result of Łojasiewicz allows the interpretation of the KL property as a re-parametrization of the function values in order to avoid flatness around the critical points. Kurdyka [17] extended this property to differentiable functions definable in an o-minimal structure. Further extensions to the nonsmooth setting can be found in [1, 9-11]. To the class of KL functions belong semi-algebraic, real sub-analytic, semiconvex, uniformly convex and convex functions satisfying a growth condition. We refer the reader to [1-3, 8-11] and the references therein for more details regarding all the classes mentioned above and for illustrating examples.

An important role in our convergence analysis will be played by the following uniformized KL property given in [8, Lemma 6].

Lemma 6 Let Ω ⊆ R^n be a compact set and let f : R^n → R be a differentiable function. Assume that f is constant on Ω and that f satisfies the KL property at each point of Ω. Then there exist ε > 0, η > 0 and φ ∈ Θ_η such that for all x̄ ∈ Ω and for all x in the intersection

{x ∈ R^n : dist(x, Ω) < ε} ∩ {x ∈ R^n : f(x̄) < f(x) < f(x̄) + η}        (17)

the following inequality holds:

φ'(f(x) − f(x̄)) ‖∇f(x)‖ ≥ 1.        (18)

The following convergence result is the first main result of the paper.

Theorem 7 In the settings of problem (P), for some starting points x_0 = y_0 ∈ R^m, consider the sequences (x_n)_{n∈N}, (y_n)_{n∈N} generated by Algorithm (1)-(2). Assume that g is bounded from below and consider the function

H : R^m × R^m → R,   H(x, y) = g(x) + (1/2) ‖y − x‖².

Assume that (x_n)_{n∈N} is bounded and that H is a KL function. Then the following statements are true:

11 a n x n x n < + ; b there exists x critg such that lim x n = x. Proof. Consider the sequence u n = δ n x n x n + y n, for all n N, that was defined in the hypotheses of Lemma 4. Furthermore, consider x, x ωy n, u n n N. Then, according to Lemma 4, the sequence Hy n, u n is decreasing for all n N, where N was defined in Theorem, further x crit g and lim Hy n, u n = Hx, x. We divide the proof into two cases. Case I. There exists n N, n N, such that Hy n, u n = Hx, x. Then, since Hy n, u n is decreasing for all n N and lim Hy n, u n = Hx, x we obtain that The latter relation combined with 3 leads to Hy n, u n = Hx, x for all n n. 0 D x n x n Hy n, u n Hy n+, u n+ = Hx, x Hx, x = 0 for all n n. Hence x n n n is constant and the conclusion follows. Case II. For every n N one has that Hy n, u n > Hx, x. Let Ω = ωy n, u n n N. Then according to Lemma 4, Ω is nonempty and compact and H is constant on Ω. Since H is KL, according to Lemma 6 there exist ε, η > 0 and φ Θ η such that for all z, w belonging to the intersection one has {z, w R n R n : distz, w, Ω < ε} {z, w R n R n : Hx, x < Hz, w < Hx, x + η} φ Hz, w Hx, x Hz, w. Since lim disty n, u n, Ω = 0, there exists n N such that disty n, u n, Ω < ϵ, n n. and Since there exists n N such that lim Hy n, u n = Hx, x Hy n, u n > Hx, x for all n N, Hx, x < Hy n, u n < Hx, x + η, n n. Hence, for all n n = maxn, n we have φ Hy n, u n Hx, x Hy n, u n. Since φ is concave, for all n N we have φhy n, u n Hx, x φhy n+, u n+ Hx, x

12 φ Hy n, u n Hx, x Hy n, u n Hy n+, u n+, hence, φhy n, u n Hx, x φhy n+, u n+ Hx, x Hy n, u n Hy n+, u n+ Hy n, u n for all n n. Now, from 3 and Lemma 4 iv we obtain φhy n, u n Hx, x φhy n+, u n+ Hx, x 9 D x n x n, s x n+ x n + sn+α + δ n x n x n for all n n. Since lim δ n = β sl gβ+ s > 0 and lim sn+α = β s N N, N n and M > 0 such that max s, sn + α + δ n M, 0 there exists for all n N. Hence, 9 becomes φhy n, u n Hx, x φhy n+, u n+ Hx, x 0 for all n N. Consequently, D x n x n M x n+ x n + x n x n, x n x n M D φhy n, u n Hx, x φhy n+, u n+ Hx, x x n+ x n + x n x n, for all n N. By using the arithmetical-geometrical mean inequality we have M D φhy n, u n Hx, x φhy n+, u n+ Hx, x x n+ x n + x n x n x n+ x n + x n x n 3 for all n N. Hence x n+ x n + x n x n 3 for all n N which leads to + 3M 4D φhy n, u n Hx, x φhy n+, u n+ Hx, x x n x n + 3M 4D φhy n, u n Hx, x φhy n+, u n+ Hx, x x n x n x n+ x n 9M 4D φhy n, u n Hx, x φhy n+, u n+ Hx, x

13 for all n N. Let P > N. By summing up from N to P we obtain P x n x n n=n x N x N + x P + x P + 9M 4D φhy N, u N Hx, x φhy P +, u P + Hx, x. Now, by letting P + and using the fact that φ0 = 0 and 5 we obtain that hence n=n x n x n x N x N + 9M 4D φhy N, u N Hx, x < +, x n x n < + n which is exactly a. Obviously the sequence S n = n k= x k x k is Cauchy, hence, for all ϵ > 0 there exists N ϵ N such that for all n N ϵ and for all p N one has But S n+p S n = n+p k=n+ S n+p S n ϵ. x k x k n+p k=n+ hence the sequence x n n N is Cauchy, consequently is convergent. Let Now, according to Lemma 4 i one has lim x n = x. {x} = ωx n n N crit g x k x k = x n+p x n which proves b. Remark 8 Since the class of semi-algebraic functions is closed under addition see for example [8] and x, y x y is semi-algebraic, the conclusion of the previous theorem holds if the condition H is a KL function is replaced by the assumption that g is semi-algebraic. Remark 9 Note that, according to Remark 5, the conclusion of Theorem 7 remains valid if we replace in its hypotheses the conditions that g is bounded from below and x n n N is bounded by the condition that g is coercive. Remark 0 Note that under the assumptions of Theorem 7 we have lim y n = x and lim gx n = lim gy n = gx. 3
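Before passing to the convergence rates of the next section, the Łojasiewicz form of the KL inequality used throughout can be checked numerically on simple examples. The toy script below verifies |f(x) − f(x̄)|^θ ≤ K ‖∇f(x)‖, equivalently the KL inequality with the desingularizing function φ(s) = (K/(1 − θ)) s^{1−θ}, near the critical point x̄ = 0; the functions, exponents and constants are chosen only for this illustration and are not taken from the paper.

```python
import numpy as np

# Toy check of the Lojasiewicz inequality |f(x) - f(0)|^theta <= K * |f'(x)| near x = 0
# (equivalently, the KL inequality with phi(s) = K/(1-theta) * s^(1-theta)):
#   f(x) = x^2 : exponent theta = 1/2 with K = 1/2
#   f(x) = x^4 : exponent theta = 3/4 with K = 1/4
x = np.linspace(-1.0, 1.0, 2001)
x = x[x != 0.0]                                   # exclude the critical point itself

for f, df, theta, K in [(x**2, 2.0 * x, 0.5, 0.5), (x**4, 4.0 * x**3, 0.75, 0.25)]:
    lojasiewicz_ok = np.all(np.abs(f) ** theta <= K * np.abs(df) + 1e-12)
    kl_ok = np.all(K * np.abs(f) ** (-theta) * np.abs(df) >= 1.0 - 1e-9)
    print(lojasiewicz_ok, kl_ok)                  # both print True for each function
```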

14 3 Convergence rates In this section we will assume that the regularized function H satisfies the Lojasiewicz property, which, as noted in the previous section, corresponds to a particular choice of the desingularizing function φ see [, 9, 9]. Definition Let f : R n R be a differentiable function. The function f is said to fulfill the Lojasiewicz property, if for every x crit f there exist K, ϵ > 0 and θ 0, such that fx fx θ K fx for every x fulfilling x x < ϵ. The number θ is called the Lojasiewicz exponent of f at the critical point x. This corresponds to the case when the desingularizing function φ has the form φt = K θ t θ. In the following theorems we provide convergence rates for the sequence generated by, but also for the function values, in terms of the Lojasiewicz exponent of H see, also, [, 9]. Note that the forthcoming results remain valid if one replace in their hypotheses the conditions that g is bounded from below and x n n N is bounded by the condition that g is coercive. Theorem In the settings of problem consider the sequences x n n N, y n n N generated by Algorithm. Assume that g is bounded from below and that x n n N is bounded, let x critg be such that lim x n = x and suppose that H : R n R n R, Hx, y = gx + x y fulfills the Lojasiewicz property at x, x crit H with Lojasiewicz exponent θ 0, ]. Then, for every p > 0 there exist a, a, a 3, a 4 > 0 and k N such that the following statements hold true: a gy n gx a n p for every n > k, a gx n gx a n p for every n > k, a 3 x n x a 3 n p a 4 y n x a 4 n p for every n > k, for all n > k. Proof. As we have seen in the proof of Theorem 7, if there exists n N, n N, where N was defined in Theorem, such that Hy n, u n = Hx, x, then, Hy n, u n = Hx, x for all n n and x n n n is constant. Consequently y n n n is also constant and the conclusion of the theorem is straightforward. Hence, in what follows we assume that Hy n, u n > Hx, x, for all n N. Let us fix p > 0 and let us prove a. For simplicity let us denote r n = Hy n, u n Hx, x > 0 for all n N. From we have From Lemma 4 v we have Hy n, u n s x n+ x n + n x n x n r n r n+ for all n N. sn + α δ n + δ n x n x n 4

15 for all n N. Let S n = sn+α δ n + δn, for all n N. It follows that, for all n N one has x n x n S n Hy n, u n s S n x n+ x n Hy n, u n S n s r n+ r n+. S n n+ Now by using the Lojasiewicz property of H at x, x crit H, and the fact that lim y n, u n = x, x, we obtain that there exists K, ϵ > 0 and N N, such that for all n N one has y n, u n x, x < ϵ, consequently r n r n+ n Hy n, u n n S n s r n+ r n+ S n n+ n K r θ n n S n s r n+ r n+ S n n+ n K r θ n n+ S n s r n+ r n+ = α n rn+ θ β n r n+ r n+, S n n+ where α n = n K S n and β n = n s S n n+. It is obvious that the sequences α n n N and β n n N are convergent, further lim α n > 0 and lim β n > 0. Now, since 0 < θ and r n+ 0, there exists N N, N N, such that r θ n+ r n+ for all n N. Note that this implies that 0 r n for all n N. Hence, r n α n β n + r n+ + β n r n+, for all n N. Let us define for every n > N the sequence Ξ n = +. Since lim α n β n + + β n+ Ξ n+ there exists k N, k N such that for all n k one has np n+ p n. Then, since p > 0 one has lim p Ξ n = β n + Ξ n α n β n + + β n+ Ξ n+ β n + Ξ n. = lim α n > 0, Consequently, or, equivalently r n r n + + β n+ Ξ n+ β n + Ξ n r n+ β n + Ξ n r n+ + β n r n+, for all n k, + β n+ r n+ + Ξ n+ β n + β n+ Ξ n+ r n+, 3 5

16 for all n k. Now 3 leads to n r k + k=k β k hence after simplifying we get But, β n+ Ξ n+ By denoting Hence, r k + + β k Ξ k r k+ β k = n+p n+ p, hence r k + β k + β k Ξ k r k+ + β r k k+ Ξ k k=k n k=k n k=k + β n k+ r k+ + Ξ k+ k=k β k + β k+ Ξ k+ r k+, β n + β r n+ + k+ + β r n+ n+. 4 Ξ k+ Ξ n+ n n k + p k + p + β = = k+ k + p n + p. Ξ k+ k=k k + p = a, we have a n + p r β n n+ + + n+ β r n+. n+ a n p a n + p r n = gy n gx + δ n x n x n gy n gx 5 which is a. For a we start from Lemma and and we have gx n gy n gy n, x n y n + L g x n y n = x n x n+ + s n + α x n x n, sl g x n x n + n + α s s By using the inequality X, Y x n+ x n, n + α x n x n n + α x n x n x n+ x n, + L g x n x n = n + α. n + α x n x n a X + Y for all X, Y R m, a R \ {0}, we obtain a sl g x n+ x n + sl g x n x n, n + α consequently From 3 we have gx n gy n s sl g x n+ x n. x n x n D gy n + δ n x n x n gy n+ + δ n+ x n+ x n 6

17 and since the sequence gy n + δ n x n+ x n n k is decreasing and has the limit gx, we obtain that gy n+ + δ n+ x n+ x n gx, consequently Hence, for all n k one has x n x n D r n. 6 gx n gy n sd sl g r n+. 7 Now, the identity gx n gx = gx n gy n + gy n gx and a lead to gx n gx for every n > k, which combined with 5 give sd sl g r n+ + a n p gx n gx sd sl g a n + p + a n p a + sd sl g n p = a n p, for every n > k. For a 3 observe, that by summing up from n k to P > n and using the triangle inequality we obtain P x P x n x k x k k=n x n x n + x P + x P + 9M 4D φhy n, u n Hx, x φhy P +, u P + Hx, x. By letting P + we get x n x x n x n + 9M 4D φhy n, u n Hx, x 9M 4D φhy n, u n Hx, x. But, φt = K θ t θ, hence x n x 9MK 4D θ Hy n, u n Hx, x θ = M r θ n, 8 where M = 9MK 4D θ. But assures that 0 r n which combined with θ 0, ] leads to r θ n we have The conclusion follow by 5, since we have x n x M rn. r n, consequently x n x M a n p = a 3 n p for every n > k. Finally, for n > k we have y n x = x n + n + α x n x n x 7

18 Let a 4 = + βa 3. Then for all n > k, which proves a 4. + n + α + n + α x n x + a 3 n p + n + α n + α x n x + n + α a 3 n p a 3. n p y n x a 4, n p Remark In the previous theorem we obtained convergence rates with order p, for every p > 0. This happened when we took in 4 β n+ n + p = Ξ n+ n + p. But actually we have shown more. If one takes β n+ Ξ n+ = ρ n+ > 0 where lim ρ n = 0 then one obtains that there exits k N and A > 0 such that for all n k one has hence 4 becomes α n β n + + ρ n+ β n + ρ n gy n gx A n k=k+ From here, as in the proof of Theorem, one can derive that and x n x = O gx n gx A n k=k+ n k=k+ + ρ k. + ρ k for some A > 0, and y n x = O + ρ k n k=k+. + ρ k Having in mind this general result, and taking into account that in [4], for the dynamical system 6 which, as it is shown in Introduction, can be viewed as the continuous counterpart of the numerical scheme, it was obtained finite time convergence of the generated trajectories for θ 0, and exponential convergence rate for θ =, it seems a valid question whether we can obtain exponential convergence rate for the sequences generated by, by choosing an appropriate sequence ρ n. We show in what follow that this is not possible. We have n k=k+ = e n k=k+ ln+ρk. + ρ k Obviously ln + ρ k > 0, for all k > k and lim k + ln + ρ k = 0. Now, by using the Cesàro-Stolz theorem we obtain that n k=k+ lim ln + ρ k = lim n ln + ρ n+ = 0, 8

19 hence n k=k+ ln + ρ k = on, which shows that O n k=k+ > O e n. + ρ k Remark 3 According to [8], H is KL with Lojasiewicz exponent θ [,, whenever g is KL with Lojasiewicz exponent θ [,. Therefore, we have the following corollary. Corollary 4 In the settings of problem consider the sequences x n n N, y n n N generated by Algorithm. Assume that g is bounded from below and that x n n N is bounded, let x critg be such that lim x n = x and suppose that g fulfills the Lojasiewicz property at x with Lojasiewicz exponent θ =. Then, for every p > 0 there exist a, a, a 3, a 4 > 0 and k N such that the following statements hold true: a gy n gx a n p for every n > k, a gx n gx a n p for every n > k, a 3 x n x a 3 n p a 4 y n x a 4 n p for every n > k, for all n > k. In case the Lojasiewicz exponent of the regularization function H is θ, we have the following result concerning the convergence rates of the sequences generated by. Theorem 5 In the settings of problem consider the sequences x n n N, y n n N generated by Algorithm. Assume that g is bounded from below and that x n n N is bounded, let x critg be such that lim x n = x and suppose that H : R n R n R, Hx, y = gx + x y fulfills the Lojasiewicz property at x, x crit H with Lojasiewicz exponent θ,. Then, there exist b, b, b 3, b 4 > 0 such that the following statements hold true: b gy n gx b, for all n N n θ + ; b gx n gx b, for all n N n θ + ; b 3 x n x b 3, for all n N n θ θ + ; b 4 y n x b 4, for all n > N n θ θ +, where N N was defined in the proof of Theorem. Proof. Also here, to avoid triviality, in what follows we assume that Hy n, u n > Hx, x, for all n N. From we have that for every n N it holds where α n = n K S n and β n = n s S n n+. r n r n+ α n r θ n+ β n r n+ r n+, 9

20 Hence, r n r n+ r θ n+ + β nr n+ r n+ r θ n+ α n, for all n N. Consider the function ϕt = K θ t θ where K is the constant defined at the Lojasiewicz property of H. Then ϕ t = Kt θ and we have Analogously, hence, ϕr n+ ϕr n = rn+ r n ϕ tdt = K rn r n+ t θ dt Kr n r n+ r θ n. ϕr n+ ϕr n+ Kr n+ r n+ r θ n+. Assume that for some n N it holds that rn θ Then r θ n+. ϕr n+ ϕr n + β n ϕr n+ ϕr n+ K r n r n+ r θ n+ + Kβ nr n+ r n+ r θ n+ 9 Conversely, if r θ n ϕr n+ ϕr n = Consequently, Let C = inf n N C +β n α n K r n r n+ r θ n+ + K β nr n+ r n+ r θ n+ K α n. < r θ n+ for some n N, then θ θ r θ n < rn+ θ, K θ r θ n+ r θ n K θ θ rn θ K θ θ r θ = C θ θ N. ϕr n+ ϕr n + β n ϕr n+ ϕr n+ C + β n. > 0. Then, ϕr n+ ϕr n + β n ϕr n+ ϕr n+ C α n. 30 From 9 and 30 we get that there exists C > 0 such that ϕr n+ ϕr n + β n ϕr n+ ϕr n+ Cα n, for all n N. Let β = sup n N β n. Then the latter relation becomes ϕr n+ ϕr n + βϕr n+ ϕr n+ Cα n, for all n N, which leads to Consequently, n k=n ϕr k+ ϕr k + βϕr k+ ϕr k+ C ϕr n+ ϕr N + βϕr n+ ϕr N + C n k=n α k. n k=n α k 0

21 and by using the fact that the sequence r n n N is decreasing and ϕ is also decreasing, we obtain Hence, In other words rn θ Cθ K + β + βϕr n+ C n n Cθ θ r n α k K + β k=n n k=n α k. k=n α k, for all n N +. θ, for all n N +. Since n α k=n k αn N, where 0 < α = inf k N α k we have that there exists M > 0 such that n k=n α k Therefore, we have θ α θ n N θ Cθ θ r n α θ Mn θ K + β But, r n = gy n gx + δ n x n x n, consequently α θ Mn θ, for all n N +. = b n θ, for all n N +. gy n gx b n θ, for all n N + and b is proved. For b observe that 7 holds for all n N, hence for all n N one has gx n gy n sd sl g r n+ Thus, there exists M > 0 such that gx n gx = gx n gy n + gy n gx sd sl g b n + θ. sd sl g b M + b n θ = b n θ, for all n N +. For proving b 3 we use 8. Note that the relation x n x M rn θ holds for all n N. Hence, x n x M b n θ θ, for all n N +. Consequently, x n x b 3 n θ θ, for all n N +, where b 3 = M b θ and this proves b 3. For b 4 observe that for n N + 3 we have y n x = x n + n + α x n x n x

≤ ‖x_n − x̄‖ + (βn/(n + α)) ‖x_n − x_{n−1}‖
≤ ‖x_n − x̄‖ + (βn/(n + α)) (‖x_n − x̄‖ + ‖x_{n−1} − x̄‖)
≤ b_3/n^{(1−θ)/(2θ−1)} + (βn/(n + α)) (b_3/n^{(1−θ)/(2θ−1)} + b_3/(n − 1)^{(1−θ)/(2θ−1)})
≤ b_4/n^{(1−θ)/(2θ−1)},

for a suitable constant b_4 > 0; such a constant exists since βn/(n + α) ≤ β and (n/(n − 1))^{(1−θ)/(2θ−1)} is bounded for n ≥ N + 3. This proves (b4).

According to Remark 13 we have the following corollary.

Corollary 16 In the settings of problem (P), consider the sequences (x_n)_{n∈N}, (y_n)_{n∈N} generated by Algorithm (1)-(2). Assume that g is bounded from below and that (x_n)_{n∈N} is bounded, let x̄ ∈ crit(g) be such that lim_{n→+∞} x_n = x̄, and suppose that g fulfills the Łojasiewicz property at x̄ with Łojasiewicz exponent θ ∈ (1/2, 1). Then, there exist b_1, b_2, b_3, b_4 > 0 such that the following statements hold true:

(b1) g(y_n) − g(x̄) ≤ b_1/n^{1/(2θ−1)} for all n ≥ N + 2;
(b2) g(x_n) − g(x̄) ≤ b_2/n^{1/(2θ−1)} for all n ≥ N + 2;
(b3) ‖x_n − x̄‖ ≤ b_3/n^{(1−θ)/(2θ−1)} for all n ≥ N + 2;
(b4) ‖y_n − x̄‖ ≤ b_4/n^{(1−θ)/(2θ−1)} for all n > N + 2,

where N ∈ N was defined in the proof of Theorem 11.

4 Conclusions

In this paper we show the convergence of a Nesterov type algorithm in a full nonconvex setting, assuming that a regularization of the objective function satisfies the Kurdyka-Łojasiewicz property. For this purpose, as a starting point, we show a sufficient decrease property for the iterates generated by our algorithm. Though our algorithm is asymptotically equivalent to Nesterov's accelerated gradient method, we cannot obtain full equivalence, due to the fact that, in order to obtain the above mentioned decrease property, we cannot allow the inertial parameter, more precisely the parameter β, to attain the value 1. Nevertheless, we obtain convergence rates of order 1/n^p for every p > 0, for the sequences generated by our numerical scheme and also for the function values along these sequences, provided the objective function, or a regularization of the objective function, satisfies the Łojasiewicz property with Łojasiewicz exponent θ ∈ (0, 1/2]. We also show that, at least with our techniques, exponential convergence rates cannot be obtained. In case the Łojasiewicz exponent of the objective function, or of a regularization of the objective function, is θ ∈ (1/2, 1), we obtain polynomial convergence rates.

A related future research direction is the study of a modified FISTA algorithm in a nonconvex setting. Indeed, let f : R^m → R ∪ {+∞} be a proper, convex and lower semicontinuous function and let g : R^m → R be a possibly nonconvex smooth function with L_g-Lipschitz continuous gradient. Consider the optimization problem

inf_{x ∈ R^m} f(x) + g(x).

We associate to this optimization problem the following proximal-gradient algorithm: for x_0, y_0 ∈ R^m, consider

x_{n+1} = prox_{sf}(y_n − s ∇g(y_n)),
y_n = x_n + (βn/(n + α)) (x_n − x_{n−1}),

where α > 0, β ∈ (0, 1) and 0 < s < 2(1 − β)/L_g. Obviously, when f ≡ 0, this scheme becomes the numerical scheme (1)-(2) studied in the present paper. We emphasize that it has a similar formulation as the modified FISTA algorithm studied by Chambolle and Dossal in [15], and the convergence of the generated sequences to a critical point of the objective function f + g would open the gate for the study of FISTA type algorithms in a nonconvex setting.
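For the composite problem above, a minimal numerical sketch of this proximal-gradient scheme is given below. The choice f(x) = λ‖x‖₁ (whose proximal operator is soft-thresholding), the quadratic smooth part g, and all parameter values are illustrative assumptions made for this example, not part of the paper.

```python
import numpy as np

def soft_threshold(z, tau):
    # proximal operator of tau*||.||_1 (classical soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def inertial_proximal_gradient(grad_g, prox_f, x0, alpha=3.0, beta=0.5, s=0.05, n_iter=1000):
    """Sketch of the proximal-gradient scheme above:
       y_n     = x_n + (beta*n/(n + alpha)) * (x_n - x_{n-1})
       x_{n+1} = prox_{s f}( y_n - s*grad_g(y_n) ),   with 0 < s < 2*(1 - beta)/L_g."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for n in range(n_iter):
        y = x + beta * n / (n + alpha) * (x - x_prev)
        x_prev, x = x, prox_f(y - s * grad_g(y), s)
    return x

# Illustrative composite objective: f(x) = lam*||x||_1 and g(x) = 0.5*||A x - b||^2
# (g is convex here only for simplicity; the scheme itself allows a nonconvex smooth g).
rng = np.random.default_rng(1)
A, b, lam = rng.normal(size=(20, 50)), rng.normal(size=20), 0.1
grad_g = lambda x: A.T @ (A @ x - b)
L_g = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of grad g
x_hat = inertial_proximal_gradient(grad_g, lambda z, t: soft_threshold(z, lam * t),
                                   x0=np.zeros(50), beta=0.5, s=0.9 * (1 - 0.5) * 2 / L_g)
print(np.count_nonzero(np.round(x_hat, 6)))  # a sparse approximate critical point
```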

References

[1] H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116(1-2), Series B, 5-16, 2009

[2] H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research 35(2), 438-457, 2010

[3] H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming 137(1-2), Series A, 91-129, 2013

[4] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Mathematical Programming 168(1-2), Series B, 123-175, 2018

[5] H. Attouch, Z. Chbani, H. Riahi, Rate of convergence of the Nesterov accelerated gradient method in the subcritical case α ≤ 3, ESAIM: COCV, doi:10.1051/cocv/2017083, 2017

[6] H. Attouch, J. Peypouquet, P. Redont, Fast convex optimization via inertial dynamics with Hessian driven damping, Journal of Differential Equations 261(10), 2016

[7] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM Journal on Imaging Sciences 2(1), 183-202, 2009

[8] J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming 146(1-2), Series A, 459-494, 2014

[9] J. Bolte, A. Daniilidis, A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17(4), 1205-1223, 2006

[10] J. Bolte, A. Daniilidis, A. Lewis, M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization 18(2), 556-572, 2007

[11] J. Bolte, A. Daniilidis, O. Ley, L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362(6), 3319-3363, 2010

[12] R.I. Boţ, E.R. Csetnek, S.C. László, Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems, Journal of Evolution Equations 18(3), 1291-1318, 2018

[13] R.I. Boţ, E.R. Csetnek, S.C. László, An inertial forward-backward algorithm for minimizing the sum of two non-convex functions, EURO Journal on Computational Optimization 4(1), 3-25, 2016

[14] R.I. Boţ, E.R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applicable Analysis, 2018

[15] A. Chambolle, Ch. Dossal, On the convergence of the iterates of the fast iterative shrinkage/thresholding algorithm, Journal of Optimization Theory and Applications 166(3), 968-982, 2015

[16] P. Frankel, G. Garrigos, J. Peypouquet, Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and Applications 165(3), 874-900, 2015

[17] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l'Institut Fourier (Grenoble) 48(3), 769-783, 1998

[18] G. Li, T.K. Pong, Calculus of the exponent of Kurdyka-Łojasiewicz inequality and its applications to linear convergence of first-order methods, Foundations of Computational Mathematics, 2018

[19] S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique, Paris, 87-89, 1963

[20] D.A. Lorenz, T. Pock, An inertial forward-backward algorithm for monotone inclusions, Journal of Mathematical Imaging and Vision 51(2), 311-325, 2015

[21] Y.E. Nesterov, A method for solving the convex programming problem with convergence rate O(1/k²), (in Russian) Dokl. Akad. Nauk SSSR 269(3), 543-547, 1983

[22] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers, Dordrecht, 2004

[23] B.T. Polyak, Some methods of speeding up the convergence of iteration methods, U.S.S.R. Computational Mathematics and Mathematical Physics 4(5), 1-17, 1964

[24] R.T. Rockafellar, R.J.-B. Wets, Variational Analysis, Fundamental Principles of Mathematical Sciences 317, Springer-Verlag, Berlin, 1998

[25] W. Su, S. Boyd, E.J. Candès, A differential equation for modeling Nesterov's accelerated gradient method: theory and insights, Journal of Machine Learning Research 17, 1-43, 2016


Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Iterative Convex Optimization Algorithms; Part One: Using the Baillon Haddad Theorem Charles Byrne (Charles Byrne@uml.edu) http://faculty.uml.edu/cbyrne/cbyrne.html Department of Mathematical Sciences

More information

A Majorize-Minimize subspace approach for l 2 -l 0 regularization with applications to image processing

A Majorize-Minimize subspace approach for l 2 -l 0 regularization with applications to image processing A Majorize-Minimize subspace approach for l 2 -l 0 regularization with applications to image processing Emilie Chouzenoux emilie.chouzenoux@univ-mlv.fr Université Paris-Est Lab. d Informatique Gaspard

More information

Subdifferential representation of convex functions: refinements and applications

Subdifferential representation of convex functions: refinements and applications Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential

More information

Laplace s Equation. Chapter Mean Value Formulas

Laplace s Equation. Chapter Mean Value Formulas Chapter 1 Laplace s Equation Let be an open set in R n. A function u C 2 () is called harmonic in if it satisfies Laplace s equation n (1.1) u := D ii u = 0 in. i=1 A function u C 2 () is called subharmonic

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Kantorovich s Majorants Principle for Newton s Method

Kantorovich s Majorants Principle for Newton s Method Kantorovich s Majorants Principle for Newton s Method O. P. Ferreira B. F. Svaiter January 17, 2006 Abstract We prove Kantorovich s theorem on Newton s method using a convergence analysis which makes clear,

More information

Inertial Douglas-Rachford splitting for monotone inclusion problems

Inertial Douglas-Rachford splitting for monotone inclusion problems Inertial Douglas-Rachford splitting for monotone inclusion problems Radu Ioan Boţ Ernö Robert Csetnek Christopher Hendrich January 5, 2015 Abstract. We propose an inertial Douglas-Rachford splitting algorithm

More information

Local strong convexity and local Lipschitz continuity of the gradient of convex functions

Local strong convexity and local Lipschitz continuity of the gradient of convex functions Local strong convexity and local Lipschitz continuity of the gradient of convex functions R. Goebel and R.T. Rockafellar May 23, 2007 Abstract. Given a pair of convex conjugate functions f and f, we investigate

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Continuous Functions on Metric Spaces

Continuous Functions on Metric Spaces Continuous Functions on Metric Spaces Math 201A, Fall 2016 1 Continuous functions Definition 1. Let (X, d X ) and (Y, d Y ) be metric spaces. A function f : X Y is continuous at a X if for every ɛ > 0

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

Convergence of descent methods for semi-algebraic and tame problems.

Convergence of descent methods for semi-algebraic and tame problems. OPTIMIZATION, GAMES, AND DYNAMICS Institut Henri Poincaré November 28-29, 2011 Convergence of descent methods for semi-algebraic and tame problems. Hedy ATTOUCH Institut de Mathématiques et Modélisation

More information

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison

Optimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big

More information

Une méthode proximale pour les inclusions monotones dans les espaces de Hilbert, avec complexité O(1/k 2 ).

Une méthode proximale pour les inclusions monotones dans les espaces de Hilbert, avec complexité O(1/k 2 ). Une méthode proximale pour les inclusions monotones dans les espaces de Hilbert, avec complexité O(1/k 2 ). Hedy ATTOUCH Université Montpellier 2 ACSIOM, I3M UMR CNRS 5149 Travail en collaboration avec

More information

Variational Analysis and Tame Optimization

Variational Analysis and Tame Optimization Variational Analysis and Tame Optimization Aris Daniilidis http://mat.uab.cat/~arisd Universitat Autònoma de Barcelona April 15 17, 2010 PLAN OF THE TALK Nonsmooth analysis Genericity of pathological situations

More information

2 Statement of the problem and assumptions

2 Statement of the problem and assumptions Mathematical Notes, 25, vol. 78, no. 4, pp. 466 48. Existence Theorem for Optimal Control Problems on an Infinite Time Interval A.V. Dmitruk and N.V. Kuz kina We consider an optimal control problem on

More information

arxiv: v1 [math.na] 25 Sep 2012

arxiv: v1 [math.na] 25 Sep 2012 Kantorovich s Theorem on Newton s Method arxiv:1209.5704v1 [math.na] 25 Sep 2012 O. P. Ferreira B. F. Svaiter March 09, 2007 Abstract In this work we present a simplifyed proof of Kantorovich s Theorem

More information

Spring School on Variational Analysis Paseky nad Jizerou, Czech Republic (April 20 24, 2009)

Spring School on Variational Analysis Paseky nad Jizerou, Czech Republic (April 20 24, 2009) Spring School on Variational Analysis Paseky nad Jizerou, Czech Republic (April 20 24, 2009) GRADIENT DYNAMICAL SYSTEMS, TAME OPTIMIZATION AND APPLICATIONS Aris DANIILIDIS Departament de Matemàtiques Universitat

More information

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16

Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 XVI - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No.16 A slightly changed ADMM for convex optimization with three separable operators Bingsheng He Department of

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

Convergence of generalized entropy minimizers in sequences of convex problems

Convergence of generalized entropy minimizers in sequences of convex problems Proceedings IEEE ISIT 206, Barcelona, Spain, 2609 263 Convergence of generalized entropy minimizers in sequences of convex problems Imre Csiszár A Rényi Institute of Mathematics Hungarian Academy of Sciences

More information

1 Introduction and preliminaries

1 Introduction and preliminaries Proximal Methods for a Class of Relaxed Nonlinear Variational Inclusions Abdellatif Moudafi Université des Antilles et de la Guyane, Grimaag B.P. 7209, 97275 Schoelcher, Martinique abdellatif.moudafi@martinique.univ-ag.fr

More information

REVIEW OF ESSENTIAL MATH 346 TOPICS

REVIEW OF ESSENTIAL MATH 346 TOPICS REVIEW OF ESSENTIAL MATH 346 TOPICS 1. AXIOMATIC STRUCTURE OF R Doğan Çömez The real number system is a complete ordered field, i.e., it is a set R which is endowed with addition and multiplication operations

More information

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities

On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities On the Iteration Complexity of Some Projection Methods for Monotone Linear Variational Inequalities Caihua Chen Xiaoling Fu Bingsheng He Xiaoming Yuan January 13, 2015 Abstract. Projection type methods

More information

Weak sharp minima on Riemannian manifolds 1

Weak sharp minima on Riemannian manifolds 1 1 Chong Li Department of Mathematics Zhejiang University Hangzhou, 310027, P R China cli@zju.edu.cn April. 2010 Outline 1 2 Extensions of some results for optimization problems on Banach spaces 3 4 Some

More information

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L.

McMaster University. Advanced Optimization Laboratory. Title: A Proximal Method for Identifying Active Manifolds. Authors: Warren L. McMaster University Advanced Optimization Laboratory Title: A Proximal Method for Identifying Active Manifolds Authors: Warren L. Hare AdvOl-Report No. 2006/07 April 2006, Hamilton, Ontario, Canada A Proximal

More information

Regularity estimates for fully non linear elliptic equations which are asymptotically convex

Regularity estimates for fully non linear elliptic equations which are asymptotically convex Regularity estimates for fully non linear elliptic equations which are asymptotically convex Luis Silvestre and Eduardo V. Teixeira Abstract In this paper we deliver improved C 1,α regularity estimates

More information

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems)

Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Accelerated Dual Gradient-Based Methods for Total Variation Image Denoising/Deblurring Problems (and other Inverse Problems) Donghwan Kim and Jeffrey A. Fessler EECS Department, University of Michigan

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Convex Analysis and Optimization Chapter 2 Solutions

Convex Analysis and Optimization Chapter 2 Solutions Convex Analysis and Optimization Chapter 2 Solutions Dimitri P. Bertsekas with Angelia Nedić and Asuman E. Ozdaglar Massachusetts Institute of Technology Athena Scientific, Belmont, Massachusetts http://www.athenasc.com

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Optimality Conditions for Nonsmooth Convex Optimization

Optimality Conditions for Nonsmooth Convex Optimization Optimality Conditions for Nonsmooth Convex Optimization Sangkyun Lee Oct 22, 2014 Let us consider a convex function f : R n R, where R is the extended real field, R := R {, + }, which is proper (f never

More information

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods

An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods An Accelerated Hybrid Proximal Extragradient Method for Convex Optimization and its Implications to Second-Order Methods Renato D.C. Monteiro B. F. Svaiter May 10, 011 Revised: May 4, 01) Abstract This

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches

Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Splitting Techniques in the Face of Huge Problem Sizes: Block-Coordinate and Block-Iterative Approaches Patrick L. Combettes joint work with J.-C. Pesquet) Laboratoire Jacques-Louis Lions Faculté de Mathématiques

More information

Geometric Descent Method for Convex Composite Minimization

Geometric Descent Method for Convex Composite Minimization Geometric Descent Method for Convex Composite Minimization Shixiang Chen 1, Shiqian Ma 1, and Wei Liu 2 1 Department of SEEM, The Chinese University of Hong Kong, Hong Kong 2 Tencent AI Lab, Shenzhen,

More information

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method

On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method Optimization Methods and Software Vol. 00, No. 00, Month 200x, 1 11 On the Local Quadratic Convergence of the Primal-Dual Augmented Lagrangian Method ROMAN A. POLYAK Department of SEOR and Mathematical

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms

Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

Dual and primal-dual methods

Dual and primal-dual methods ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method

More information

On the acceleration of the double smoothing technique for unconstrained convex optimization problems

On the acceleration of the double smoothing technique for unconstrained convex optimization problems On the acceleration of the double smoothing technique for unconstrained convex optimization problems Radu Ioan Boţ Christopher Hendrich October 10, 01 Abstract. In this article we investigate the possibilities

More information