A gradient type algorithm with backward inertial steps for a nonconvex minimization

A gradient type algorithm with backward inertial steps for a nonconvex minimization Cristian Daniel Alecsa Szilárd Csaba László Adrian Viorel November 22, 208 Abstract. We investigate an algorithm of gradient type with a backward inertial step in connection with the minimization of a nonconvex differentiable function. We show that the generated sequences converge to a critical point of the objective function, if a regularization of the objective function satisfies the Kurdyka- Lojasiewicz property. Further, we provide convergence rates for the generated sequences and the objective function values formulated in terms of the Lojasiewicz exponent. Finally, some numerical experiments are presented in order to compare our numerical scheme with some algorithms well known in the literature. Key Words. inertial algorithm, nonconvex optimization, Kurdyka- Lojasiewicz inequality, convergence rate AMS subject classification. 90C26, 90C30, 65K0 Introduction and Preliminaries Gradient-type algorithms have a long history, going back at least to Cauchy (847), and also a wealth of applications. Solving linear systems, Cauchy s original motivation, is maybe the most obvious application, but many of today s hot topics in machine learning or image processing also deal with optimization problems from an algorithmic perspective and rely on gradient-type algorithms. The original gradient descent algorithm x n+ = x n s g(x n ), which is precisely an explicit Euler method applied to the gradient flow ẋ(t) = g(x(t)), does not achieve very good convergence rates and much research has been dedicated to accelerating convergence. Based on the analogy to mechanical systems, e.g. to the movement, with friction, of a heavy ball in a potential well defined by the objective function g, Polyak [20] was able to provide the seminal idea Tiberiu Popoviciu Institute of Numerical Analysis, Romanian Academy, Cluj-Napoca, Romania and Department of Mathematics, Babes-Bolyai University, Cluj-Napoca, Romania, e-mail: cristian.alecsa@math.ubbcluj.ro. This work was supported by a grant of Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P-.-TE-206-0266, within PNCDI III. Technical University of Cluj-Napoca, Department of Mathematics, Str. Memorandumului nr. 28, 4004 Cluj-Napoca, Romania, e-mail: laszlosziszi@yahoo.com. This work was supported by a grant of Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P-.-TE-206-0266, within PNCDI III. Technical University of Cluj-Napoca, Department of Mathematics, Str. Memorandumului nr. 28, 4004 Cluj-Napoca, Romania, e-mail: oviorel@mail.utcluj.ro. This work was supported by a grant of Ministry of Research and Innovation, CNCS - UEFISCDI, project number PN-III-P-.-TE-206-0266, within PNCDI III.

for achieving acceleration namely the addition of inertial (momentum) terms to the gradient algorithm. Probably the most acclaimed inertial algorithm is Nesterov s accelerated gradient method, which in its particular form x n+ = y n s g(y n ), y n = x n + n n + 3 (x n x n ), for a convex g with Lipschitz continuous gradient, exhibits an improved convergence rate of O(/n 2 ) and which, as highlighted by Su, Boyd and Candés [22], can be seen as the discrete counterpart of the second order differential equation ẍ(t) + 3 ẋ(t) g(x(t)) = 0. t In this paper, we deal with the optimization problem inf g(x), x R m where g : R m R is a Fréchet differentiable function with L g -Lipschitz continuous gradient, i.e. there exists L g 0 such that g(x) g(y) L g x y for all x, y R m, and we associate to (P) the following inertial algorithm of gradient type. Consider the starting points x 0 = y 0 R m, and for every n N let x n+ = y n g(y n ), () y n = x n + α n (x n x n ), where we assume that lim α n = α n + ( 0 + ) 68, 0, lim 8 = β and 0 < β < 4α2 + 0α + 2 n + L g (2α + ) 2. Observe that the inertial parameter α n becomes negative after a number of iteration and this can be viewed as we take a backward inertial step in our algorithm. We emphasize that the analysis of the proposed algorithm is intimately related to the properties of the following regularization of the objective function g (see also [0,, 2, 5, 6]), that is H : R m R m R, defined by : H(x, y) = g(x) + M y x 2, M > 0. (2) In the remaining of this section we introduce the necessary apparatus of notions and results that we will need in our forthcoming analysis. For a differentiable function f : R m R we denote by crit(f) = {x R m : f(x) = 0} the set of critical points of f. The following so called descent lemma, see [9], will play an essential role in our forthcoming analysis. Lemma Let f : R m R be Frèchet differentiable with L Lipschitz continuous gradient. Then f(y) f(x) + f(x), y x + L 2 y x 2, x, y R m. Furthermore, the set of cluster points of a given sequence (x n ) n N will be denoted by ω((x n ) n N ). At the same time, the distance function to a set, is defined for A R m as dist(x, A) = inf y A x y for all x Rm. Our convergence result relies on the concept of a KL function. For η (0, + ], we denote by Θ η the class of concave and continuous functions φ : [0, η) [0, + ) such that φ(0) = 0, φ is continuously differentiable on (0, η), continuous at 0 and φ (s) > 0 for all s (0, η). (P) 2

Definition (Kurdyka- Lojasiewicz property) Let f : R n R be a differentiable function. We say that f satisfies the Kurdyka- Lojasiewicz (KL) property at x R n if there exists η (0, + ], a neighborhood U of x and a function φ Θ η such that for all x in the intersection the following inequality holds U {x R m : f(x) < f(x) < f(x) + η} φ (f(x) f(x)) f(x)). If f satisfies the KL property at each point in R m, then f is called a KL function. The function φ is called a desingularizing function (see for instance [5]. The origins of this notion go back to the pioneering work of Lojasiewicz [7], where it is proved that for a real-analytic function f : R m R and a critical point x R m (that is f(x) = 0), there exists θ [/2, ) such that the function f f(x) θ f is bounded around x. This corresponds to the situation when φ(s) = C( θ) s θ, where C > 0 is a given constant, and leads to the following definition. Definition 2 (for which we refer to [7, 7, ].) A differentiable function f : R m R has the Lojasiewicz property with exponent θ (0, ) at x crit(f) if there exist K, ϵ > 0 such that for every x R m, with x x < ϵ. f(x) f(x) θ K f(x), (3) The result of Lojasiewicz allows the interpretation of the KL property as a re-parametrization of the function values in order to avoid flatness around the critical points. Kurdyka [4] extended this property to differentiable functions definable in an o-minimal structure. Further extensions to the nonsmooth setting can be found in [7, 2, 8, 9]. To the class of KL functions belong semi-algebraic, real sub-analytic, semiconvex, uniformly convex and convex functions satisfying a growth condition. We refer the reader to [7, 2, 9, 6, 8, 3, ] and the references therein for more details regarding all the classes mentioned above and illustrating examples. Finally, an important role in our convergence analysis will be played by the following uniformized KL property given in [6, Lemma 6]. Lemma 2 Let Ω R m be a compact set and let f : R m R be a differentiable function. Assume that f is constant on Ω and f satisfies the KL property at each point of Ω. Then there exist ε, η > 0 and φ Θ η such that for all x Ω and for all x in the intersection {x R m : dist(x, Ω) < ε} {x R m : f(x) < f(x) < f(x) + η} (4) the following inequality holds φ (f(x) f(x)) f(x). (5) The outline of the paper is the following. In Section 2 we give a sufficient condition that ensures the decrease property of the regularization H in the iterates, and which also ensures that the iterates gap belongs to l 2. Further, using these results, we show that the set of cluster points of the iterates is included in the set of critical points of the objective function. Finally, by means of the the KL property of H we obtain that the iterates gap belongs to l. This implies the convergence of the iterates, see also [3, 6,, 5]. In Section 3, we obtain several convergence rates both for the sequences (x n ) n N, (y n ) n N generated by the numerical scheme (), as well as for the function values g(x n ), g(y n ) in the terms of the Lojasiewicz exponent of g and H, respectively, see also [3, 5]. Finally, in Section 4, we present some numerical experiment that shows that our algorithm, in many cases, has better properties than the algorithms used in the literature. 3

2 Convergence results We start to investigate the convergence of the proposed algorithm by showing that H is decreasing along certain sequences built upon the iterates generated by (). Theorem 3 Assume that g is bounded from below and consider the sequences (x n ) n N, (y n ) n N generated by the numerical scheme () together with u n = where δ n will be explicitly specified later. Then, there exists N N such that δn M (x n x n ) + y n, (6) (i) The sequence (H(y n, u n )) n N = ( g(y n ) + δ n x n x n 2) n N is decreasing and δ n > 0 for all n N. (ii) The sequence (H(y n, u n )) n N is convergent; (iii) n x n x n 2 < +. Remark 4 An interesting fact is that for the sequence (H(y n, u n )) n N to be decreasing one does not need the boundedness of the objective function g, but only its regularity, as can bee seen below, in the proof of Theorem 3. The energy decay is thus a structural property of the algorithm (). Proof. By applying the descent Lemma to g(y n+ ) we have g(y n+ ) g(y n ) + g(y n ), y n+ y n + L g 2 y n+ y n 2. However, after rewriting the first equation in () as g(y n ) = (y n x n+ ) and taking the inner product with y n+ y n to obtain the descent inequality becomes g(y n ), y n+ y n = y n x n+, y n+ y n, g(y n+ ) L g 2 y n+ y n 2 g(y n ) + y n x n+, y n+ y n. (7) Further, y n x n+, y n+ y n = y n+ y n 2 + y n+ x n+, y n+ y n, y n+ y n = ( + α n+ )(x n+ x n ) α n (x n x n ) and hence y n+ x n+ = α n+ ((x n+ x n ), ( g(y n+ ) + L ) g y n+ y n 2 g(y n ) + α n+ x n+ x n, y n+ y n. (8) 2 Thus we have y n+ y n 2 = ( + α n+ )(x n+ x n ) α n (x n x n ) 2 = 4

and and ( + α n+ ) 2 x n+ x n 2 + α 2 n x n x n 2 2α n ( + α n+ ) x n+ x n, x n x n, x n+ x n, y n+ y n = x n+ x n, ( + α n+ )(x n+ x n ) α n (x n x n ) = Replacing the above equalities in (8) gives ( + α n+ ) x n+ x n 2 α n x n+ x n, x n x n. g(y n+ ) + (2 L g )( + α n+ ) 2 2α n+ ( + α n+ ) x n+ x n 2 2 g(y n ) (2 L g )α 2 n 2 x n x n 2 + (2 L g )α n ( + α n+ ) α n α n+ x n+ x n, x n x n. The above inequality motivates the introduction of the following notations A n = (2 L g )( + α n+ ) 2 2α n+ ( + α n+ ) 2, B n = (2 L g )α 2 n 2, C n = (2 L g )α n ( + α n+ ) α n α n+ 2 δ n := A n + C n and n = A n + B n + C n + C n (9) for all n N, n. In terms of these notations, after using the equality we can write 2 x n+ x n, x n x n = x n+ x n 2 x n+ x n 2 x n x n 2, C n x n+ x n 2 + g(y n+ ) + (A n + C n ) x n+ x n 2 g(y n ) + ( C n B n ) x n x n 2. (0) Consequently, we have C n x n+ x n 2 + n x n x n 2 (g(y n ) + δ n x n x n 2 ) (g(y n+ ) + δ n+ x n+ x n 2 ). () ( ) Now, since α n α, β as n + and α 0+ 68 8, 0, 0 < β < 4α2 +0α+2 L g(2α+) we have 2 lim A n = (2 βl g)(α + ) 2 + 2α 2α 2, n + 2β lim B n = (2 βl g)α 2, n + 2β lim C n = (2 βl g)α( + α) α 2 < 0, n + 2β lim n = (2 βl g)(2α + ) 2 + 2α 4α 2 > 0, n + 2β lim δ n = (2 βl g)(2α 2 + 3α + ) + 2α 3α 2 > 0. n + 2β 5

Hence, there exists N N and C > 0, D > 0 such that for all n N one has C n C, n D and δ n > 0 which, in the view of (), shows (i), that is, the sequence g(y n )+δ n x n x n 2 is decreasing for n N. By using () again, we obtain 0 < C x n+ x n 2 + D x n x n 2 (g(y n ) + δ n x n x n 2 ) (g(y n+ ) + δ n+ x n+ x n 2 ), for all n N, or, more convenient, that 0 < D x n x n 2 (g(y n ) + δ n x n x n 2 ) (g(y n+ ) + δ n+ x n+ x n 2 ), (2) for all n N. Let r > N. Summing up the latter relations gives D which leads to r x n x n 2 (g(y N ) + δ N x N x N 2 ) (g(y r+ ) + δ r+ x r+ x r 2 ) n=n g(y r+ ) + D r x n x n 2 g(y N ) + δ N x N x N 2. (3) n=n Now, taking into account that g is bounded from below, after letting r + we obtain which proves (iii). This also shows that hence x n x n 2 < + n=n lim x n x n 2 = 0, n + lim δ n x n x n 2 = 0. n + But then, using again the fact that g is bounded from below we have that the sequence g(y n ) + δ n x n x n 2 is bounded from below and also decreasing (see (i)) for n N, hence there exists lim g(y n) + δ n x n x n 2 R. n + Remark 5 Observe that conclusion (iii) in the hypotheses of Theorem 3 assures that the sequence (x n x n ) n N l 2, in particular that lim (x n x n ) = 0. (4) n + Lemma 6 In the framework of the optimization problem (P) assume that the objective function g is bounded from below and consider the sequences (x n ) n N and (y n ) n N generated by the numerical algorithm () and let (u n ) n N defined by (6). Then, the following statements are valid : (i) the sets of cluster points of (x n ) n N, (y n ) n N and (u n ) n N coincide and are contained in the set o critical points of g, i.e., ω((x n ) n N ) = ω((y n ) n N ) = ω((u n ) n N ) crit(g); 6

(ii) ω((y n, u n ) n N ) crit(h) = {(x, x) R m R m : x crit(g)}. Proof. (i) We start by proving ω((x n ) n N ) ω((u n ) n N ) and ω((x n ) n N ) ω((y n ) n N ). Bearing in mind that lim (x n x n ) = 0 and that the sequences (δ n ) n N, (α n ) n N and ( ) n N are convergent, n the conclusion is quite straightforward. Indeed, if x ω((x n ) n N ) and (x nk ) k N is a subsequence such that lim x n k = x then k and lim y n k = k lim u n k = k lim x n k + lim α n k lim (x n k x nk ) k k k lim δ n k lim (x n k x nk ) + k k lim k y n k imply that the sequences (x nk ) k N, (y nk ) k N and (u nk ) k N all converge to the same element x R m. In order to prove that ω((x n ) n N ) crit(g) we use the fact that g is a continuous operator. So, passing to the limit in g(y nk ) = (y nk x nk +) gives k g(x) = lim n k ) k = lim lim n k x nk +) n k k k and finally, as y nk x nk + = (x nk x nk +) + α nk (x nk x nk ), to g(x) = 0. The reverse inclusions follow in a very similar manner from the definitions of u n and y n. For proving the statement (ii), we rely on a direct computation yielding H(x, y) = ( g (x) + 2M (x y), 2M (y x)), (5) which, in turn, gives crit(h) = { (x, x) R m R m } : x crit(g) and allows us to apply (i) to obtain the desired conclusion. Some direct consequences of Theorem 3 (ii) and Lemma 6 are the following. Fact 7 In the setting of Lemma 6, let (x, x) ω((y n, u n ) n N ). It follows that x crit(g) and lim H(y n, u n ) = H(x, x). n Consequently, H is finite and constant on the set ω((y n, u n ) n N ). The arguments behind the proofs of the following facts are the same as those in Lemma 4 from [5]. Fact 8 If the assumptions from Lemma 6 hold true and if the sequence (x n ) n N is bounded, then the following conclusions hold up : (i) ω((y n, u n ) n N ) is nonempty and compact, (ii) lim dist((y n, u n ), ω((y n, u n ) n N )) = 0. n + 7

Remark 9 We emphasize that if g is coercive, that is lim x + g(x) = +, then g is bounded from below and (x n ) n N, (y n ) n N, the sequences generated by (), are bounded. Indeed, notice that g is bounded from below, being a continuous and coercive function (see [2]). Note that according to Theorem 3 the sequence D r n=n x n x n 2 is convergent hence is bounded. Consequently, from (3) it follows that y r is contained for every r > N, (N is defined in the hypothesis of Theorem 3), in a lower level set of g, which is bounded. Since (y n ) n N is bounded, taking into account (4), it follows that (x n ) n N is also bounded. Now, based on the conclusions of Lemma 6, we present a result which will be crucial latter on. For our next result, will denote the -norm and 2 will represent the 2-norm on the linear space R m R m. Lemma 0 Let H, (x n ) n N, (y n ) n N and (u n ) n N be as in all the previous results, with the mapping g bounded from below. Then, the following gradient inequalities hold true [ H(y n, u n ) 2 H(y n, u n ) (αn ) ] δn x n+ x n + + 4M x n x n (6) M and H(y n, u n ) 2 2 2 β 2 n ( x n+ x n 2 + 2 α n 2M ) 2 δn + 2δ n M x n x n 2. (7) M Proof. First of all note that from our numerical scheme () we have g(y n ) = ((x n x n+ )+α n (x n x n )). In terms of the on R m R m, we have H(y n, u n ) = ( g(y n ) + 2M (y n u n ), 2M (u n y n )) = g(y n ) + 2M (y n u n ) + 2M u n y n x n+ x n + α n x n x n + 4M which proves the desired inequality. Now, with respect to the Euclidean norm, similar arguments yield δn M x n x n, H(y n, u n ) 2 2 = g(y n ) + 2M(y n u n ) 2 + 2M(u n y n ) 2 δn ( ) 2 = g(y n ) 2M M (x n x n ) 2 + 4M 2 δn x n x n 2 M ( 2 ) βn 2 x n+ x n 2 α n δn 2 + 2 2M (x n x n ) + 4Mδ M n x n x n 2, completing the proof. Our main result concerning the convergence of the sequence (x n ) n N generated by the algorithm () to a critical point of the objective function g is the following. Theorem Consider the sequences (x) n N, (y n ) n N generated by the algorithm () and let the objective function g be bounded from below. If the sequence (x n ) n N is bounded and H is a KL function, then x n x n < + (8) n= and there exists an element x crit(g) such that lim x n = x. n + 8

Proof. Consider ( x, x) from the set ω((y n, u n ) n N ) under the assumptions of Lemma (6). It follows that x crit(g). Also, using Fact (7), we get that lim H(y n, u n ) = H( x, x). Furthermore, we consider n two cases: I. By using N from Theorem (3), assume that there exists n N, with n N, such that H(y n, u n ) = H( x, x). Then, since (H(y n, u n )) n N is a decreasing sequence, it follows that H(y n, u n ) = H(x, x), for every n n, Now, using (2) we get that for every n n we have the following inequality: 0 D x n x n 2 H(y n, u n ) H(y n+, u n+ ) = H( x, x) H( x, x) = 0. So, the sequence (x n ) n n is constant and the conclusion holds true. II. Now, we deal with the case when H(y n, u n ) > H(x, x), for every n N. So, consider the set Ω := ω((y n, u n ) n N ). From Fact (8) we have that the set Ω is nonempty and compact. Also, Fact (7) assures that mapping H is constant on Ω. From the hypotheses of the theorem we have that H is a KL function. So, according to Lemma 2, there exists ε > 0, η > 0 and a function φ Θ η, such that for all the points (z, w) from the set {(z, w) R m R m : dist((z, w), Ω) < ε} {(z, w) R m R m : H( x, x) < H(z, w) < η + H( x, x)} one has that On the other hand, using Fact (8), we obtain that exists an index n N, for which it is valid Because and since φ (H(z, w) H( x, x)) H(z, w). lim dist((y n, u n ), Ω) = 0. This means that there n + dist((y n, u n ), Ω) < ε, for all n n. lim H(y n, u n ) = H( x, x) n + H(y n, u n ) > H( x, x), then there exists another index n 2 N, such that for all n N H( x, x) < H(y n, u n ) < H( x, x) + η, for every n n 2. Taking n := max(n, n 2 ) we get that for each n n it follows Since the function φ is concave, we have φ (H(y n, u n ) H( x, x)) H(y n, u n ). φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x)) φ (H(y n, u n ) H( x, x)) (H(y n, u n ) H(y n+, u n+ )). Thus, the following relation takes place for each n n: φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x)) H(y n, u n ) H(y n+, u n+ ). H(y n, u n ) 9

On one hand, combining the inequality (2) and (6), it follows that for every n n φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x)) x n x n+ + D x n x n [ 2 α n + 4M δn M ] x n x n. (9) On the other hand, we know that the sequences (α n ) n N, ( ) n N and (δ n ) n N are convergent, and ( ) ( ) α n δn lim n + = β > 0, hence and + 4M are bounded. This shows that there M n N exists N N, N n and there exists M > 0, such that { α n, + 4M Thus, the inequality (9) becomes sup n N δn M } n N M. φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x)) D x n x n 2 M ( x n x n+ + x n x n ), (20) for every n N. This implies that for each n N, the following inequality holds : M x n x n D (φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x))) ( x n x n+ + x n x n ). From the well-known arithmetical-geometrical inequality, it follows that M D (φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x))) ( x n x n+ + x n x n ) x n+ x n + x n x n 3 Therefore, we obtain x n+ x n + x n x n 3 Consequently, we have + 3M 4D (φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x))). x n x n + 3M 4D (φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x))). 2 x n x n x n x n+ (2) 9M 4D (φ(h(y n, u n ) H( x, x)) φ(h(y n+, u n+ ) H( x, x))), for every n N, with n N. Now, by summing up the latter inequality from N to P N, we get that P x n x n x P + x P x N x N n= N + 9M 4D (φ(h(y N, u N ) H( x, x)) φ(h(y P +, u P + ) H( x, x))). 0

Now, it is time to use the fact that φ(0) = 0. In this setting, by letting P + and by using (4) we obtain It implies that x n x n x N x N + 9M φ(h(y N, u N ) H( x, x)) < +. 4D n= N x n x n < +, n= so the first part of the proof is done. At the same time, the sequence (S n ) n N, defined by S n = n x i x i i= is Cauchy. Thus, for every ε > 0, there exists a positive integer number N ε, such that for each n N ε and for all p N, one has S n+p S n ε. Furthermore, S n+p S n = n+p i=n+ x i x i n+p i=n+ (x i x i ) = x n+p x n. So, the sequence (x n ) n N is Cauchy, hence is convergent, i.e. there exists x R m, such that Thus, by using (i) of Lemma (6), it follows that lim x n = x. n + {x} = ω((x n ) n N ) crit(g), which leads to the conclusion of the second part of the present theorem. Remark 2 Since the class of semi-algebraic functions is closed under addition (see for example [6]) and (x, y) M x y 2 is semi-algebraic, the conclusion of the previous theorem holds if the condition H is a KL function is replaced by the assumption that g is semi-algebraic. Note that, according to Remark 9, the conclusion of Theorem remains valid if we replace in its hypotheses the conditions that g is bounded from below and (x n ) n N is bounded by the condition that g is coercive. Note that under the assumptions of Theorem we have lim n + y n = x and 3 Convergence rates lim g(x n) = lim g(y n) = g(x). n + n + In the following theorems we provide convergence rates for the sequence generated by (), but also for the function values, in terms of the Lojasiewicz exponent of H (see, also, [, 7, 5]). Note that the forthcoming results remain valid if one replace in their hypotheses the conditions that g is bounded from below and (x n ) n N is bounded by the condition that g is coercive.

Theorem 3 In the settings of problem (P) consider the sequences (x n ) n N, (y n ) n N generated by Algorithm (). Assume that g is bounded from below and that (x n ) n N is bounded, let x crit(g) be such that lim n + x n = x and suppose that H : R m R m R, H(x, y) = g(x) + M x y 2 fulfills the Lojasiewicz property at (x, x) crit H with Lojasiewicz exponent θ ( 0, 2]. Then, for every p > 0 there exist a, a 2, a 3, a 4 > 0 and k N such that the following statements hold true: (a ) g(y n ) g(x) a n p for every n k, (a 2 ) g(x n ) g(x) a 2 n p for every n k, (a 3 ) x n x a 3 n p 2 (a 4 ) y n x a 4 n p 2 for every n k, for all n k. Proof. We start by employing the ideas from the proof of Theorem 3, namely if there exists n N, with n N, for which one has that H(y n, u n ) = H(x, x), then it follows that the sequence (x n ) n n is constant. This leads to the fact that the sequence (y n ) n n is also constant. Furthermore, H(y n, u n ) = H(x, x) for all n n. That is, if the regularized energy is constant after a certain number of iterations, one can see that the conclusion follow in a straightforward way. Now, we can easily assume that In order to simplify notations we will use H(y n, u n ) > H(x, x) for all n N. H n := H(y n, u n ), H := H(x, x) and H n := H(y n, u n ). Our analysis aims at deriving a difference inequality for r n := H n H = H(y n, u n ) H(x, x) > 0 based on three previously established fundamental relations: a) the energy decay relation (2), for every n N H n H n+ D x n x n 2, b) the energy-gradient estimate (7), for every n N H n 2 2 βn 2 x n+ x n 2 + S n x n x n 2 where ( S n = 2 α n 2M ) 2 δn + 2δ n M, M 2

c) the Lojasiewicz inequality (3), for every n N (H n H) 2θ K 2 H n 2, where N is defined by the Lojasiewitz property of H at (x, x) such that for ϵ > 0 one has (y n, u n ) (x, x) < ϵ for all n N. By combining these three inequalities, one reaches (H n H) 2θ 2K2 β 2 nd (H n+ H n+2 ) + K2 S n D (H n H n+ ) and we are led to a nonlinear second order difference inequality r 2θ n 2K2 β 2 nd (r n+ r n+2 ) + K2 S n D (r n r n+ ), (22) for every n N. Using the fact that the positive sequence (r n ) n N is decreasing and converging to zero we have that r n+ r n and that r n < for n large enough. These observations lead, in the view θ (0, /2], to the existence of N 2 N, N 2 N such that rn 2θ θ-independent difference inequality for every n N 2, with the shorthand notations r 2θ n+ r n+ for all n N 2 and (22) reduces to the linear, r n µ n r n+2 + ( + λ n µ n )r n+, (23) λ n := D K 2 and µ n := 2 S n βns 2. n In order to derive our first convergence rate we proceed as follows: for any p > 0 we consider the diverging sequence (ξ n ) n N 2, ξ n +, defined by Since ξ n := lim ( + λ n µ n ) µ n+ n ξ n+ µ n n p (n + ) p n p. µ n + µ n ξ n + = lim λ n > 0, n there exists a large enough index k N, k N 2, such that for each n k, one has + λ n µ n µ n+ µ n ξ n+ + µ + (24) n ξ n Combining the above inequality with the difference inequality for r n yields after some computations r n + µ ( n + µ r n n+ + µ ) n+ µ n r n+ + ξ n+ + µ r n+ n+2 ξ n ξ n+ 3

for each n k, and after multiplying these inequalities from k to n and simplifying we have µ r k + r k k+ + µ n k + µ µ n r i+ n+ + r n+2 + µ. (25) n+ ξ i=k k ξ i+ ξ n+ ) p, which implies Now, using the fact that µ n+ ξ n+ = we obtain µ k where a = r k + r k+ + µ k ξk a ( n i=k ( n + 2 n + n n + µ = i+ ξ i=k i+ finally yields the desired estimate ( ) i + p = i + 2 ( ) p k +, n + 2 ( ) p µ n a r n+ + r n+2 n + 2 + µ n+ ξ n+ (k + ) p. Now, from the definition of r n ) p a ( n + ) p r n = g(y n ) g( x) + δ n x n x n 2, g(y n ) g( x) r n a n p. (26) In order to give an upper bound for the difference g(x n ) g( x), we consider the following chain of inequalities based upon Lemma : g(x n ) g(y n ) g(y n ), x n y n + L g 2 x n y n 2 = (y n x n+ ), α n (x n x n ) + L g 2 x n y n 2 = x n+ x n, α n (x n x n ) α 2 2 L g n x n x n 2. 2 Here, using the inequality x n+ x n, α n (x n x n ) [ ] x n+ x n 2 + (2 L g )α 2 2 2 L n x n x n 2, g leads, after some simplifications, to g(x n ) g(y n ) 2 (2 L g ) x n+ x n 2. By combining the inequality (2) with the fact that the sequence (g(y n )+δ n x n x n 2 ) n k is decreasing and converges to g(x), one obtains g(x n ) g(y n ) 2D (2 L g ) r n+, for all n k. (27) 4

From (26) we have r n+ a (n + ) p a n p, hence g(x n ) g(y n ) 2D (2 L g ) a, for all n k. np This means that for every n > k one has [ ] g(x n ) g( x) = (g(x n ) g(y n )) + (g(y n ) g(x)) a 2D (2 L g ) + n p. Since the sequence ( ) n N is convergent to β > 0, we can choose and we have a 2 = a sup n k 2D (2 L g ) + a g(x n ) g( x) a 2, for every n k. (28) np We continue the proof by establishing an estimate for x n x. By the triangle inequality and by summing up (2) from n k to P > n one has x P x n so, letting P gives P x k x k k=n x n x n + x P + x P + 9M [ φ(hn H) φ(h P + H) ], 4D x n x 9M 4D φ(h n H). Recall, however, that the desingularizing function is φ(t) = K θ t θ hence, x n x M r θ n, (29) with M = 9M K 4D( θ). Further, due to 0 < r n < combined with rn θ r n which holds for θ (0, /2], we conclude that where a 3 := a M. x n x M rn+ M rn a 3 ( n) p/2, for every n k, (30) Finally, we conclude this part of the proof by deducing an upper bound for y n x. The following inequalities hold true y n x = x n + α n (x n x n ) x + α n x n x + α n x n x ( + α n )a 3 + α n p n a 3 n p ( + 2 α n ) a 3 n p. 5

Let a 4 = sup n k ( + 2 α n )a 3 > 0. Then, y n x a 4, for all n k. (3) n p In case the Lojasiewicz exponent of the regularization function H is θ ( 2, ) we have the following result concerning the convergence rates of the sequences generated by (). Theorem 4 In the settings of problem (P) consider the sequences (x n ) n N, (y n ) n N generated by Algorithm (). Assume that g is bounded from below and that (x n ) n N is bounded, let x crit(g) be such that lim n + x n = x and suppose that H : R m R m R, H(x, y) = g(x) + M x y 2 fulfills the Lojasiewicz property at (x, x) crit H with Lojasiewicz exponent θ ( 2, ). Then, there exist b, b 2, b 3, b 4 > 0 such that the following statements hold true: (b ) g(y n ) g(x) b, for all n N n 2θ + 2; (b 2 ) g(x n ) g(x) b 2, for all n N n 2θ + 2; (b 3 ) x n x b 3, for all n N n 2θ θ + 2; (b 4 ) y n x b 4, for all n > N n 2θ θ + 2, where N N was defined in the proof of Theorem 3. Proof. To avoid triviality also here we assume that H n > H for all n N. We return to (22) which, after using r n+ r n, we rewrite as for all n N. Now, considering the real-valued function ϕ(t) = has that and by similar arguments (r n r n+ )r 2θ n+ + µ n(r n+ r n+2 )r 2θ n+ λ n, (32) ϕ(r n+ ) ϕ(r n ) = Assume that for some n N one has rn+ r n K 2θ t 2θ whose derivative is ϕ (t) = Kt 2θ, one ϕ (t)dt K(r n r n+ )r 2θ n ϕ(r n+2 ) ϕ(r n+ ) K(r n+ r n+2 )r 2θ n+. r 2θ n This leads to the following chain of inequalities : 2 r 2θ n+. ϕ(r n+ ) ϕ(r n ) + µ n (ϕ (r n+2 ) ϕ (r n+ )) K 2 (r n r n+ )r 2θ n+ + Kµ n(r n+ r n+2 )r 2θ n+ K 2 (r n r n+ )r 2θ n+ + K 2 µ n(r n+ r n+2 )r 2θ n+ (33) K 2 λ n. 6

On the other hand, if 2rn 2θ < rn+ 2θ, for some n N, then we obtain 2 2θ 2θ r 2θ n < rn+ 2θ. As a consequence, it follows that ϕ(r n+ ) ϕ(r n ) = K 2θ (r 2θ n+ r 2θ n ) K ) (2 2θ 2θ rn 2θ K ) (2 2θ 2θ r 2θ = C 2θ 2θ N. This implies that Now, let us denote ϕ(r n+ ) ϕ(r n ) + µ n (ϕ(r n+2 ) ϕ(r n+ )) C ( + µ n ). C ( + µ n ) inf n N λ n by C 2. Then ϕ(r n+ ) ϕ(r n ) + µ n (ϕ(r n+2 ) ϕ(r n+ )) C 2 λ n. (34) From the inequalities (33) and (34), it follows that there exist C > 0, such that ϕ(r n+ ) ϕ(r n ) + µ n (ϕ(r n+2 ) ϕ(r n+ )) Cλ n, for every n N. By taking µ = sup n N µ n and summing over the indices we reach Consequently n However, recall that both (r n ) n N and consequently for all n N + 2. Therefore, k=n ((ϕ(r k+ ) ϕ(r k )) + µ(ϕ(r k+2 ) ϕ(r k+ ))) C r n ϕ(r n+ ) ϕ(r N ) + µ(ϕ(r n+2 ) ϕ(r N k+ )) C n k=n λ k. n k=n λ k. and the function ϕ are decreasing, so we have ( + µ)ϕ(r n+2 ) C rn 2θ C(2θ ) K( + µ) ( ) C(2θ ) n 2 2θ λ k K( + µ) k=n n k=n λ k n 2 k=n λ k, 2θ for all n N + 2. The desired convergence rate, now follows from r n = g(y n ) g(x) + δ n x n x n 2 and n 2 k=n λ k λ(n N ), where 0 < λ = inf k N λ k. In fact, we have n 2 k=n λ k 2θ λ 2θ (n N ) 2θ λ 2θ Mn 2θ 7

for some M > 0 and for all n N + 2. Let b = ( ) C(2θ ) 2θ K(+µ) λ 2θ M. Then, g(y n ) g(x) b n 2θ, for all n N + 2. The other claims now follow quite easily. Indeed, note that (27) holds for every n N, hence g(x n ) g(y n ) 2D (2 L g ) r n+ Therefore, one obtains ( g(x n ) g(x) = (g(x n ) g(y n )) + (g(y n ) g(x)) 2D (2 L g ) b (n + ) 2θ. 2D (2 L g ) b + b ) n 2θ or every n N + 2. For proving (b 3 ), we use (29) again, and we have that for all n N + 2 it holds x n x M r θ n M (b n 2θ ) θ = b 3 n θ 2θ. = b2 n 2θ, The final estimate is a straightforward consequence of the definition of y n and the above estimates. Indeed, for all n > N + 2 one has y n x = x n + α n (x n x n ) x + α n x n x + α n x n x ( + α n )b 3 n θ 2θ + αn b 3 (n ) θ 2θ Let b 4 = sup n N +3 ( + 2 α n )b 3 ( n n ) θ 2θ > 0. Then, ( + 2 αn )b 3 (n ) θ 2θ. y n x b 4 n θ 2θ, for all n > N + 2. (35) Remark 5 According to [6], H is KL with Lojasiewicz exponent θ [ 2, ), whenever g is KL with Lojasiewicz exponent θ [ 2, ). Therefore, in the hypotheses of Theorem 3 and Theorem 4 it is enough to assume that g is KL with Lojasiewicz exponent θ [ 2, ). 4 Numerical experiments The aim of this section is to support the analytic results of the previous sections by numerical experiments and to highlight some interesting features of the generic algorithm (). To this end, let us consider Algorithm x n+ = y n s g(y n ), y n = x n 0.n n + 3 (x (36) n x n ), which is a specific version of the generic algorithm () with a constant, small enough, step-size and an always negative coefficient, i.e., = s and α n = 0.n n + 3 with 8 ( ) lim α n = 0. 0+ 68 n + 8, 0.

In order to give a better perspective on its advantages and disadvantages, we compare the proposed algorithm with: a) the standard Nesterov algorithm, in the case of convex objective functions; b) a Nesterov-like algorithm (see [5]), in the nonconvex case. In this respect, it may be useful to recall that until recently little was known about the efficiency of Nesterov s accelerated gradient method outside a convex setting. However, in [5], a Nesterov-like method, differing from the original only by a multiplicative coefficient, has been studied and convergence rates have been provided for the very general case of nonconvex coercive objective functions that have the KL property. The algorithms, considered in comparison to Algorithm, are: A form of Nesterov s accelerated gradient method { xn+ = y n s g(y n ), y n = x n + n n + 3 (x (37) n x n ), A Nesterov-like algorithm for nonconvex optimization { xn+ = y n s g(y n ), y n = x n + γn n + 3 (x n x n ), with γ (0, ). We start our experiments by testing Algorithm in the simplest case of a one-dimensional, quadratic function g(x) = x 2. Here, the comparison between Algorithm and Nesterov s accelerated gradient (see [8]), with both methods sharing the same step-size s = 0. shows that, although initially slower, at some point Algorithm surpasses Nesterov s method, as can be observed in Fig. a). The same behaviour can be observed also in a higher-dimensional setting, e.g., g(x, x 2 ) = 0.02x 2 + 0.005x 2 2, with the decay of both the energy error and the error in terms of iterates presented in Fig. c) and d), respectively. The difference of order between the decay in terms of the objective function and that of the norm of the iterates proved in Theorem 3 is confirmed by these two pictures. On the other hand, Fig. b) shows a comparison between the trajectories generated by the two algorithms, in the x x 2 -plane. A second set of numerical experiments is related to the minimization of the nonconvex, coercive function g : R R g(x) = ln( + (x 2 ) 2 ) depicted in Fig. 2 a). When the starting point lies within the convexity domain of g (e.g., x 0 = y 0 = 3, in Fig. 2 b)), Algorithm converges faster than a Nesterov-like algorithm, after a certain number of iterates. However, outside the convexity domain of g, e.g., for x 0 = y 0 = 0.0, the Nesterov-like algorithm clearly outperforms Algorithm (see Fig. 2 c)). Nevertheless, the very general structure of the generic algorithm () allows for much flexibility, as only the limit of the sequence (α n ) is prescribed. So, one can profit by switching between coefficients of different nature as is the case with the following modified version of Algorithm Algorithm 2 x n+ = y n ( s g(y n ), y n = x n + ( λ n ) λ n = ( 0. n n + 3 + exp( + (5n 20)), ) ) n + λ n (x n x n ), n + 3 which first copies the Nesterov-like algorithm and then, after a number of iterates, switches to Algorithm, thus outperforming the Nesterov-like algorithm (see Fig. 2 d)). (38) (39) 9

0 20 40 60 80 00 20 0 5 Nesterov 0 2 Nesterov 0 2 Nesterov-like Algorithm 0 5 Nesterov 0 Algorithm 0 0 8 6 0-5 4 0-0 2 0-5 0-20 0 Nesterov Algorithm -2-2 0 2 4 6 8 0 (a) g(x) = x 2, x 0 = 3, s = 2 (b) g = 0.02x 2 + 0.005x 2 2, x 0 = (3, ), s = 2 0 0 Algorithm 0 0 Algorithm 0-2 0-5 0-4 0-0 0-6 0-5 0-8 0-20 0-0 0-25 0 500 000 500 0-2 0 500 000 500 (c) g = 0.02x 2 + 0.005x 2 2, x 0 = (3, ), s = 2 (d) g = 0.02x 2 + 0.005x 2 2, x 0 = (3, ), s = 2 Figure : Minimizing quadratic functions in dimension one (in a) and two (in b-d)). 7 6 0 0 Algorithm 5 0-2 4 0-4 3 0-6 0-8 2 0-0 0-2 0-4 -3-2 - 0 2 3 4 0-4 0 5 0 5 20 25 (a) g : R R, g(x) = ln( + (x 2 ) 2 ) (b) x 0 = 3 with s = 0. 0 0 0 0 0-2 0-2 0-4 0-4 0-6 0-6 0-8 0-8 0-0 0-0 0-2 0-2 Nesterov-like Algorithm Algorithm 0-4 0 5 0 5 20 25 30 35 40 0-4 Nesterov-like Algorithm Algorithm 2 0-6 0 5 0 5 20 25 30 (c) x 0 = 0.0 with s = 0. (d) x 0 = 0.0 with s = 0. Figure 2: Minimizing the nonconvex function g(x) = ln( + (x 2 ) 2 ) by using different algorithms and different starting points. 20

References [] H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 6(-2) Series B, 5-6, 2009 [2] H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka- Lojasiewicz inequality, Mathematics of Operations Research 35(2), 438-457, 200 [3] H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming 37(-2) Series A, 9-29, 203 [4] H. Attouch, Z. Chbani, J. Peypouquet, P. Redont, Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity, Math. Program. 68(-2) Ser. B, 23-75, 208 [5] P. Bégout, J. Bolte, M.A. Jendoubi, On damped second-order gradient systems, Journal of Differential Equations (259), 35-343, 205 [6] J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming Series A (46)(-2), 459-494, 204 [7] J. Bolte, A. Daniilidis, A. Lewis, The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 7(4), 205-223, 2006 [8] J. Bolte, A. Daniilidis, A. Lewis, M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization 8(2), 556-572, 2007 [9] J. Bolte, A. Daniilidis, O. Ley, L. Mazet, Characterizations of Lojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362(6), 339-3363, 200 [0] R.I. Boţ, E.R. Csetnek, S.C. László, Approaching nonsmooth nonconvex minimization through second-order proximal-gradient dynamical systems, Journal of Evolution Equations 8(3), 29-38, 208 [] R.I. Boţ, E.R. Csetnek, S.C. László, An inertial forward-backward algorithm for minimizing the sum of two non-convex functions, Euro Journal on Computational Optimization 4(), 3-25, 206 [2] R.I. Boţ, E.R. Csetnek, S.C. László, A second order dynamical approach with variable damping to nonconvex smooth minimization, Applicable Analysis (208), https://doi.org/0.080/000368.208.495330 [3] P. Frankel, G. Garrigos, J. Peypouquet, Splitting Methods with Variable Metric for Kurdyka Lojasiewicz Functions and General Convergence Rates, Journal of Optimization Theory and Applications, 65(3), 874900, 205 [4] K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l institut Fourier (Grenoble) 48(3), 769-783, 998 [5] S.C. László, Convergence rates for an inertial algorithm of gradient type associated to a smooth nonconvex minimization, https://arxiv.org/abs/807.00387 2

[6] G. Li, T. K. Pong, Calculus of the Exponent of Kurdyka- Lojasiewicz Inequality and Its Applications to Linear Convergence of First-Order Methods, Foundations of Computational Mathematics, -34, 208 [7] S. Lojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique Paris, 87-89, 963 [8] Y.E. Nesterov, A method for solving the convex programming problem with convergence rate O(/k 2 ), (Russian) Dokl. Akad. Nauk SSSR 269(3), 543-547, 983 [9] Y. Nesterov, Introductory lectures on convex optimization: a basic course. Kluwer Academic Publishers, Dordrecht, 2004 [20] B. T. Polyak, Some methods of speeding up the convergence of iteration methods, U.S.S.R. Comput. Math. Math. Phys., 4(5):-7, 964 [2] R.T. Rockafellar, R.J.-B. Wets, Variational Analysis, Fundamental Principles of Mathematical Sciences 37, Springer-Verlag, Berlin, 998 [22] W. Su, S. Boyd, E.J. Candes, A differential equation for modeling Nesterov s accelerated gradient method: theory and insights, Journal of Machine Learning Research, 7, -43, 206 22