A user's guide to Łojasiewicz/KL inequalities
1 A user's guide to Łojasiewicz/KL inequalities. Toulouse School of Economics, Université Toulouse I. SLRA, Grenoble, 2015
2 Motivations behind KL. f : R^n → R smooth; ẋ(t) = −∇f(x(t)) or x_{k+1} = x_k − λ_k ∇f(x_k). KL answers a counterintuitive phenomenon of gradient systems: bounded curves/sequences of gradient systems with f ∈ C^∞ may oscillate indefinitely (for any choice of steps...) and generate paths of INFINITE LENGTH.
3 Smooth counter-examples: Palis-De Melo, Absil et al. ẋ(t) = −∇f(x(t)); gradient method: x_{k+1} = x_k − λ_k ∇f(x_k). Idea: a spiraling bump yields spiraling curves. In polar coordinates, f(r, θ) = e^{−1/(1−r²)} ( 1 − 4r⁴/(4r⁴ + (1−r²)⁴) · sin(θ − 1/(1−r²)) ), r < 1.
4 Counter-examples (continued). Oscillations in optimization: very bad predictions, arbitrarily bad complexity. Even worse behaviors for nonsmooth functions; the same awful behaviors for more complex methods, e.g. forward-backward. Solutions to these bad behaviors. Topological approach: to avoid oscillations, use semialgebraic (or definable) functions. Ad hoc approach: study deformations of a class of nonsmooth sharp functions.
5 Definition: a subset of R^n is semialgebraic if it is a finite union of sets of the form {x ∈ R^n : f_i(x) = 0, g_j(x) < 0, i ∈ I, j ∈ J}, where I, J are finite and the f_i, g_j : R^n → R are real polynomials. Stability under finite unions, finite intersections, and complementation. A function or a mapping is semialgebraic if its graph is a semialgebraic set (same definition for extended-real-valued functions or for multivalued mappings).
6 How to recognize semialgebraic sets/functions? Theorem (Tarski-Seidenberg): the image of a semialgebraic set by a linear projection is semialgebraic. Example: let A be a semialgebraic subset of R^n and f : R^n → R^p semialgebraic. Then f(A) is semialgebraic. Proof: f(A) = {y ∈ R^p : ∃x ∈ A, y = f(x)} = Π(graph f ∩ (A × R^p)), where Π(x, y) = y for all (x, y) ∈ R^n × R^p.
8 Checking semialgebraicity is easy. Typical illustration: the closure Ā of a semialgebraic set A is semialgebraic. Proof: observe that Ā = {y ∈ R^n : ∀ε ∈ ]0,+∞[, ∃x ∈ A, Σ_{i=1}^n (x_i − y_i)² < ε²}. Set B = {(x, ε, y) ∈ A × ]0,+∞[ × R^n : Σ_{i=1}^n (x_i − y_i)² < ε²}, then Ā = R^n \ Π₂( (]0,+∞[ × R^n) \ Π₁(B) ), where Π₁(x, ε, y) = (ε, y) and Π₂(ε, y) = y. Each operation (intersection with a polynomial inequality set, projection, complement) preserves semialgebraicity.
10 Other examples. Is the derivative semialgebraic? Take f : U ⊂ R^n → R^m differentiable and semialgebraic, and write L = df(x), so that f(x + h) = f(x) + Lh + o(‖h‖). Then graph df = {(x, L) ∈ U × R^{m×n} : ∀ε > 0, ∃δ > 0, ∀h (‖h‖ < δ, x + h ∈ U) ⟹ ‖f(x + h) − f(x) − Lh‖ ≤ ε‖h‖} is described by a first-order formula with semialgebraic data, hence df is semialgebraic. Exercise: show that g(x) = max{f(x, y) : y ∈ K} is semialgebraic whenever f and K are.
11 Examples. The following problems are all semialgebraic: min ½‖Ax − b‖² + λ‖x‖_p (p rational); min{ ½‖A(X) − B‖² + λ rank(X) : X matrix }; min{ ½‖AB − M‖² : A ≥ 0, B ≥ 0 }; min{ ½‖AB − M‖² : A ≥ 0, B ≥ 0, ‖A‖₀ ≤ r, ‖B‖₀ ≤ s }; ...
12 Fréchet subdifferential. f : R^n → R ∪ {+∞}. First-order formulation: let x ∈ dom f. (Fréchet) subdifferential: p ∈ ∂̂f(x) iff f(u) ≥ f(x) + ⟨p, u − x⟩ + o(‖u − x‖), ∀u ∈ R^n. (Picture: the graph of u ↦ f(u) lies above the affine function u ↦ f(x) + ⟨p, u − x⟩ up to first order.)
13 Subdifferential. Definition (limiting subdifferential): it is denoted by ∂f and defined through x* ∈ ∂f(x) iff there exist (x_k, x_k*) → (x, x*) such that f(x_k) → f(x) and, for each k, f(u) ≥ f(x_k) + ⟨x_k*, u − x_k⟩ + o(‖u − x_k‖), i.e. x_k* ∈ ∂̂f(x_k). Example: f(x) = ‖x‖₁, ∂f(0) = B_∞ = [−1, 1]^n. Set ‖∂f(x)‖₋ = min{‖x*‖ : x* ∈ ∂f(x)}. Properties (critical point). Fermat's rule: if f has a minimizer at x then ∂f(x) ∋ 0. Conversely, when 0 ∈ ∂f(x), the point x is called critical.
14 An elementary remedy to gradient oscillation: sharpness. A function f : R^n → R ∪ {+∞} is called sharp on the slice [r₀ < f < r₁] := {x ∈ R^n : r₀ < f(x) < r₁} if there exists c > 0 such that ‖∂f(x)‖₋ ≥ c > 0, ∀x ∈ [r₀ < f < r₁]. Basic example: f(x) = ‖x‖. Many works since '78: Rockafellar, Polyak, Ferris, Burke and many others.
15 Sharpness: example. Nonconvex illustration with a continuum of minimizers. (Figure.)
16 Finite convergence. Why is sharpness interesting? Gradient curves reach the valley within a finite time. The phenomenon is easy to see on proximal descent. Prox descent = formal setting for the implicit gradient step: x⁺ ∈ x − step · ∂f(x⁺). Prox operator: prox_{sf}(x) = argmin{f(u) + (1/2s)‖u − x‖² : u ∈ R^n} (Moreau), so x⁺ = prox_{step·f}(x). When f is sharp, convergence occurs in finite time!!
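The finite-time phenomenon can be watched numerically. A minimal sketch (not from the slides; step size made up) of proximal descent on the sharp function f(x) = |x|, whose prox is soft-thresholding:

```python
# Proximal descent on the sharp function f(x) = |x|.
# prox_{s|.|}(x) = sign(x) * max(|x| - s, 0)  (soft-thresholding).

def prox_abs(x, s):
    """Proximal operator of s*|.| evaluated at x."""
    return (1.0 if x > 0 else -1.0) * max(abs(x) - s, 0.0)

x, step = 1.0, 0.3
iterates = [x]
for _ in range(10):
    x = prox_abs(x, step)
    iterates.append(x)

# The minimizer x = 0 is reached EXACTLY after finitely many steps
# (here 4), as sharpness predicts: each non-critical step decreases
# f by at least delta = 0.5 * c**2 * step, with c = 1.
print(iterates)
```

With step = 0.3 the iterates shrink by 0.3 per step until they hit 0 and stay there, matching the proof on the next slide.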
17 Proof. Assume f is sharp; then, as long as x_{k+1} is non-critical, f(x_{k+1}) ≤ f(x_k) − δ. Write (assuming a constant step): f(x_{k+1}) ≤ f(x_k) − (1/(2·step))‖x_{k+1} − x_k‖², while (x_k − x_{k+1})/step ∈ ∂f(x_{k+1}) gives ‖x_{k+1} − x_k‖ ≥ step · ‖∂f(x_{k+1})‖₋ ≥ c · step, thus f(x_{k+1}) ≤ f(x_k) − c² · step / 2. Set δ = 0.5 c² · step.
18 Measuring the lack of sharpness. 0 is a critical value of f (true up to a translation). Set [0 < f < r₀] := {x ∈ R^n : 0 < f(x) < r₀} and assume that there are no critical points in [0 < f < r₀]. f has the KL property on [0 < f < r₀] if there exists a function g : [0 < f < r₀] → R ∪ {+∞} whose sublevel sets are those of f, i.e. the sets [f ≤ r], r ∈ (0, r₀), and such that ‖∂g(x)‖₋ ≥ 1 for all x in [0 < f < r₀]. Example (concentric circles): f(x) = ‖x‖² and g(x) = ‖x‖.
19 Formal KL property. Formally, a desingularizing function on (0, r₀) is ϕ ∈ C([0, r₀), R₊), concave, with ϕ ∈ C¹(0, r₀), ϕ′ > 0 and ϕ(0) = 0. Definition: f has the KL property on [0 < f < r₀] if there exists a desingularizing function ϕ such that ‖∂(ϕ ∘ f)(x)‖₋ ≥ 1, ∀x ∈ [0 < f < r₀]. Local version: replace [0 < f < r₀] by the intersection of [0 < f < r₀] with a closed ball.
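A one-line sanity check, added here for the concentric-circles example above: ϕ(s) = √s desingularizes f(x) = ‖x‖², since

```latex
f(x) = \|x\|^2,\quad \varphi(s) = \sqrt{s},\qquad
\nabla(\varphi \circ f)(x) = \varphi'\bigl(f(x)\bigr)\,\nabla f(x)
 = \frac{1}{2\|x\|}\, 2x = \frac{x}{\|x\|},
\qquad \|\nabla(\varphi \circ f)(x)\| = 1 \ \ge\ 1 .
```

So ϕ ∘ f = ‖·‖ is sharp on every slice [0 < f < r₀]: reparametrizing the values of f restores sharpness.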
20 Key results. Theorem [Łojasiewicz (Hörmander) 1968; Kurdyka '98]: if f is analytic, or smooth and semialgebraic, it satisfies the Łojasiewicz inequality around each point of R^n. Nonsmooth semialgebraic/subanalytic case, 2006 [B-Daniilidis-Lewis]: take f : R^n → R ∪ {+∞} lower semicontinuous and semialgebraic; then f has the KL property around each point. Many, many functions satisfy the KL inequality (B-Daniilidis-Lewis-Shiota 2007, Attouch-B-Redont, ...). More to come.
21 Descent at large. Let f : R^n → R ∪ {+∞} be a proper lower semicontinuous function and a, b > 0. Let (x_k) be a sequence in dom f such that: Sufficient decrease condition: f(x_{k+1}) + a‖x_{k+1} − x_k‖² ≤ f(x_k), k ≥ 0. Relative error condition: for each k ∈ N, there exists w_{k+1} ∈ ∂f(x_{k+1}) such that ‖w_{k+1}‖ ≤ b‖x_{k+1} − x_k‖.
22 An abstract convergence theorem. Theorem (Attouch-B-Svaiter 2012): let f be a KL function and (x_k) a descent sequence for f. If (x_k) is bounded then it converges to a critical point of f. Corollary: let f be a coercive KL function and (x_k) a descent sequence for f. Then (x_k) converges to a critical point of f.
23 Rate of convergence. Let x_∞ be the (unique) limit point of (x_k). Assume that ϕ(s) = c s^{1−θ} with c > 0, θ ∈ [0, 1) (the model case). Theorem (Attouch-B 2009): (i) if θ = 0, the sequence (x_k) converges in a finite number of steps; (ii) if θ ∈ (0, 1/2], then there exist c′ > 0 and Q ∈ [0, 1) such that ‖x_k − x_∞‖ ≤ c′ Q^k; (iii) if θ ∈ (1/2, 1), then there exists c′ > 0 such that ‖x_k − x_∞‖ ≤ c′ k^{−(1−θ)/(2θ−1)}.
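In the smooth case, the assumption ϕ(s) = c s^{1−θ} is exactly the classical Łojasiewicz gradient inequality with exponent θ; the translation (a routine computation, added here) reads:

```latex
\varphi'(s) = c(1-\theta)\,s^{-\theta}
\;\Longrightarrow\;
\|\nabla(\varphi\circ f)(x)\|
 = c(1-\theta)\,f(x)^{-\theta}\,\|\nabla f(x)\| \ \ge\ 1
\;\Longleftrightarrow\;
\|\nabla f(x)\| \ \ge\ \frac{f(x)^{\theta}}{c(1-\theta)} .
```

Smaller θ means a stronger inequality near the critical value, hence the faster rates in the theorem.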
24 (Nonconvex) forward-backward splitting algorithm. Minimizing a nonsmooth + smooth structure: f = g + h, with h ∈ C¹ and ∇h L-Lipschitz continuous; g lsc, bounded from below, and with an easily computable prox. Forward-backward splitting (Lions-Mercier '79): let γ_k be such that 0 < γ ≤ γ_k ≤ γ̄ < 1/L, and x_{k+1} ∈ prox_{γ_k g}(x_k − γ_k ∇h(x_k)). Theorem: if the problem is coercive and semialgebraic, (x_k) is a converging sequence.
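As an illustration of the scheme above, here is a minimal forward-backward sketch (all data, sizes and the regularization weight are made up) for the convex model g = λ‖·‖₁, h = ½‖Ax − b‖², whose prox is soft-thresholding:

```python
import numpy as np

# Forward-backward splitting for f = g + h with g = lam*||.||_1
# and h = 0.5*||Ax - b||^2 (illustrative data).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
b = rng.standard_normal(20)
lam = 0.1

L = np.linalg.norm(A.T @ A, 2)    # Lipschitz constant of grad h
gamma = 0.9 / L                   # fixed step, 0 < gamma < 1/L

def prox_l1(x, t):
    """prox of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def obj(x):
    return lam * np.abs(x).sum() + 0.5 * np.linalg.norm(A @ x - b) ** 2

x = np.zeros(50)
objs = [obj(x)]
for _ in range(300):
    x = prox_l1(x - gamma * (A.T @ (A @ x - b)), gamma * lam)
    objs.append(obj(x))
# With gamma < 1/L the objective decreases monotonically (sufficient
# decrease), the key ingredient of the descent framework above.
```

Here the problem is convex, so convergence is classical; the point of the KL theory is that the same iteration still converges for nonconvex semialgebraic g, as on the next slides.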
25 Sparse solutions of under-determined systems: min λ‖x‖₀ + ½‖Ax − b‖² (Blumensath-Davies 2009). Forward-backward splitting: x_{k+1} ∈ prox_{γ_k λ‖·‖₀}(x_k − γ_k(AᵀAx_k − Aᵀb)), where 0 < γ ≤ γ_k ≤ γ̄ < 1/‖AᵀA‖_F. Theorem (Attouch-B-Svaiter): the sequence (x_k) converges to a critical point of λ‖x‖₀ + ½‖Ax − b‖².
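A hedged sketch of this iteration (random, made-up data): the prox of γλ‖·‖₀ is hard thresholding, zeroing every entry with |x_i| ≤ √(2γλ):

```python
import numpy as np

# Iterative hard thresholding = forward-backward splitting applied to
# lam*||x||_0 + 0.5*||Ax - b||^2 (illustrative data, planted sparse signal).
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
lam = 0.05

gamma = 0.9 / np.linalg.norm(A.T @ A, 2)

def prox_l0(x, t):
    """prox of t*||.||_0: zero every entry with |x_i| <= sqrt(2t)."""
    y = x.copy()
    y[np.abs(y) <= np.sqrt(2.0 * t)] = 0.0
    return y

def obj(x):
    return lam * np.count_nonzero(x) + 0.5 * np.linalg.norm(A @ x - b) ** 2

x = np.zeros(60)
objs = [obj(x)]
for _ in range(1000):
    x = prox_l0(x - gamma * (A.T @ (A @ x - b)), gamma * lam)
    objs.append(obj(x))
# Despite the nonconvex, discontinuous l0 term, the full objective still
# decreases monotonically for gamma < 1/L, and the KL theorem above
# guarantees convergence to a critical point.
```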
26 Low rank solutions of under-determined systems. rank M = rank of M, or a semialgebraic (or definable) surrogate of the rank whose prox is easily computable; see Hiriart-Urruty's talk. min λ rank(M) + ½‖A(M) − B‖², where A is a linear operator. Forward-backward splitting: M_{k+1} ∈ prox_{γ_k λ rank}(M_k − γ_k(AᵀA M_k − AᵀB)), where 0 < γ ≤ γ_k ≤ γ̄ < 1/‖AᵀA‖_F. Theorem: the sequence (M_k) converges to a critical point of λ rank(M) + ½‖A(M) − B‖².
27 Gradient projection. C a semialgebraic set, f a semialgebraic function. x_{k+1} ∈ P_C(x_k − γ_k ∇f(x_k)). Theorem: the sequence (x_k) converges to a critical point of the problem min_C f.
28 Gradient projection (example). ½‖A(M) − B‖² + i_{C_s}(M), with C_s the set of matrices of rank at most s. P_{C_s}(M) is known in closed form: threshold the inner diagonal matrix in the SVD (keep the s largest singular values). Gradient-projection method: M_{k+1} ∈ P_{C_s}(M_k − γ_k(AᵀA M_k − AᵀB)), where γ_k ∈ (ε, 1/‖AᵀA‖_F). No oscillations: M_k converges to a critical point!!
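The closed-form projection P_{C_s} can be sketched in a few lines (an illustrative implementation with made-up data, not the slides' code):

```python
import numpy as np

def project_rank(M, s):
    """Project M onto the set C_s of matrices of rank <= s by zeroing
    all but the s largest singular values in the SVD (Eckart-Young)."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    sv[s:] = 0.0                 # hard-threshold the inner diagonal
    return (U * sv) @ Vt         # == U @ diag(sv) @ Vt

rng = np.random.default_rng(2)
M = rng.standard_normal((8, 6))
Ms = project_rank(M, 2)          # nearest rank-2 matrix in Frobenius norm
```

By Eckart-Young, the Frobenius error ‖M − Ms‖ equals the root of the sum of the squared discarded singular values, which is what makes the projection both exact and cheap.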
29 Another illustration: averaged projections. x_{k+1} ∈ (1 − θ) x_k + θ (1/p) Σ_{i=1}^p P_{F_i}(x_k) + ε_k, where θ ∈ (0, 1) and ‖ε_k‖ ≤ M ‖x_{k+1} − x_k‖. Theorem (inexact averaged projection method): let F_1, ..., F_p be semialgebraic sets with C² boundaries, or convex, which satisfy ∩_{i=1}^p F_i ≠ ∅. If x_0 is sufficiently close to ∩_{i=1}^p F_i, then x_k converges to a feasible point x_∞, i.e. such that x_∞ ∈ ∩_{i=1}^p F_i.
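A toy instance of the averaged-projection iteration (two lines in R² meeting only at the origin; θ and the sets are chosen arbitrarily for illustration, with ε_k = 0):

```python
import numpy as np

# Averaged projections for F1 = x-axis and F2 = diagonal {x1 = x2};
# both projections are available in closed form.

def P1(x):
    """Projection onto {x : x[1] = 0}."""
    return np.array([x[0], 0.0])

def P2(x):
    """Projection onto {x : x[0] = x[1]}."""
    m = 0.5 * (x[0] + x[1])
    return np.array([m, m])

theta = 0.5
x = np.array([3.0, 4.0])
for _ in range(500):
    # x_{k+1} = (1 - theta) x_k + theta * (1/p) sum_i P_{F_i}(x_k), p = 2
    x = (1 - theta) * x + theta * 0.5 * (P1(x) + P2(x))
# x converges to the unique feasible point, the origin.
```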
30 Alternating versions of the prox algorithm. F : R^{m_1} × ... × R^{m_p} → R ∪ {+∞} lsc, with the structure F(x_1, ..., x_p) = Σ_{i=1}^p f_i(x_i) + Q(x_1, ..., x_p), where the f_i are proper lsc and Q is C¹. Gauss-Seidel method / prox (Auslender 1993, Grippo-Sciandrone 2000): x_1^{k+1} ∈ argmin_{u ∈ R^{m_1}} F(u, x_2^k, ..., x_p^k) + (1/(2μ_1^k))‖u − x_1^k‖²; ...; x_p^{k+1} ∈ argmin_{u ∈ R^{m_p}} F(x_1^{k+1}, ..., x_{p−1}^{k+1}, u) + (1/(2μ_p^k))‖u − x_p^k‖².
32 Convergence. Theorem (Attouch-B.-Redont-Soubeyran 2010; B.-Combettes-Pesquet 2010): assume that F is a KL function (e.g. the f_i and Q are semialgebraic); then the sequence (x_1^k, ..., x_p^k) either satisfies ‖(x_1^k, ..., x_p^k)‖ → +∞ or converges to a critical point of F. Note that convergence automatically happens if either one of the f_i or Q is coercive.
33 Proximal alternating linearized method (PALM). F(x) = Q(x_1, x_2) + f_1(x_1) + f_2(x_2), where for each x_1, Q(x_1, ·) is C¹ with an L_2(x_1) > 0 Lipschitz continuous gradient (and likewise in x_1 with constant L_1(x_2)). Iterations: x_1^{k+1} ∈ prox_{(1/L_1(x_2^k)) f_1}( x_1^k − (1/L_1(x_2^k)) ∇_{x_1} Q(x_1^k, x_2^k) ); x_2^{k+1} ∈ prox_{(1/L_2(x_1^{k+1})) f_2}( x_2^k − (1/L_2(x_1^{k+1})) ∇_{x_2} Q(x_1^{k+1}, x_2^k) ). Set x^{k+1} = (x_1^{k+1}, x_2^{k+1}). Theorem (convergence of PALM; B., Teboulle, Sabach): any bounded sequence generated by PALM converges to a critical point of F.
34 Sparse Nonnegative Matrix Factorization. Consider in NMF some sparsity measure, e.g. ‖X‖₀ = number of nonzero entries: min{ ½‖A − XY‖²_F : X ≥ 0, ‖X‖₀ ≤ r, Y ≥ 0, ‖Y‖₀ ≤ s }.
35 PALM for sparse NMF. 1. Take X⁰ ∈ R₊^{m×r} and Y⁰ ∈ R₊^{r×n}. 2. Generate a sequence (X^k, Y^k)_{k∈N}: 2.1. Take γ_1 > 1, set c_k = γ_1‖Y^k(Y^k)ᵀ‖_F and compute U^k = X^k − (1/c_k)(X^k Y^k − A)(Y^k)ᵀ; X^{k+1} ∈ prox_{1/c_k}(U^k) = T_α(P₊(U^k)). 2.2. Take γ_2 > 1, set d_k = γ_2‖X^{k+1}(X^{k+1})ᵀ‖_F and compute V^k = Y^k − (1/d_k)(X^{k+1})ᵀ(X^{k+1} Y^k − A); Y^{k+1} ∈ prox_{1/d_k}(V^k) = T_β(P₊(V^k)). We thus get the desired convergence result from our general theorem.
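The recipe above can be sketched as follows (a minimal, illustrative implementation: sizes, the γ's and the sparsity levels α, β are made up; T_α keeps the α largest entries, P₊ is the nonnegative projection, and the initial point is projected onto the constraint set so the PALM decrease property applies from the first step):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 8, 7, 2
A = rng.random((m, r)) @ rng.random((r, n))   # target matrix (made up)
alpha, beta = 10, 10                          # sparsity levels for X, Y
g1 = g2 = 1.1                                 # gamma_1, gamma_2 > 1

def T_keep(M, s):
    """T_s: zero all but the s largest entries of M (hard sparsity)."""
    flat = M.flatten()
    if s < flat.size:
        flat[np.argsort(np.abs(flat))[: flat.size - s]] = 0.0
    return flat.reshape(M.shape)

def objective(X, Y):
    return 0.5 * np.linalg.norm(A - X @ Y) ** 2

# Feasible initialization: nonnegative and sparse.
X = T_keep(rng.random((m, r)), alpha)
Y = T_keep(rng.random((r, n)), beta)

objs = [objective(X, Y)]
for _ in range(200):
    # X-block: c_k = g1 * ||Y Y^T||_F (floor avoids division by zero)
    c = g1 * max(np.linalg.norm(Y @ Y.T), 1e-12)
    U = X - (X @ Y - A) @ Y.T / c
    X = T_keep(np.maximum(U, 0.0), alpha)      # T_alpha(P_+(U))
    # Y-block: d_k = g2 * ||X X^T||_F
    d = g2 * max(np.linalg.norm(X @ X.T), 1e-12)
    V = Y - X.T @ (X @ Y - A) / d
    Y = T_keep(np.maximum(V, 0.0), beta)       # T_beta(P_+(V))
    objs.append(objective(X, Y))
```

Since ‖·‖_F dominates the spectral norm, c_k and d_k over-estimate the block Lipschitz constants, so each block update decreases the objective, exactly the sufficient-decrease condition of the abstract theorem.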
36 Some complications: an SQP method. We wish to solve min{f(x) : x ∈ R^n, f_i(x) ≤ 0, i = 1, ..., m}, which can be written min{f + i_C} with C = {x : f_i(x) ≤ 0, ∀i}. We assume: ∇f is L_f-Lipschitz and, for each i, ∇f_i is L_{f_i}-Lipschitz continuous. Prox operators here are out of reach: too complex!!
37 Moving balls Other Classical Sequential Quadratic Programming (SQP) idea, replace by some simple approximations. Moving balls method (Auslender-Shefi-Teboulle, Math. Prog., 2010) min x R n f (x k) + f (x k )(x x k )+ L f 2 x x k 2 f 1 (x k ) + f 1(x k )(x x k )+ L f 1 2 x x k f m (x k ) + f m(x k )(x x k )+ L f m 2 x x k 2 0 Bad surprise: x k is not a descent sequence for f + i C.
38 Moving balls. Introduce the value function
val(x) = min_{y ∈ R^n} f(x) + ⟨∇f(x), y − x⟩ + (L_f/2)‖y − x‖²
s.t. f_i(x) + ⟨∇f_i(x), y − x⟩ + (L_{f_i}/2)‖y − x‖² ≤ 0, i = 1, ..., m.
Good surprise: (x_k) is a descent sequence for val. Theorem: assume the Mangasarian-Fromovitz qualification condition. If (x_k) is bounded, it converges to a KKT point of the original problem.
More informationA proximal-newton method for monotone inclusions in Hilbert spaces with complexity O(1/k 2 ).
H. ATTOUCH (Univ. Montpellier 2) Fast proximal-newton method Sept. 8-12, 2014 1 / 40 A proximal-newton method for monotone inclusions in Hilbert spaces with complexity O(1/k 2 ). Hedy ATTOUCH Université
More informationNon-smooth Non-convex Bregman Minimization: Unification and New Algorithms
JOTA manuscript No. (will be inserted by the editor) Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms Peter Ochs Jalal Fadili Thomas Brox Received: date / Accepted: date Abstract
More informationOuter limits of subdifferentials for min-max type functions
Outer limits of subdifferentials for min-max type functions Andrew Eberhard, Vera Roshchina, Tian Sang August 31, 2017 arxiv:1701.02852v2 [math.oc] 30 Aug 2017 Abstract We generalise the outer subdifferential
More informationarxiv: v3 [math.oc] 27 Oct 2016
A Multi-step Inertial Forward Backward Splitting Method for Non-convex Optimization arxiv:1606.02118v3 [math.oc] 27 Oct 2016 Jingwei Liang Jalal M. Fadili Gabriel Peyré Abstract In this paper, we propose
More informationLARGE SCALE COMPOSITE OPTIMIZATION PROBLEMS WITH COUPLED OBJECTIVE FUNCTIONS: THEORY, ALGORITHMS AND APPLICATIONS CUI YING. (B.Sc.
LARGE SCALE COMPOSITE OPTIMIZATION PROBLEMS WITH COUPLED OBJECTIVE FUNCTIONS: THEORY, ALGORITHMS AND APPLICATIONS CUI YING (B.Sc., ZJU, China) A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationRecent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables
Recent Developments of Alternating Direction Method of Multipliers with Multi-Block Variables Department of Systems Engineering and Engineering Management The Chinese University of Hong Kong 2014 Workshop
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More information6-1 The Positivstellensatz P. Parrilo and S. Lall, ECC
6-1 The Positivstellensatz P. Parrilo and S. Lall, ECC 2003 2003.09.02.10 6. The Positivstellensatz Basic semialgebraic sets Semialgebraic sets Tarski-Seidenberg and quantifier elimination Feasibility
More informationSubdifferential representation of convex functions: refinements and applications
Subdifferential representation of convex functions: refinements and applications Joël Benoist & Aris Daniilidis Abstract Every lower semicontinuous convex function can be represented through its subdifferential
More informationPenalty PALM Method for Sparse Portfolio Selection Problems
Penalty PALM Method for Sparse Portfolio Selection Problems Yue Teng a, Li Yang b, Bo Yu a and Xiaoliang Song a a School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, P.R. China;
More informationExistence and Approximation of Fixed Points of. Bregman Nonexpansive Operators. Banach Spaces
Existence and Approximation of Fixed Points of in Reflexive Banach Spaces Department of Mathematics The Technion Israel Institute of Technology Haifa 22.07.2010 Joint work with Prof. Simeon Reich General
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationAmir Beck August 7, Abstract. 1 Introduction and Problem/Model Formulation. In this paper we consider the following minimization problem:
On the Convergence of Alternating Minimization for Convex Programming with Applications to Iteratively Reweighted Least Squares and Decomposition Schemes Amir Beck August 7, 04 Abstract This paper is concerned
More informationOPTIMALITY CONDITIONS FOR GLOBAL MINIMA OF NONCONVEX FUNCTIONS ON RIEMANNIAN MANIFOLDS
OPTIMALITY CONDITIONS FOR GLOBAL MINIMA OF NONCONVEX FUNCTIONS ON RIEMANNIAN MANIFOLDS S. HOSSEINI Abstract. A version of Lagrange multipliers rule for locally Lipschitz functions is presented. Using Lagrange
More informationA Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions
A Unified Analysis of Nonconvex Optimization Duality and Penalty Methods with General Augmenting Functions Angelia Nedić and Asuman Ozdaglar April 16, 2006 Abstract In this paper, we study a unifying framework
More informationCourse Summary Math 211
Course Summary Math 211 table of contents I. Functions of several variables. II. R n. III. Derivatives. IV. Taylor s Theorem. V. Differential Geometry. VI. Applications. 1. Best affine approximations.
More informationCompressed Sensing: a Subgradient Descent Method for Missing Data Problems
Compressed Sensing: a Subgradient Descent Method for Missing Data Problems ANZIAM, Jan 30 Feb 3, 2011 Jonathan M. Borwein Jointly with D. Russell Luke, University of Goettingen FRSC FAAAS FBAS FAA Director,
More informationTaylor-like models in nonsmooth optimization
Taylor-like models in nonsmooth optimization Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with Ioffe (Technion), Lewis (Cornell), and Paquette (UW) SIAM Optimization 2017 AFOSR,
More informationA proximal minimization algorithm for structured nonconvex and nonsmooth problems
A proximal minimization algorithm for structured nonconvex and nonsmooth problems Radu Ioan Boţ Ernö Robert Csetnek Dang-Khoa Nguyen May 8, 08 Abstract. We propose a proximal algorithm for minimizing objective
More informationProximal Newton Method. Zico Kolter (notes by Ryan Tibshirani) Convex Optimization
Proximal Newton Method Zico Kolter (notes by Ryan Tibshirani) Convex Optimization 10-725 Consider the problem Last time: quasi-newton methods min x f(x) with f convex, twice differentiable, dom(f) = R
More information1 Strict local optimality in unconstrained optimization
ORF 53 Lecture 14 Spring 016, Princeton University Instructor: A.A. Ahmadi Scribe: G. Hall Thursday, April 14, 016 When in doubt on the accuracy of these notes, please cross check with the instructor s
More informationInequality Constraints
Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the
More informationOptimality Conditions for Constrained Optimization
72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)
More information1 Introduction and preliminaries
Proximal Methods for a Class of Relaxed Nonlinear Variational Inclusions Abdellatif Moudafi Université des Antilles et de la Guyane, Grimaag B.P. 7209, 97275 Schoelcher, Martinique abdellatif.moudafi@martinique.univ-ag.fr
More information