Sequential convex programming: value function and convergence
Sequential convex programming: value function and convergence
Edouard Pauwels, joint work with Jérôme Bolte
Journées MODE, Toulouse, March 2016
Introduction

Local search methods for finite-dimensional nonconvex optimization.

Sequential convex programming (SCP): $x_{k+1} = \arg\min_y h(y, x_k)$.

Convergence of the iterates to a critical point?

[Figure: three possible behaviors of the iterates $x_k$: non-convergence, jittering, convergence. A generic loop sketch follows.]
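The SCP iteration is a fixed-point loop around a convex-programming oracle. A minimal sketch (mine, not from the slides; `convex_model_argmin` is a hypothetical oracle standing for any routine returning $\arg\min_y h(y, x_k)$ for the chosen convex model $h$):

```python
# A sketch of the abstract SCP loop; `convex_model_argmin` is hypothetical.
import numpy as np

def scp(x0, convex_model_argmin, n_iter=100, tol=1e-9):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_next = convex_model_argmin(x)       # x_{k+1} = argmin_y h(y, x_k)
        if np.linalg.norm(x_next - x) < tol:  # a fixed point is the candidate
            break                             # for a critical point
        x = x_next
    return x
```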
The prox-friendly setting: limitations

Non-smooth part (constraints or more general).

Favorable geometries:
- Proximal splitting: Bruck, Passty, Combettes, Pesquet, Nesterov, Beck, Teboulle, Eckstein, Tseng...
- Convergence with semi-algebraic data: Bolte, Attouch, Noll, Rondepierre, Chouzenoux, Pesquet, Repetti, Lewis, Sabach, Teboulle...

Not all problems have tractable proximal operators. With complex geometries, the nonsmoothness / constraints must be approximated:
- LP, QP, SDP: convex programming oracles.
- A large field (SQP, SQCQP, Gauss-Newton, ...).
- The prox-friendly analysis does not apply directly: more approximations mean more sources of oscillation, and convergence is barely understood.

Convergence to critical points in non-prox-friendly settings?
Outline
1. Existing results: gradient methods with semi-algebraic data
2. Complex geometries: sequential convex programming
3. Implicit gradient steps: the value function
Favorable geometries: gradient methods

Gradient descent: $\min_x f(x)$ ($f$ smooth):
$s(x_k - x_{k+1}) = \nabla f(x_k)$.

Proximal point: $\min_x g(x)$ ($g$ non-smooth):
$x_{k+1} = \mathrm{prox}_{g/s}(x_k)$, so $s(x_k - x_{k+1}) \in \partial g(x_{k+1})$
(here $\mathrm{prox}_{g/s}(x) = \arg\min_y g(y) + \tfrac{s}{2}\|y - x\|_2^2$).

Forward-backward: $\min_x f(x) + g(x)$ ($f$ smooth, $g$ non-smooth):
$x_{k+1} = \mathrm{prox}_{g/s}\bigl(x_k - \tfrac{1}{s}\nabla f(x_k)\bigr)$, so $s(x_k - x_{k+1}) \in \partial g(x_{k+1}) + \nabla f(x_k)$.

In each case there is a relation between $x_k - x_{k+1}$ and a (sub)gradient of the objective; a numeric check of the forward-backward inclusion follows.
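A minimal numpy sketch (mine, not from the talk; the data $A$, $b$ and the weight `lam` are illustrative) verifying the forward-backward inclusion $s(x_k - x_{k+1}) \in \partial g(x_{k+1}) + \nabla f(x_k)$ for $f(x) = \tfrac12\|Ax - b\|^2$ and $g = \lambda\|\cdot\|_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 5)), rng.standard_normal(20), 0.1
s = np.linalg.norm(A.T @ A, 2)           # s >= Lipschitz constant of grad f

def grad_f(x):
    return A.T @ (A @ x - b)

def prox_g(x, t):                        # prox of t*lam*||.||_1: soft-threshold
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

x = np.zeros(5)
for k in range(200):
    x_new = prox_g(x - grad_f(x) / s, 1.0 / s)
    # the residual s*(x_k - x_{k+1}) - grad f(x_k) must lie in the l1 subdifferential
    r = s * (x - x_new) - grad_f(x)
    assert np.all(np.abs(r) <= lam + 1e-8)                          # |r_i| <= lam
    assert np.allclose(r[x_new != 0], lam * np.sign(x_new[x_new != 0]))
    x = x_new
```

The two asserts hold at every iteration by prox optimality alone, independently of the step size; they are exactly the slide's relation between $x_k - x_{k+1}$ and the subgradient.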
KL property (Łojasiewicz '63, Kurdyka '98)

Desingularizing functions on $(0, r_0)$: $\varphi \in C([0, r_0), \mathbb{R}_+)$, $\varphi \in C^1(0, r_0)$, $\varphi' > 0$, $\varphi$ concave and $\varphi(0) = 0$.

[Figure: graph of a desingularizing function $\varphi$.]

Definition. $F_0$ has the KL property at $\bar x$ (with $F_0(\bar x) = 0$) if there exist $\epsilon > 0$ and a desingularizing function $\varphi$ such that
$$\mathrm{dist}\bigl(\partial(\varphi \circ F_0)(x), 0\bigr) \ge 1 \quad \text{for all } x \text{ with } \|x - \bar x\| \le \epsilon,\ F_0(\bar x) < F_0(x) < F_0(\bar x) + \epsilon.$$
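A worked instance may help (my illustration, not from the slides): the Łojasiewicz inequality for $F_0(x) = x^2$ at $\bar x = 0$, with desingularizer $\varphi(r) = \sqrt{r}$.

```latex
\[
(\varphi \circ F_0)(x) = \sqrt{x^2} = |x|,
\qquad
\bigl|\nabla(\varphi \circ F_0)(x)\bigr|
  = \bigl|\varphi'(x^2)\, F_0'(x)\bigr|
  = \frac{|2x|}{2\sqrt{x^2}} = 1 \;\ge\; 1
\quad \text{for all } x \neq 0.
\]
```

Composing with $\varphi$ turns the flat valley of $x^2$ into the sharp function $|x|$, whose gradient is bounded away from zero off the minimizer: this is the sharpening the next slide illustrates.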
Illustration

[Figure: $F_0$ versus $\varphi \circ F_0$: reparameterizing with $\varphi$ sharpens the function.]

Theorem (Bolte-Daniilidis-Lewis, 2006). The KL inequality holds for all lower semicontinuous semi-algebraic functions (and many more).
Finite length property (general recipe)
(Attouch, Bolte, Svaiter, Sabach, Teboulle, ...)

For some $A, B > 0$:
- Sufficient decrease: $f(x_{k+1}) + A\|x_{k+1} - x_k\|^2 \le f(x_k)$.
- Step length: $\mathrm{dist}\bigl(0, \partial f(x_k)\bigr) \le B\|x_{k+1} - x_k\|$.
- Tameness: $f$ is semi-algebraic (KL inequality).
- Coercivity: $\{x \,;\, f(x) \le f(x_0)\}$ is compact.

Then finite length: $\sum_k \|x_{k+1} - x_k\|$ is bounded, and $\{x_k\}$ converges to a critical point (a sketch of the argument follows below).

Remark: there exist counterexamples for functions which are not semi-algebraic.
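A condensed version of the standard KL telescoping argument (my summary, not verbatim from the slides): set $f^* = \lim_k f(x_k)$ and $\Delta_k = \varphi(f(x_k) - f^*)$. Concavity of $\varphi$, then sufficient decrease, then the KL inequality combined with the step-length bound give

```latex
\begin{align*}
\Delta_k - \Delta_{k+1}
  &\ge \varphi'\bigl(f(x_k) - f^*\bigr)\,\bigl(f(x_k) - f(x_{k+1})\bigr)
   \ge \varphi'\bigl(f(x_k) - f^*\bigr)\, A\,\|x_{k+1} - x_k\|^2 \\
  &\ge \frac{A\,\|x_{k+1} - x_k\|^2}{B\,\|x_{k+1} - x_k\|}
   = \frac{A}{B}\,\|x_{k+1} - x_k\|.
\end{align*}
```

Summing over $k$ telescopes: $\sum_k \|x_{k+1} - x_k\| \le (B/A)\,\Delta_0 < \infty$, so $(x_k)$ is Cauchy and converges, and the step-length bound forces the limit to be critical.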
Outline
1. Existing results: gradient methods with semi-algebraic data
2. Complex geometries: sequential convex programming
3. Implicit gradient steps: the value function
Examples in nonlinear programming

Approximate local models: approximation and localization.

Exact penalization: $\min_x f(x) + \beta \max_{i=0\ldots m} f_i(x)$ ($f, f_i$ smooth; presumably with the convention $f_0 \equiv 0$, so the penalty is $\max(0, \max_i f_i)$):
$$x_{k+1} = \arg\min_y\; f(x_k) + \langle \nabla f(x_k), y - x_k\rangle + \beta \max_{i=1\ldots m}\bigl\{f_i(x_k) + \langle \nabla f_i(x_k), y - x_k\rangle\bigr\} + s\|y - x_k\|_2^2.$$

Moving ball: $\min_x f(x)$ s.t. $\max_{i=1\ldots m} f_i(x) \le 0$ ($f, f_i$ smooth):
$$x_{k+1} = \arg\min_y\; f(x_k) + \langle \nabla f(x_k), y - x_k\rangle + s\|y - x_k\|_2^2 \quad \text{s.t.} \quad \max_{i=1\ldots m} f_i(x_k) + \langle \nabla f_i(x_k), y - x_k\rangle + s\|y - x_k\|_2^2 \le 0.$$

Gauss-Newton: $\min_x g(F(x))$ ($F$ smooth, $g$ convex):
$$x_{k+1} = \arg\min_y\; g\bigl(F(x_k) + \nabla F(x_k)(y - x_k)\bigr) + s\|y - x_k\|_2^2.$$

A cvxpy sketch of the moving-ball step follows.
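To make the moving-ball step concrete, here is a minimal cvxpy sketch of the iteration (my toy instance, not from the talk: the objective $f$, the single constraint $f_1$ defining the unit ball, and the step parameter $s = 2$ are illustrative choices):

```python
import cvxpy as cp
import numpy as np

def f(x):       return 0.5 * np.sum(x**2) - x[0]
def grad_f(x):  return x - np.array([1.0, 0.0])
# one smooth constraint f_1(x) = ||x||^2 - 1 <= 0 (unit ball)
def f1(x):      return np.sum(x**2) - 1.0
def grad_f1(x): return 2.0 * x

def moving_ball_step(x, s):
    y = cp.Variable(x.shape[0])
    obj = f(x) + grad_f(x) @ (y - x) + s * cp.sum_squares(y - x)
    # the quadratic "ball" constraint; s should dominate half the curvature
    # of f_1 so the ball stays inside the true feasible set
    con = [f1(x) + grad_f1(x) @ (y - x) + s * cp.sum_squares(y - x) <= 0]
    cp.Problem(cp.Minimize(obj), con).solve()
    return y.value

x = np.array([0.0, 0.5])            # feasible start: f1(x) = -0.75 < 0
for k in range(30):
    x = moving_ball_step(x, s=2.0)
print(x, f1(x))                      # iterates stay feasible for the ball
```

With $s$ at least half the Lipschitz constants of the $\nabla f_i$, each subproblem is feasible ($y = x_k$ satisfies the constraint) and every iterate remains feasible for the original problem.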
A gradient method?

[Figure: the objective $H$ and its convex local model $h_s(\cdot, x)$.]

Approximation: $H(x) = \max_{i=1\ldots m} f_i(x)$; $h_s(y, x) = \max_{i=1\ldots m}\bigl\{f_i(x) + \langle \nabla f_i(x), y - x\rangle\bigr\} + s\|y - x\|_2^2$.

$x_{k+1} = \arg\min_y h_s(y, x_k)$: a gradient method?

Main difficulty: tracking the activity $I(x) := \arg\max_i f_i(x)$, which determines the subgradients of $H$. $I(x_{k+1})$ and $I(x_k)$ are very hard to connect, so there is no relation between $x_k - x_{k+1}$ and elements of $\partial H(x_k)$ or $\partial H(x_{k+1})$. The same issue arises for all the previous methods.
Outline
1. Existing results: gradient methods with semi-algebraic data
2. Complex geometries: sequential convex programming
3. Implicit gradient steps: the value function
Toward a link between SCP and gradient methods

Gradient descent is an SCP method: $s(x_k - x_{k+1}) = \nabla f(x_k)$, with
$$x_{k+1} = \arg\min_y\; f(x_k) + \langle \nabla f(x_k), y - x_k\rangle + \tfrac{s}{2}\|y - x_k\|_2^2.$$

An identity of Moreau: for $g$ convex, lower semicontinuous, let $G : x \mapsto \min_y g(y) + \tfrac{1}{2}\|y - x\|_2^2$ (the value function of $\mathrm{prox}_g$). Then
$$x_{k+1} = \mathrm{prox}_g(x_k) = \arg\min_y\; g(y) + \tfrac{1}{2}\|y - x_k\|_2^2 \quad \Longleftrightarrow \quad x_k - x_{k+1} = \nabla G(x_k).$$

A prox step is an implicit gradient step on its value function (a numeric check follows). Can we extend this to more general SCP?
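A quick numeric sanity check of the Moreau identity (my sketch, not from the talk): for $g = \|\cdot\|_1$, the value function $G(x) = \min_y g(y) + \tfrac12\|y - x\|^2$ is the Huber envelope, and $x - \mathrm{prox}_g(x)$ should match the finite-difference gradient of $G$ at $x$:

```python
import numpy as np

def prox_g(x):                        # prox of ||.||_1: soft-threshold at 1
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def G(x):                             # Moreau envelope, evaluated at its minimizer
    y = prox_g(x)
    return np.sum(np.abs(y)) + 0.5 * np.sum((y - x)**2)

rng = np.random.default_rng(1)
x = rng.standard_normal(4) * 3
eps = 1e-6
fd_grad = np.array([(G(x + eps*e) - G(x - eps*e)) / (2*eps)
                    for e in np.eye(4)])
print(np.allclose(x - prox_g(x), fd_grad, atol=1e-5))   # True
```

Even though $g$ is nonsmooth, $G$ is $C^1$, which is precisely why the prox step can be read as a gradient step on $G$.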
SCP: strongly convex tangent approximation, example

Objective (each $f_i$ is $C^2$, semi-algebraic): $H(x) = \max_{i=1\ldots m} f_i(x)$.

[Figure: $H$ and its strongly convex tangent model $h_s(\cdot, x)$.]

Approximation: $h_s(y, x) = \max_{i=1\ldots m}\bigl\{f_i(x) + \langle \nabla f_i(x), y - x\rangle\bigr\} + s\|y - x\|_2^2$;
$$x_{k+1} = \arg\min_y h_s(y, x_k). \qquad (P_s(x_k))$$

The value function: $V_s(x) =$ value of $(P_s(x))$.
- Critical points of $V_s$ are exactly critical points of $H$.
- $\mathrm{dist}\bigl(0, \partial V_s(x_k)\bigr) \le C\|x_{k+1} - x_k\|$ locally.
- $V_s(x_k) + D\|x_k - x_{k+1}\|^2 \le V_s(x_{k-1})$ locally (for suitable $s$).
- $V_s$ is semi-algebraic: nonsmooth KL property.

An implicit (sub)gradient step on the value function: back to charted territory. A sketch of the iteration, tracking $V_s$, follows.
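A sketch of this SCP iteration on $H(x) = \max_i f_i(x)$ that also records the value function $V_s$ (my toy instance; the three quadratic pieces $f_i$ are illustrative):

```python
import cvxpy as cp
import numpy as np

# three smooth pieces f_i(x) = 0.5*||x - c_i||^2 + b_i  (n = 2, m = 3)
C = np.array([[1.0, 0.0], [-1.0, 0.5], [0.0, -1.0]])
b = np.array([0.0, 0.3, -0.2])
fs    = lambda x: 0.5 * np.sum((x - C)**2, axis=1) + b
grads = lambda x: x - C                          # row i = grad f_i(x)

def scp_step(x, s=2.0):
    y = cp.Variable(2)
    tangents = [fs(x)[i] + grads(x)[i] @ (y - x) for i in range(3)]
    model = cp.maximum(*tangents) + s * cp.sum_squares(y - x)
    prob = cp.Problem(cp.Minimize(model))
    prob.solve()
    return y.value, prob.value                   # x_{k+1} and V_s(x_k)

x = np.array([2.0, 2.0])
vals = []
for k in range(25):
    x, v = scp_step(x)
    vals.append(v)
print(np.all(np.diff(vals) <= 1e-6))   # V_s(x_k) decreases along the iterates
```

For $s$ above half the curvature of the $f_i$ (here the Hessians are the identity, so $s = 2$ is ample), the tangent model majorizes $H$ up to a quadratic, which yields the sufficient-decrease property of $V_s$ that the slide states, and the printed check observes it numerically.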
Nonlinear programming with semi-algebraic data

SQP and SQCQP from (Bolte-P. 2014), a general convergence result:
- Exact penalization SQP: S$\ell_1$-QP (Fletcher 1985), ESQM (Auslender 2013).
- Inner approximating methods: Moving Ball (Auslender et al. 2010).

Ongoing work (P. 2016): composite Gauss-Newton (Burke 1985): $\min_x g(F(x))$, with $g : \mathbb{R}^m \to \mathbb{R}$ convex finite valued and $F : \mathbb{R}^n \to \mathbb{R}^m$ of class $C^2$.
$$x_{k+1} = \arg\min_y\; g\bigl(F(x_k) + \nabla F(x_k)(y - x_k)\bigr) + s\|y - x_k\|_2^2.$$
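A sketch of one composite Gauss-Newton step in the same style (my illustration, not the paper's code: $g = \|\cdot\|_1$ and the residual map $F$ below are toy choices):

```python
import cvxpy as cp
import numpy as np

def F(x):  return np.array([x[0]**2 - 1.0, x[0]*x[1], x[1] - 0.5])
def JF(x): return np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 1.0]])

def gn_step(x, s=2.0):
    y = cp.Variable(2)
    lin = F(x) + JF(x) @ (y - x)     # F(x_k) + grad F(x_k)(y - x_k)
    prob = cp.Problem(cp.Minimize(cp.norm1(lin) + s * cp.sum_squares(y - x)))
    prob.solve()
    return y.value

x = np.array([2.0, 2.0])
for k in range(40):
    x = gn_step(x)
print(x, F(x))   # residuals are driven down; here F(x) = 0 is unattainable,
                 # so the iterates settle at a critical point of ||F||_1
```

The inner problem is a small second-order cone program, which is exactly the "convex programming oracle" role the talk assigns to these methods.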
Conclusion

[Figure: non-convergence, jittering, convergence of the iterates.]

First general convergence result for SCP methods (complex geometry). Abstract SCP: strongly convex tangent approximations of a tame objective yield an implicit subgradient method on the value function.

More details: J. Bolte and E. Pauwels. Majorization-minimization procedures and convergence of SQP methods for semi-algebraic and tame programs. Mathematics of Operations Research, 2016.