Advanced Continuous Optimization
University Paris-Saclay, Master Program in Optimization
Advanced Continuous Optimization
J. Ch. Gilbert (INRIA Paris-Rocquencourt), October 17, 2017
Lectures: September 18, 2017 to November 6, 2017. Examination: November 13, 2017.

Outline
1 Background: convex analysis; nonsmooth analysis; optimization; algorithmics.
2 Optimality conditions: first order optimality conditions for (P_G); second order optimality conditions for (P_EI).
3 Linearization methods: the Josephy-Newton algorithm for inclusions; the SQP algorithm in constrained optimization.
4 Semidefinite optimization: problem definition; existence of solution and optimality conditions; an interior point algorithm.
2 / 110
Preliminaries: Outline of the course

Three parts (60h)
1 Advanced Continuous Optimization I (30h)
  Theory and algorithms in smooth nonconvex optimization.
  Goal: improve the basic knowledge in optimization with some advanced concepts and algorithms.
  Period: September 18 to November 6 (examination: November 13).
2 Advanced Continuous Optimization II (20h)
  Implementation of algorithms (in Matlab, on an easy-to-implement concrete problem).
  Goal: improve algorithm understanding.
  Period: November 20 to December 18 (examination: January ??).
3 Specialized course by Jacek Gondzio (10h).
3 / 110

Preliminaries: Some features of the course

Smooth nonconvex optimization. Dealing with the general optimization problem
  (P_G)  inf_{x∈E} f(x)  subject to  c(x) ∈ G,
where f : E → R, c : E → F, and G is a closed convex set in F. The latter includes
- the nonlinear optimization problem
  (P_EI)  inf_x f(x)  subject to  c_E(x) = 0 and c_I(x) ≤ 0,
  where f : E → R, E and I form a partition of [1:m], and c_E : E → R^{m_E} and c_I : E → R^{m_I} are smooth (possibly nonconvex) functions,
- the linear semidefinite optimization problem
  (P_SDO)  inf_{X∈S^n} ⟨C,X⟩  subject to  A(X) = b and X ⪰ 0,
  where C ∈ S^n, A : S^n → R^m is linear, and b ∈ R^m.
4 / 110
Preliminaries: Some features of the course

Finite dimension: E and F are Euclidean vector spaces.
- The unit closed ball is compact.
- A bounded sequence has a converging subsequence.
Nothing is said about optimal control problems (see other courses).
Emphasis is put on
- theory: optimality conditions of problem (P_G) and many other topics,
- algorithms: their analysis (scope, convergence, speed of convergence, strong and weak features, ...).
5 / 110

Preliminaries: Outline of the course, Advanced Continuous Optimization I

Anticipated program (we'll see...)
1 Background in convex analysis and optimization
  Convex analysis: dual cone, Farkas lemma, asymptotic cone, ...
  Nonsmooth analysis: multifunction, subdifferential, ...
  Optimization: optimality conditions, ...
2 Abstract optimality conditions
  First order optimality conditions for (P_G).
  Second order optimality conditions for (P_EI) (or (P_G)?).
3 Linearization methods (Newton)
  Josephy-Newton algorithm for nonlinear inclusions.
  SQP algorithm for (P_EI).
  Semismooth Newton algorithm for nonlinear equations.
6 / 110
4 Preliminaries Outline of the course Advanced Continuous Optimization I 4 Proximal and augmented Lagrangian methods Proximal algorithm for solving inclusions and optimization problems Augmented Lagrangian algorithm in optimization Application to convex quadratic optimization (error bound and global linear convergence) 5 Semidefinite optimization theory (optimality conditions, duality,... ), algorithm (interior point methods,... ). 7 / 110 Preliminaries Outline of the course Advanced Continuous Optimization II Goal: mastering the implementation of an algorithm Choose one of the following topics (20h) 1 Implementation of an SQP algorithm, application to the hanging chain problem 2 Implementation of an SDO algorithm, application to the OPF problem (Rank relaxation of its QCQP modelling) 8 / 110
Background: Convex analysis (notation)

R̄ := R ∪ {−∞, +∞}.
R_+ := {t ∈ R : t ≥ 0} and R_{++} := {t ∈ R : t > 0}.
B, B̄: open and closed unit balls centered at the origin; for r ≥ 0: B(x,r) = x + rB and B̄(x,r) = x + rB̄.
E, F, G usually denote Euclidean vector spaces.
10 / 110

Background: Convex analysis (relative interior I)

The affine hull of P ⊆ E is the smallest affine space containing P:
  aff P := ∩ {A : A is an affine space containing P}.
The relative interior of P ⊆ E is its interior in aff P:
  ri P := {x ∈ P : ∃ r > 0 such that [B(x,r) ∩ aff P] ⊆ P}.
In finite dimension, there holds
  C convex and nonempty ⟹ ri C ≠ ∅ and aff C = aff(ri C).
11 / 110
Background: Convex analysis (relative interior II)

Proposition (relative interior criterion). Let C be a nonempty convex set and x ∈ E. Then
- x ∈ ri C and y ∈ C̄ ⟹ [x,y[ ⊆ ri C,
- x ∈ ri C ⟺ ∀ x⁰ ∈ C (or aff C), ∃ t > 1 : (1−t)x⁰ + tx ∈ C.

Let C be a nonempty convex set. Then
- ri C is convex, C̄ is convex and aff C̄ = aff C,
- cl(ri C) = C̄ and ri(C̄) = ri C (i.e., the last operation prevails).

A point x ∈ C is said absorbing if ∀ d ∈ E, ∃ t > 0 such that x + td ∈ C.
  x ∈ int C ⟺ x is absorbing.
12 / 110

Background: Convex analysis (dual cone and Farkas lemma)

The (positive) dual cone of a set P ⊆ E is defined by
  P^+ := {d ∈ E : ⟨d,x⟩ ≥ 0, ∀ x ∈ P}.
The (negative) dual cone of a set P is P^− := −P^+.

Lemma (Farkas, generalized). Let E and F be two Euclidean spaces, A : E → F a linear map, and K a nonempty convex cone of E. Then
  cl(A(K)) = {y ∈ F : A*y ∈ K^+}^+.

- A* : F → E is defined by: ∀ (x,y) ∈ E × F, ⟨A*y, x⟩ = ⟨y, Ax⟩.
- One cannot get rid of the closure on A(K) in general.
- If K is polyhedral, then A(K) is polyhedral, hence closed.
- For K = E, one recovers R(A) = N(A*)^⊥.
13 / 110
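The polyhedral case of the lemma (K = R^n_+, so that the closure on A(K) can be dropped) is the classical Farkas alternative. A minimal numerical sketch, with made-up data (the matrix A and the points below are illustrative only): membership in A(K) is certified by some x ≥ 0 with Ax = b, and non-membership by a y with A*y ∈ K^+ yet ⟨b, y⟩ < 0.

```python
import numpy as np

# Generalized Farkas with K = R^3_+ (polyhedral, hence A(K) closed):
# b ∈ A(K)  iff  every y with A^T y >= 0 (i.e. A*y ∈ K^+) satisfies <b, y> >= 0.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])      # A : R^3 -> R^2; here A(K) is the first quadrant

b_in = np.array([2.0, 3.0])          # belongs to A(K)
x_cert = np.array([2.0, 3.0, 0.0])   # x >= 0 with A x = b_in: membership certificate

b_out = np.array([-1.0, 1.0])        # outside A(K)
y_cert = np.array([1.0, 0.0])        # A^T y >= 0 yet <b_out, y> < 0: separation certificate

member_ok = np.allclose(A @ x_cert, b_in) and bool((x_cert >= 0).all())
separation_ok = bool((A.T @ y_cert >= 0).all()) and b_out @ y_cert < 0
```

The two certificates cannot coexist for the same b: that exclusivity is exactly the content of the lemma in this polyhedral case.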
Background: Convex analysis (tangent and normal cones)

Tangent cone. Let C be a convex set of E and x ∈ C. The cone of feasible directions for C at x is
  T^f_x C := R_+(C − x).
The tangent cone to C at x is the closure of the previous one:
  T_x C ≡ T_C(x) := cl(R_+(C − x)).

Normal cone. Let C be a convex set of E and x ∈ C. The normal cone to C at x is
  N_x C ≡ N_C(x) := {d ∈ E : ⟨x' − x, d⟩ ≤ 0, ∀ x' ∈ C}.
There hold
  N_x C = (T_x C)^− and T_x C = (N_x C)^−.
14 / 110

Background: Convex analysis (asymptotic cone I)

Let E be a vector space of finite dimension and C be a nonempty closed convex set of E. The asymptotic cone of C is
  C^∞ := {d ∈ E : C + R_+ d ⊆ C} = {d ∈ E : C + d ⊆ C}.

Properties
- C^∞ is closed.
- For any x ∈ C:
  C^∞ = {d ∈ E : x + R_+ d ⊆ C} = ∩_{t>0} (C − x)/t
      = {d ∈ E : ∃ {x_k} ⊆ C, ∃ {t_k} → ∞ such that x_k/t_k → d}.
- Boundedness by calculation: C is bounded ⟺ C^∞ = {0}.
15 / 110
Background: Convex analysis (asymptotic cone II)

Two calculation rules (there are many more)
- If K ≠ ∅ is a closed convex cone, then K^∞ = K.
- For an arbitrary collection {C_i}_{i∈I} of closed convex sets C_i with nonempty intersection: (∩_{i∈I} C_i)^∞ = ∩_{i∈I} C_i^∞.

Example. Let A : E → F and B : E → G be linear maps, a ∈ F, b ∈ G, K ⊆ G be a closed convex cone, and
  P = {x ∈ E : Ax = a, Bx ∈ b + K}.
Then P^∞ = {d ∈ E : Ad = 0, Bd ∈ K}.
16 / 110

Background: Convex analysis (convex polyhedron)

A convex polyhedron in E is a set of the form P = {x ∈ E : Ax ≤ b}, where A : E → R^m is a linear map and b ∈ R^m. It is a closed set. For x ∈ P, define the active index set
  I(x) := {i ∈ [1:m] : (Ax − b)_i = 0}.

- ({x_k} → x) ⟹ I(x_k) ⊆ I(x) for large k.
- If T : E → F is linear, then T(P) is a convex polyhedron.
- If P_1 and P_2 are polyhedra, then P_1 + P_2 is a polyhedron.
- T_x P = T^f_x P = {d ∈ E : (Ad)_{I(x)} ≤ 0}.
- I(x_1) ⊆ I(x_2) ⟹ T_{x_1} P ⊇ T_{x_2} P.
- N_x P = cone{A*e_i : i ∈ I(x)}.
- I(x_1) ⊆ I(x_2) ⟹ N_{x_1} P ⊆ N_{x_2} P.
17 / 110
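For a concrete polyhedron, the active set I(x) and the tangent and normal cone formulas above are directly computable. A minimal sketch on a hypothetical example, P = {x ∈ R² : x ≥ 0, x₁ + x₂ ≤ 2}, at the vertex x = 0:

```python
import numpy as np

# P = {x : A x <= b} encoding x >= 0 and x1 + x2 <= 2.
A = np.array([[-1.0,  0.0],
              [ 0.0, -1.0],
              [ 1.0,  1.0]])
b = np.array([0.0, 0.0, 2.0])

x = np.array([0.0, 0.0])                     # a vertex of P
I = np.flatnonzero(np.isclose(A @ x, b))     # active set I(x): rows with (Ax - b)_i = 0

def in_tangent(d, tol=1e-12):
    # T_x P = {d : (A d)_{I(x)} <= 0}
    return bool((A[I] @ d <= tol).all())

tangent_ok = in_tangent(np.array([1.0, 0.5])) and not in_tangent(np.array([-1.0, 0.0]))

# N_x P = cone{A^T e_i : i in I(x)}: any nonnegative combination of active rows
nu = np.array([1.0, 2.0])                    # nu >= 0
n = A[I].T @ nu                              # an element of N_x P
# a normal direction makes an obtuse angle with every x' - x, x' ∈ P
normal_ok = all(n @ (np.array(p) - x) <= 0 for p in [(2.0, 0.0), (0.0, 2.0), (1.0, 1.0)])
```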
Background: Nonsmooth analysis (multifunction I)

A multifunction T (or set-valued mapping) between two sets E and F is a function from E to P(F), the set of the subsets of F. Notation: T : E ⊸ F : x ↦ T(x) ⊆ F. Same concept as a binary relation (i.e., the data of a part of E × F).

The graph, the domain, and the range of T : E ⊸ F are defined by
  G(T) := {(x,y) ∈ E × F : y ∈ T(x)},
  D(T) := {x ∈ E : (x,y) ∈ G(T) for some y ∈ F} = π_E G(T),
  R(T) := {y ∈ F : (x,y) ∈ G(T) for some x ∈ E} = π_F G(T).
The image of a part P ⊆ E by T is T(P) := ∪_{x∈P} T(x).
19 / 110

Background: Nonsmooth analysis (multifunction II)

The inverse of a multifunction T : E ⊸ F (it always exists!) is the multifunction T^{-1} : F ⊸ E defined by
  T^{-1}(y) := {x ∈ E : y ∈ T(x)}.
Hence y ∈ T(x) ⟺ x ∈ T^{-1}(y).

When E, F are topological/metric spaces, a multifunction T : E ⊸ F is said to be
- closed at x̄ ∈ E if ȳ ∈ T(x̄) whenever {(x_k, y_k)} ⊆ G(T) converges to (x̄, ȳ),
- closed if G(T) is closed in E × F (i.e., T is closed at any x ∈ E),
- upper semicontinuous at x ∈ E if ∀ ε > 0, ∃ δ > 0, ∀ x' ∈ x + δB, there holds T(x') ⊆ T(x) + εB.
20 / 110
Background: Nonsmooth analysis (multifunction III)

When E, F are vector spaces, a multifunction T : E ⊸ F is said to be convex if G(T) is convex in E × F. This is equivalent to saying that, ∀ (x_0, x_1) ∈ E² and ∀ t ∈ [0,1]:
  T((1−t)x_0 + tx_1) ⊇ (1−t)T(x_0) + tT(x_1).
Note that
  T convex and C convex in E ⟹ T(C) convex in F.
21 / 110

Background: Optimization (a generic problem)

One considers the generic optimization problem
  (P_X)  min f(x)  subject to  x ∈ X,
where f : E → R (E is a Euclidean vector space) and X is a set of E (possibly nonconvex).

Definitions:
- solution or (global) minimum: x* ∈ X such that ∀ x ∈ X, f(x*) ≤ f(x),
- local minimum: x* ∈ X such that ∃ V ∈ N(x*), ∀ x ∈ X ∩ V, f(x*) ≤ f(x),
- strict local/global minimum: x* such that f(x*) < f(x) above when x ≠ x*.
23 / 110
Background: Optimization (tangent cone to a nonconvex set)

A direction d ∈ E is tangent to X ⊆ E at x ∈ X (in the sense of Bouligand) if
  ∃ {x_k} ⊆ X, ∃ {t_k} ↓ 0 : (x_k − x)/t_k → d.
The tangent cone to X at x (in the sense of Bouligand) is the set of tangent directions. It is denoted by T_x X or T_X(x).

Properties. Let x ∈ X.
- T_x X is closed.
- X is convex ⟹ T_x X is convex and T_x X = cl(R_+(X − x)).
24 / 110

Background: Optimization (Peano-Kantorovich NC1)

Theorem (Peano-Kantorovich NC1). If x* is a local minimizer of (P_X) and f is differentiable at x*, then
  ∇f(x*) ∈ (T_{x*} X)^+.

The gradient of f at x is denoted by ∇f(x) ∈ E and is defined from the derivative f'(x) by
  ∀ d ∈ E : ⟨∇f(x), d⟩ = f'(x)·d.
25 / 110
Background: Optimization (NSC1 for convex problems)

Recall that a convex function f : E → R ∪ {+∞} has directional derivatives f'(x;d) ∈ R̄ for all x ∈ dom f and all d ∈ E.

Proposition (NSC1 for a convex problem). Suppose that X is convex, f is convex on X, and x* ∈ X. Then x* is a global solution to (P_X) if and only if
  ∀ x ∈ X : f'(x*; x − x*) ≥ 0.

Proof. Straightforward, using the convexity inequality
  ∀ x ∈ X : f(x) ≥ f(x*) + f'(x*; x − x*).
26 / 110

Background: Optimization (problem (P_E))

Let E, F be Euclidean vector spaces. The equality constrained problem is
  (P_E)  inf_x f(x)  subject to  c(x) = 0,
where f : E → R and c : E → F are smooth (possibly nonconvex) functions. The feasible set is denoted by
  X_E := {x ∈ E : c(x) = 0}.
(P_E) is said to be convex if f is convex and c is affine.
27 / 110
Background: Optimization (problem (P_E), Lagrange optimality conditions)

Theorem (NC1 for (P_E), Lagrange, XVIIIth). If x* is a local minimum of (P_E), if f and c are differentiable at x*, and if c is qualified for representing X_E at x* in the sense (2) below, then there exists a multiplier λ* ∈ F such that
  ∇_x l(x*, λ*) = 0,  (1a)
  c(x*) = 0.  (1b)

Some explanations.
- The constraint c is qualified for representing X_E at x ∈ X_E if
  T_x X_E = T'_x X_E := N(c'(x)).  (2)
- Qualification holds if c'(x) is surjective (sufficient condition of CQ).
- The Lagrangian of (P_E) is the function l : (x,λ) ∈ E × F ↦ l(x,λ) = f(x) + ⟨λ, c(x)⟩.
28 / 110

Background: Optimization (problem (P_E), second order optimality conditions)

Theorem (NC2 for (P_E)). If x* is a local minimum of (P_E), if f and c are twice differentiable at x*, and if (1a) holds for some λ* ∈ F, then
  ∀ d ∈ T_{x*} X_E : ⟨∇²_{xx} l(x*, λ*) d, d⟩ ≥ 0.  (3)

- Inequality in (3) is not necessarily true for d ∈ N(c'(x*)) \ T_{x*} X_E.
- ∇²_{xx} l(x*, λ*) is not necessarily positive semi-definite (even if qualification holds).

Theorem (SC2 for (P_E)). If f and c are twice differentiable at x*, if (1) holds for some λ* ∈ F, and if
  ∀ d ∈ T_{x*} X_E \ {0} : ⟨∇²_{xx} l(x*, λ*) d, d⟩ > 0,  (4)
then x* is a strict local minimum of (P_E).

- (4) is stronger (hence the conclusion holds) if the inequality holds ∀ d ∈ N(c'(x*)) \ {0}.
29 / 110
Background: Optimization (problem (P_EI))

A generic form of the nonlinear optimization problem:
  (P_EI)  inf_x f(x)  subject to  c_E(x) = 0, c_I(x) ≤ 0,
where f : E → R, E and I form a partition of [1:m], and c_E : E → R^{m_E} and c_I : E → R^{m_I} are smooth (possibly nonconvex) functions. The feasible set is denoted by
  X_EI := {x ∈ E : c_E(x) = 0, c_I(x) ≤ 0}.
We say that an inequality constraint i ∈ I is active at x ∈ X_EI if c_i(x) = 0. The set of indices of active inequality constraints is denoted by
  I⁰(x) := {i ∈ I : c_i(x) = 0}.
(P_EI) is said to be convex if f is convex, c_E is affine, and c_I is componentwise convex.
30 / 110

Background: Optimization (problem (P_EI), NC1 or KKT conditions)

Theorem (NC1 for (P_EI), Karush-Kuhn-Tucker (KKT)). If x* is a local minimum of (P_EI), if f and c = (c_E, c_I) are differentiable at x*, and if c is qualified for representing X_EI at x* in the sense (6) below, then there exists a multiplier λ* ∈ R^m such that
  ∇_x l(x*, λ*) = 0,  (5a)
  c_E(x*) = 0,  (5b)
  0 ≤ (λ*)_I ⊥ c_I(x*) ≤ 0.  (5c)

Some explanations.
- The Lagrangian of (P_EI) is the function l : (x,λ) ∈ E × R^m ↦ l(x,λ) = f(x) + λ^T c(x).
- The complementarity condition (5c) means (λ*)_I ≥ 0, (λ*)_I^T c_I(x*) = 0, and c_I(x*) ≤ 0.
31 / 110
Background: Optimization (problem (P_EI), constraint qualification)

The tangent cone T_x X_EI is always contained in the linearizing cone
  T'_x X_EI := {d ∈ E : c'_E(x)·d = 0, c'_{I⁰(x)}(x)·d ≤ 0}.
It is said that the constraint c is qualified for representing X_EI at x if
  T_x X_EI = T'_x X_EI.  (6)

Sufficient conditions of qualification: continuity and/or differentiability and one of the following
(CQ-A) c_{E∪I⁰(x)} is affine near x (Affinity),
(CQ-S) c_E is affine, c_{I⁰(x)} is componentwise convex, and ∃ x̂ ∈ X_EI such that c_{I⁰(x)}(x̂) < 0 (Slater),
(CQ-LI) ∑_{i∈E∪I⁰(x)} α_i ∇c_i(x) = 0 ⟹ α = 0 (Linear Independence),
(CQ-MF) ∑_{i∈E∪I⁰(x)} α_i ∇c_i(x) = 0 and α_{I⁰(x)} ≥ 0 ⟹ α = 0 (Mangasarian-Fromovitz).
32 / 110

Background: Optimization (problem (P_EI), more on constraint qualification)

Proposition (other forms of (CQ-MF)). Suppose that c_{E∪I⁰(x)} is differentiable at x ∈ X_EI. Then the following properties are equivalent:
i (CQ-MF) holds at x,
ii ∀ v ∈ R^m, ∃ d ∈ E: c'_E(x)·d = v_E and c'_{I⁰(x)}(x)·d ≤ v_{I⁰(x)},
iii c'_E(x) is surjective and ∃ d ∈ E: c'_E(x)·d = 0 and c'_{I⁰(x)}(x)·d < 0.
33 / 110
Background: Optimization (problem (P_EI), SC1 for convex problem)

Theorem (SC1 for convex (P_EI)). If (P_EI) is convex, if f and c are differentiable at x* ∈ X_EI, and if there is a λ* ∈ R^m such that (x*, λ*) satisfies (5a)-(5c), then x* is a global minimum of (P_EI).

- No need of constraint qualification.
- The goal of the first part of this course is to extend the previous NC1 and SC1 to a more general problem and to derive second order optimality conditions.
34 / 110

Background: Optimization (linear optimization duality)

Let c ∈ E (a Euclidean vector space), A : E → R^m and B : E → R^p linear, a ∈ R^m, and b ∈ R^p. A linear optimization problem (P_L) and its dual (D_L) read
  (P_L)  inf_{x∈E} ⟨c,x⟩  subject to  Ax = a, Bx ≤ b,
  (D_L)  sup_{(y,s)∈R^m×R^p} a^T y − b^T s  subject to  A*y − B*s = c, s ≥ 0.

Properties
- (P_L) has a solution ⟺ val(P_L) ∈ R,
- val(D_L) ≤ val(P_L) [named weak duality],
- (P_L), (D_L) feasible ⟹ Sol(P_L) ≠ ∅ and Sol(D_L) ≠ ∅.  (7)
- When (7) holds, val(D_L) = val(P_L) [named strong duality].
35 / 110
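A tiny instance of (P_L)/(D_L) with made-up data illustrates both duality properties: the primal and dual points below are feasible and their values coincide (strong duality), which in particular certifies optimality of both.

```python
import numpy as np

# (P_L) inf <c,x> s.t. Ax = a, Bx <= b;  (D_L) sup a^T y - b^T s s.t. A*y - B*s = c, s >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]]); a = np.array([1.0])   # x1 + x2 = 1
B = -np.eye(2);             b = np.zeros(2)       # -x <= 0, i.e. x >= 0

x_opt = np.array([1.0, 0.0])          # primal candidate, value 1
y_opt = np.array([1.0])               # dual variable for Ax = a
s_opt = np.array([0.0, 1.0])          # dual variables for Bx <= b

primal_feasible = np.allclose(A @ x_opt, a) and bool((B @ x_opt <= b).all())
dual_feasible = np.allclose(A.T @ y_opt - B.T @ s_opt, c) and bool((s_opt >= 0).all())
val_P = c @ x_opt                     # = 1
val_D = a @ y_opt - b @ s_opt         # = 1 as well: no duality gap
```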
Background: Algorithmics (speeds of convergence)

Let E be a normed space and {x_k} ⊆ E be a sequence converging to x*.
- {x_k} is said to converge linearly if ∃ r ∈ [0,1[ and K ∈ N such that ∀ k ≥ K, there holds
  ‖x_{k+1} − x*‖ ≤ r ‖x_k − x*‖.
  Depends on the norm of E.
- {x_k} is said to converge superlinearly if
  ‖x_{k+1} − x*‖ = o(‖x_k − x*‖).
  Independent of the norm of E. Faster than linear convergence. Typical of the quasi-Newton methods.
- {x_k} is said to converge quadratically if
  ‖x_{k+1} − x*‖ = O(‖x_k − x*‖²).
  Independent of the norm of E. Faster than superlinear convergence. Typical of Newton's method.
37 / 110

Background: Algorithmics (Dennis & Moré criterion for superlinear convergence)

Let F : E → F and consider the nonlinear system to solve in x ∈ E: F(x) = 0. A quasi-Newton algorithm locally generates a sequence {x_k} by the recurrence
  F(x_k) + M_k (x_{k+1} − x_k) = 0,  (8)
where M_k ∈ L(E,F) is an approximation of F'(x_k), generated by the algorithm.

Proposition (Dennis & Moré criterion for superlinear convergence). If
- F is differentiable at a zero x* of F,
- F'(x*) is nonsingular,
- {x_k} generated by (8) converges to x*,
then the convergence is superlinear if and only if
  (M_k − F'(x*)) (x_{k+1} − x_k) = o(‖x_{k+1} − x_k‖).
38 / 110
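The three speeds can be observed on synthetic error sequences (a sketch with arbitrary model recursions and x* = 0): the ratio e_{k+1}/e_k stays constant in the linear case, tends to 0 in the superlinear case, and e_{k+1}/e_k² stays bounded in the quadratic case.

```python
import numpy as np

def errors(step, e0=0.5, n=6):
    # iterate e_{k+1} = step(e_k), recording the error history
    e = [e0]
    for _ in range(n):
        e.append(step(e[-1]))
    return e

lin  = errors(lambda e: 0.5 * e)       # linear with ratio r = 0.5
sup  = errors(lambda e: e ** 1.5)      # superlinear: e_{k+1}/e_k = e_k^0.5 -> 0
quad = errors(lambda e: e ** 2)        # quadratic: e_{k+1}/e_k^2 = 1

lin_ratios = [e2 / e1 for e1, e2 in zip(lin, lin[1:])]
sup_ratios = [e2 / e1 for e1, e2 in zip(sup, sup[1:])]
quad_quots = [e2 / e1**2 for e1, e2 in zip(quad, quad[1:])]
```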
Background: Algorithmics (global convergence in unconstrained optimization, line-search)

Consider the problem of minimizing a function f : E → R (no constraint): min_{x∈E} f(x). Global convergence means only g_k := ∇f(x_k) → 0 (or even something weaker). It can be obtained with line-search or trust-region algorithms.

A line-search algorithm generates a sequence {x_k} as follows.
- Choice of a descent direction at x_k, i.e., d_k ∈ E such that f'(x_k)·d_k < 0 (standard examples: gradient, conjugate gradient, Newton, quasi-Newton, Gauss-Newton).
- Determination of a step-size α_k > 0 by a line-search rule to force the decrease of f along d_k (standard examples: Cauchy, Armijo, Wolfe).
- x_{k+1} := x_k + α_k d_k.

Proposition (global convergence by line-search). If {f(x_k)} is bounded below, then
  ∑_{k≥1} ‖g_k‖² cos² θ_k < +∞,
where θ_k := arccos(−⟨g_k, d_k⟩/(‖g_k‖ ‖d_k‖)) is the angle between d_k and −g_k.
39 / 110

Background: Algorithmics (global convergence in unconstrained optimization, trust-region)

A trust-region algorithm generates a sequence {x_k} as follows.
- Choice of a model ϕ of f around x_k, usually quadratic: ϕ(s) := ⟨g_k, s⟩ + ½ ⟨M_k s, s⟩ (standard models: gradient, Newton, quasi-Newton, Gauss-Newton).
- Determination of a trust radius Δ_k > 0 such that f(x_k + s_k) is sufficiently smaller than f(x_k), where
  s_k ∈ arg min {ϕ(s) : ‖s‖ ≤ Δ_k}.
- x_{k+1} := x_k + s_k.

Proposition (global convergence by trust-region). If {f(x_k)} is bounded below and {M_k} is bounded, then g_k → 0.
40 / 110
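A minimal line-search sketch, using the gradient direction and Armijo backtracking on a hypothetical strongly convex quadratic (the matrix Q and the constants are made up), exhibiting the announced global behavior g_k → 0:

```python
import numpy as np

# f(x) = 0.5 x^T Q x, minimizer x* = 0 (illustrative test problem)
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])
def f(x):    return 0.5 * x @ Q @ x
def grad(x): return Q @ x

x = np.array([2.0, -1.0])
omega, tau = 1e-4, 0.5                  # Armijo slope and backtracking factor
for _ in range(200):
    g = grad(x)
    d = -g                              # descent direction: f'(x) d = -||g||^2 < 0
    alpha = 1.0
    # backtrack until the Armijo decrease condition holds
    while f(x + alpha * d) > f(x) + omega * alpha * (g @ d):
        alpha *= tau
    x = x + alpha * d

grad_norm = np.linalg.norm(grad(x))     # should be (numerically) 0
```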
Background: Algorithmics (global convergence for nonlinear systems)

Let F : E → E and consider the problem of finding x such that F(x) = 0. No algorithm with global convergence (exception, e.g., when F is polynomial, but the methods are expensive or for rather small dimensions). A less ambitious goal: minimizing the least-squares objective
  min_{x∈E} ϕ(x) := ½ ‖F(x)‖²₂.

Amazing but misleading fact: if ∇ϕ(x_k) ≠ 0, the Gauss-Newton direction
  d^GN_k ∈ arg min_{d∈E} ‖F(x_k) + F'(x_k) d‖²₂
is a descent direction of ϕ at x_k. May yield false convergence. It is better to use the trust-region approach: x_{k+1} = x_k + s_k, where
  s_k ∈ arg min_{‖s‖₂ ≤ Δ_k} ‖F(x_k) + F'(x_k) s‖²₂,
with well adapted trust radius Δ_k > 0.
41 / 110

Optimality conditions: First order optimality conditions for (P_G) (the problem) [WP.fr]

We consider the problem
  (P_G)  min f(x)  subject to  c(x) ∈ G,
where
- f : E → R (E is a Euclidean vector space),
- c : E → F (F is another Euclidean vector space),
- G is a nonempty closed convex set in F.
The feasible set is denoted by X_G := {x ∈ E : c(x) ∈ G} = c^{-1}(G).
(P_G) is said to be convex if f is convex and T : E ⊸ F : x ↦ c(x) − G is convex (then X_G is convex).
43 / 110
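A sketch of the least-squares reformulation on a small hypothetical system (a circle and a diagonal line); here the pure Gauss-Newton step with no globalization already converges, because the starting point is close enough to a zero of F:

```python
import numpy as np

# phi(x) = 0.5 ||F(x)||^2 for the 2x2 system below; zero at (1/sqrt(2), 1/sqrt(2))
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 1.0,   # unit circle
                     x[0] - x[1]])              # diagonal

def J(x):
    return np.array([[2*x[0], 2*x[1]],
                     [1.0, -1.0]])

x = np.array([1.0, 0.5])
for _ in range(20):
    # Gauss-Newton direction: minimize ||F(x_k) + F'(x_k) d||^2 over d,
    # i.e. solve the normal equations J^T J d = -J^T F
    d = np.linalg.solve(J(x).T @ J(x), -J(x).T @ F(x))
    x = x + d                                    # unit step, no line search or trust region

phi = 0.5 * np.linalg.norm(F(x))**2
```

For a square system with zero residual, as here, Gauss-Newton coincides with Newton's method, hence the fast local convergence; the slide's warning concerns what can happen far from a solution.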
Optimality conditions: First order optimality conditions for (P_G) (tangent and linearizing cones, qualification)

Proposition (tangent and linearizing cones). If c is differentiable at x ∈ X_G, then
  T_x X_G ⊆ T'_x X_G := {d ∈ E : c'(x)·d ∈ T_{c(x)} G}.
T'_x X_G is called the linearizing cone to X_G at x. The equality T_x X_G = T'_x X_G is not guaranteed.

The constraint function c is said to be qualified for representing X_G at x if
  T_x X_G = T'_x X_G,  (9a)
  c'(x)* [(T_{c(x)} G)^+] is closed.  (9b)
44 / 110

Optimality conditions: First order optimality conditions for (P_G) (NC1)

Theorem (NC1 for (P_G)). If x* is a local minimum of (P_G), if f and c are differentiable at x*, and if c is qualified for representing X_G at x* in the sense (9a)-(9b), then
1 there exists a multiplier λ* ∈ F such that
  ∇f(x*) + c'(x*)* λ* = 0,  (10a)
  λ* ∈ N_{c(x*)} G,  (10b)
2 if, furthermore, G ≡ K is a convex cone, then (10b) can be written
  K^− ∋ λ* ⊥ c(x*) ∈ K  or  c(x*) ∈ N_{λ*} K^−.  (10c)

One recognizes in (10a) the gradient with respect to x of the Lagrangian
  l : E × F → R : (x,λ) ↦ l(x,λ) = f(x) + ⟨λ, c(x)⟩
and in (10c) the complementarity conditions.
45 / 110
Optimality conditions: First order optimality conditions for (P_G) (SC1 for convex problem)

Theorem (SC1 for convex (P_G)). If (P_G) is convex, if f and c are differentiable at x* ∈ X_G, and if there is a λ* ∈ F such that (x*, λ*) satisfies (10a)-(10b), then x* is a global minimum of (P_G).

The goal of the next slides is to highlight and analyze a condition (Robinson's condition) that
1 provides an error bound for the perturbed feasible set {x ∈ E : c(x) ∈ y + G} (y small),
2 claims the stability of the feasible set X_G,
3 ensures that c is qualified for representing X_G.
46 / 110

Optimality conditions: First order optimality conditions for (P_G) (Robinson's condition) [WP.fr]

We say that Robinson's condition holds at x ∈ X_G if
  (CQ-R)  0 ∈ int( c(x) + c'(x)·E − G ).  (11)
We will see below that it is useful since it
- provides an error bound for the small perturbations {x ∈ E : c(x) ∈ y + G} of X_G,
- claims the stability of the feasible set X_G with respect to small perturbations,
- shows that the constraint function c is qualified for representing X_G at x, in the sense (9a)-(9b),
- generalizes to (P_G) the Mangasarian-Fromovitz constraint qualification (CQ-MF) for (P_EI).
47 / 110
Optimality conditions: First order optimality conditions for (P_G) (Robinson's error bound I) [16]

Theorem ((CQ-R) and metric regularity). If c is continuously differentiable near x_0 ∈ X_G, then the following properties are equivalent:
1 (CQ-R) holds at x = x_0,
2 there exists a constant µ ≥ 0 such that, for (x,y) near (x_0, 0):
  dist(x, c^{-1}(y + G)) ≤ µ dist(c(x), y + G).  (12)

Condition 2 is named metric regularity since it is equivalent to that property (see its definition below) for the multifunction T : E ⊸ F : x ↦ c(x) − G.
48 / 110

Optimality conditions: First order optimality conditions for (P_G) (Robinson's error bound II) [16]

An error bound is an estimation of the distance to a set by a quantity easier to compute.

Corollary (Robinson's error bound). If c is continuously differentiable near x_0 ∈ X_G and if (CQ-R) holds at x = x_0, then there exists a constant µ ≥ 0 such that, for x near x_0:
  dist(x, X_G) ≤ µ dist(c(x), G).  (13)

- dist(x, X_G) is often difficult to evaluate in E,
- dist(c(x), G) is often easier to evaluate in F (it is the case if c(x) is easy to evaluate and G is simple),
- useful in theory (e.g., for proving that (CQ-R) implies constraint qualification), in algorithmics (e.g., for proving [speed of] convergence) or in practice (for estimating dist(x, X_G)).
49 / 110
Optimality conditions: First order optimality conditions for (P_G) (stability wrt small perturbations)

The estimate (12) readily implies the following stability result.

Corollary (stability wrt small perturbations). If c is continuously differentiable near some x_0 ∈ X_G and if (CQ-R) holds at x_0, then, for all small y ∈ F:
  {x ∈ E : c(x) ∈ y + G} ≠ ∅.
50 / 110

Optimality conditions: First order optimality conditions for (P_G) (open multifunction theorem I) [20, 18, 5]

Let T : E ⊸ F be a multifunction and B̄_E and B̄_F be the closed unit balls of E and F.
- T is open at (x_0, y_0) ∈ G(T) with ratio ρ > 0 if there exist a neighborhood W of (x_0, y_0) and a radius r_max > 0 such that, for all (x,y) ∈ W ∩ G(T) and r ∈ [0, r_max]:
  y + ρr B̄_F ⊆ T(x + r B̄_E).
- T is metric regular at (x_0, y_0) ∈ G(T) with modulus µ > 0 if, for all (x,y) near (x_0, y_0):
  dist(x, T^{-1}(y)) ≤ µ dist(y, T(x)).
Difference: (x,y) ∈ G(T) for the openness, but not for the metric regularity.

[Figure: two panels on G(T); openness measures y + ρr B̄_F ⊆ T(x + r B̄_E) from inside the graph (slope ρ), metric regularity compares dist(x, T^{-1}(y)) with dist(y, T(x)) from outside (slope 1/µ).]

Both estimate the change in T(x) with x (but these are not infinitesimal notions), either from inside G(T) (openness) or outside (metric regularity).
51 / 110
Optimality conditions: First order optimality conditions for (P_G) (open multifunction theorem II) [20, 18, 5]

Extension of the open mapping theorem for linear (continuous) maps to (nonlinear) convex multifunctions.

Theorem (open multifunction theorem, finite dimension). If T : E ⊸ F is convex and (x_0, y_0) ∈ G(T), then the following properties are equivalent:
1 y_0 ∈ int R(T),
2 for all r > 0, y_0 ∈ int T(x_0 + r B̄_E),
3 T is open at (x_0, y_0) with ratio ρ > 0,
4 T is metric regular at (x_0, y_0) with modulus µ > 0.
One can take µ = 1/ρ in point 4 if ρ is given by point 3.
52 / 110

Optimality conditions: First order optimality conditions for (P_G) (metric regularity diffusion I) [6]

- ϕ : E → R_+ is subcontinuous if, for all sequences {x_k} → x̄ such that ϕ(x_k) → 0, there holds ϕ(x̄) = 0.
- ϕ : E → R_+ is r-steep at x ∈ E, with r ∈ [0,1[, if for all x' ∈ B̄(x, ϕ(x)/(1−r)), there exists x'' ∈ B̄(x', ϕ(x')) such that ϕ(x'') ≤ r ϕ(x').

Lemma (Lyusternik [15, 6]). If ϕ : E → R_+ is subcontinuous and r-steep at x ∈ dom ϕ, with r ∈ [0,1[, then ϕ has a zero in B̄(x, ϕ(x)/(1−r)).
53 / 110
25 Optimality conditions First order optimality conditions for (P G ) (metric regularity diffusion II) [6] Theorem (metric regularity diffusion) Suppose that c : E F a continuous function, G a closed convex set of F, T : E F : x c(x) G is µ-metric regular at (x 0,y 0 ) G(T), δ : E F is Lipschitz near x 0 with modulus L < 1/µ, T : E F : x c(x)+δ(x) G. Then T is also metric regular at (x 0,y 0 +δ(x 0 )) G( T) with modulus µ/(1 Lµ): for all (x,y) near (x 0,y 0 +δ(x 0 )), there holds No need of convexity. dist(x, T 1 (y)) µ 1 Lµ dist(y, T(x)). (14) 54 / 110 Optimality conditions First order optimality conditions for (P G ) (qualification with (CQ-R) I) Proposition (other forms of (CQ-R)) Suppose that c is differentiable at x X G. Then the following properties are equivalent i 0 int(c(x)+c (x)e G) [this is (CQ-R)], ii c (x)e T f c(x) G = F, iii c (x)e T c(x) G = F, iv c (x)e T c(x) G = F. 55 / 110
Optimality conditions: First order optimality conditions for (P_G) (qualification with (CQ-R) II)

Proposition (qualification with (CQ-R)). Suppose that c is continuously differentiable near x ∈ X_G and that (CQ-R) holds at x. Then c is qualified for representing X_G at x.

The figure below shows how (CQ-R) is used to create an appropriate sequence {x_k} in X_G := c^{-1}(G) to get qualification (the figure requires some oral explanations, though...).

[Figure: a direction d ∈ T'_x X_G, a sequence x_k → x in X_G = c^{-1}(G), and the images c(x_k), y_k in G, with c'(x)·d ∈ T_{c(x)} G.]
56 / 110

Optimality conditions: First order optimality conditions for (P_G) (qualification with (CQ-R) III)

Proposition (Gauvin's boundedness property). Suppose that f and c are differentiable at x* ∈ X_G and that the set Λ* of multipliers λ* ∈ F satisfying (10a)-(10b) is nonempty. Then
1 (Λ*)^∞ = [c'(x*)·E − T_{c(x*)} G]^+,
2 Λ* is bounded if and only if (CQ-R) holds.

The property was originally established for problem (P_EI) and (CQ-MF) [8].
57 / 110
Optimality conditions: Second order optimality conditions for (P_EI) [WP.fr]

Let x* be a local solution to (P_EI), λ* be an associated optimal multiplier, and L* := ∇²_{xx} l(x*, λ*). In view of the second order optimality conditions of the equality constrained optimization problem (P_E), it is tempting to claim that
  ∀ d ∈ T_{x*} X_EI : ⟨L* d, d⟩ ≥ 0.
But this is not guaranteed!
- The right cone is not T_{x*} X_EI but the critical cone C* (to be defined).
- The multiplier λ* must be chosen, depending on d ∈ C*.
59 / 110

Optimality conditions: Second order optimality conditions for (P_EI) (critical cone)

Critical cone at x ∈ X_EI (it is a part of T'_x X_EI):
  C(x) := {d ∈ E : c'_E(x)·d = 0, c'_{I⁰(x)}(x)·d ≤ 0, f'(x)·d ≤ 0}.

Short notation for index sets:
  I⁰ := {i ∈ I : c_i(x*) = 0} := I⁰(x*),
  I⁰⁺ := {i ∈ I : c_i(x*) = 0, (λ*)_i > 0},
  I⁰⁰ := {i ∈ I : c_i(x*) = 0, (λ*)_i = 0}.

Other forms of the critical cone at a stationary pair (x*, λ*):
  C* = {d ∈ E : c'_E(x*)·d = 0, c'_{I⁰}(x*)·d ≤ 0, f'(x*)·d = 0}
     = {d ∈ E : c'_{E∪I⁰⁺}(x*)·d = 0, c'_{I⁰⁰}(x*)·d ≤ 0}.
60 / 110
28 Optimality conditions Second order optimality conditions for (P EI ) (three instructive examples) Three instructive examples 1 Strong NC2 (not always true): λ Λ, d C : L d,d 0. { min x2 x 2 x 2 1, 2 Semi-strong NC2 (not always true): λ Λ, d C : L d,d 0. min x 2 x 2 x1 2 x x Weak NC2 (always true): d C, λ Λ : L d,d 0. min x 3 x 3 (x 1 +x 2 )(x 1 x 2 ) x 3 (x 2 + 3x 1 )(2x 2 x 1 ) x 3 (2x 2 +x 1 )(x 2 3x 1 ). 61 / 110 Optimality conditions Second order optimality conditions (NC2) Notation at a stationary point x : Λ := {λ R m : (x,λ ) satisfies the KKT system (5)}. L := 2 xx l(x,λ ) for some specified λ Λ. Theorem (NC2 for (P EI )) Suppose that x is a local minimum of (P EI ), f and c E are C 2 near x, c I 0 is twice differentiable at x, c I\I 0 is continuous at x, (CQ-MF) holds at x, then d C, λ Λ such that L d,d 0. These conditions are named weak second order necessary conditions and also read d C : max L d,d 0. (15) λ Λ 62 / 110
Optimality conditions: Second order optimality conditions (SC2)

Theorem (SC2 for (P_EI)). Suppose that
- f and c_{E∪I⁰} are twice differentiable at x*,
- (x*, λ*) verifies the KKT conditions (5),
- the following equivalent conditions hold:
  ∀ d ∈ C* \ {0}, ∃ λ* ∈ Λ* : ⟨L* d, d⟩ > 0,  (16a)
  ∃ γ̄ > 0, ∀ d ∈ C*, ∃ λ* ∈ Λ* : ⟨L* d, d⟩ ≥ γ̄ ‖d‖²,  (16b)
then, ∀ γ ∈ [0, γ̄[, there is a neighborhood V of x* such that
  ∀ x ∈ (V \ {x*}) ∩ X_EI : f(x) > f(x*) + (γ/2) ‖x − x*‖².  (17)
In particular, x* is a strict local minimum of (P_EI).
63 / 110

Optimality conditions: Second order optimality conditions (SC2)

- Condition (17) is called the quadratic growth property.
- The property
  ∃ λ* ∈ Λ*, ∀ d ∈ C* \ {0} : ⟨L* d, d⟩ > 0
  is stronger than (16a)-(16b) and is called the semi-strong SC2.
- The even stronger property
  ∀ λ* ∈ Λ*, ∀ d ∈ C* \ {0} : ⟨L* d, d⟩ > 0
  is called the strong SC2.
64 / 110
Linearization methods: Overview

Two classes of linearization methods that are used to solve systems with nonsmoothness.
1 Methods that capture much of the local behavior of the system.
  Features: expensive iteration, fast convergence, easy to globalize.
  Examples: the Josephy-Newton algorithm for functional inclusions; the SQP algorithm for (P_EI).
2 Methods that use a single piece of the local behavior of the system.
  Features: cheap iteration, fast convergence, difficult to globalize.
  Example: the semismooth Newton algorithm for nonsmooth systems of equations.
65 / 110

Linearization methods: Josephy-Newton algorithm for functional inclusions (functional inclusion I) [WP.fr]

Let E and F be Euclidean vector spaces of the same dimension, F : E → F be a smooth function, and N : E ⊸ F be a multifunction. We consider the functional inclusion problem
  (P_FI)  0 ∈ F(x) + N(x).  (18)
Interested in algorithmic issues (not theoretical ones, like existence of solution).

Examples
1 The variational problem, if N = N_X, the normal cone to X ⊆ E at x (= ∅ if x ∉ X):
  (P_V)  0 ∈ F(x) + N_X(x).  (19)
2 The variational inequality problem is (P_V) with X = C (a convex set):
  (P_VI)  x ∈ C and ⟨F(x), y − x⟩ ≥ 0, ∀ y ∈ C.  (20)
3 The complementarity problem is (P_FI) with N = N_K ∘ G (closed convex cone K ⊆ F, G : E → F):
  (P_CP)  K ∋ G(x) ⊥ F(x) ∈ K^+.  (21)
67 / 110
Linearization methods: Josephy-Newton algorithm for inclusions (functional inclusion II) [WP.fr]

Examples (continued)
4 The Peano-Kantorovich NC1 of problem min {f(x) : x ∈ X} reads 0 ∈ ∇f(x) + N_X(x).
5 The first order optimality conditions for (P_G), when G is a closed convex cone, can be written like (P_CP) with the variable (x,λ) ∈ E × F,
  F(x,λ) = ( ∇f(x) + c'(x)*λ, −c(x) )  and  K = E × G^−.  (22)
6 If N is the constant multifunction x ↦ {0_{R^{m_E}}} × R^{m_I}_+ ⊆ R^m ≡ F, where E and I make a partition of [1:m], (P_FI) becomes the system of equalities and inequalities
  F_E(x) = 0 and F_I(x) ≤ 0.
68 / 110

Linearization methods: Josephy-Newton algorithm for inclusions (the JN algorithm) [13, 14] [WP.fr]

Algorithm (Josephy-Newton algorithm for solving (P_FI)). Given x_k, compute x_{k+1} as a solution to the problem in x:
  0 ∈ F(x_k) + M_k (x − x_k) + N(x),  (23)
where M_k = F'(x_k) or an approximation to it.

Remarks
- Only F is linearized, not N (this is the reason for the chosen structure of (P_FI)).
- (23) captures more information from (P_FI) than a simple linearization.
- (23) is often a nonlinear problem, hence yielding an expensive iteration.
- Makes sense computationally if N is sufficiently simple.
69 / 110
Linearization methods: Josephy-Newton algorithm for inclusions (semistability) [3]

A solution x* to (P_FI) is said semistable if ∃ σ_1 > 0 and σ_2 > 0 such that for any (x,p) satisfying
  p ∈ F(x) + N(x) and ‖x − x*‖ ≤ σ_1,
there holds ‖x − x*‖ ≤ σ_2 ‖p‖.

Remarks
- The perturbed inclusion p ∈ F(x) + N(x) is not required to have a solution.
- The property is demanding only for small perturbations p (i.e., ‖p‖ < σ_1/σ_2).
- Semistability implies that x* is the unique solution to (P_FI) on B(x*, σ_1).
- If N ≡ {0}, then semistability of x* ⟺ injectivity of F'(x*) (hence, nonsingularity if dim E = dim F).
70 / 110

Linearization methods: Josephy-Newton algorithm for inclusions (speed of convergence) [3]

Proposition (speed of convergence of quasi-Newton methods). Suppose that F is C¹ near a semistable solution x* to (P_FI). Let {x_k} be a sequence generated by algorithm (23), converging to x*.
1 If (M_k − F'(x*))(x_{k+1} − x_k) = o(‖x_{k+1} − x_k‖), then {x_k} converges superlinearly.
2 If (M_k − F'(x*))(x_{k+1} − x_k) = O(‖x_{k+1} − x_k‖²) and F is C^{1,1} near x*, then {x_k} converges quadratically.

Corollary (speed of convergence of Newton's method). Suppose that F is C¹ near a semistable solution x* to (P_FI). Let {x_k} be a sequence generated by algorithm (23) with M_k = F'(x_k), converging to x*. Then
1 {x_k} converges superlinearly,
2 if, furthermore, F is C^{1,1} near x*, then {x_k} converges quadratically.
71 / 110
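The JN iteration (23) with M_k = F'(x_k) can be sketched in one dimension for N = N_{R_+}: each subproblem is then a scalar linear complementarity problem that can be solved in closed form. The two scalar functions below are hypothetical examples; in the second, the solution sits on the boundary of R_+.

```python
# Josephy-Newton for the 1D inclusion 0 ∈ F(x) + N_{R_+}(x), i.e. the
# complementarity problem x >= 0, F(x) >= 0, x F(x) = 0.
def jn_solve(F, dF, x0, iters=20):
    x = x0
    for _ in range(iters):
        # linearized subproblem: x >= 0, g(x) = F(x_k) + F'(x_k)(x - x_k) >= 0, x g(x) = 0
        xn = x - F(x) / dF(x)        # the zero of the linearization g
        # if xn < 0, the subproblem solution is x = 0 (valid when g(0) >= 0,
        # which holds in the examples below)
        x = xn if xn >= 0 else 0.0
    return x

# F1(x) = x^2 + x - 2: solution x* = 1, with F1(1) = 0 (interior-type solution)
x1 = jn_solve(lambda x: x * x + x - 2.0, lambda x: 2 * x + 1.0, 3.0)
# F2(x) = x + 1: solution x* = 0, with F2(0) = 1 > 0 (boundary solution)
x2 = jn_solve(lambda x: x + 1.0, lambda x: 1.0, 5.0)
```

In the first example the iterates reduce to Newton's method on F1 and converge quadratically to 1; in the second, one linearized subproblem already lands exactly on the solution 0.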
33 Linearization methods Josephy-Newton algorithm for inclusions (semistability for polyhedral VI) [3]
Proposition (semistability characterization for polyhedral VI)
Consider problem (P_VI) with a closed convex polyhedron C and F being C¹ near a solution x*. Then the following properties are equivalent:
1 x* is semistable,
2 x* is an isolated solution to F(x*) + F'(x*)(x − x*) + N_C(x) ∋ 0,
3 there holds ⟨F'(x*)(x − x*), x − x*⟩ > 0 when x ∈ C \ {x*} satisfies ⟨F(x*), x − x*⟩ = 0 and F(x*) + F'(x*)(x − x*) + N_C(x) ∋ 0,
4 x* is the unique solution to N_C(x) ⊇ N_C(x*), ⟨F(x*), x − x*⟩ = 0, R_+ F(x*) + F'(x*)(x − x*) + N_C(x) ∋ 0. 72 / 110
Linearization methods Josephy-Newton algorithm for inclusions (local convergence) [3]
A solution x* to (P_FI) is said to be hemistable if, for all α > 0, there exists β > 0 such that, for all x_0 ∈ B(x*,β), the linearized inclusion in x
F(x_0) + F'(x_0)(x − x_0) + N(x) ∋ 0
has a solution in B(x*,α).
Theorem (local convergence of JN)
Suppose that F is C¹ near a semistable and hemistable solution x* to (P_FI). Then there exists ε > 0 such that if x_1 ∈ B(x*,ε), then
1 the JN algorithm (23) with M_k = F'(x_k) can generate {x_k} ⊂ B(x*,ε),
2 any sequence {x_k} generated in B(x*,ε) by the JN algorithm converges superlinearly to x* (quadratically if F is C^{1,1}). 73 / 110
34 Linearization methods The SQP algorithm (overview)
Recall the equality and inequality constrained problem
(P_EI) inf_x f(x), c_E(x) = 0, c_I(x) ≤ 0.
Linearization methods The SQP algorithm (definition - KKT is a nonlinear complementarity problem) [WP.fr]
Similarly to Newton's method in unconstrained optimization, the SQP algorithm is conceptually interested in solutions of the first-order optimality (KKT) system in (x,λ) of (P_EI):
∇_x l(x,λ) = 0, (24a)
c_E(x) = 0, (24b)
0 ≤ λ_I ⊥ c_I(x) ≤ 0. (24c)
This system in (x,λ) can be written like (P_FI) or (P_CP), namely
F(x,λ) + N_K(x,λ) ∋ 0 or K ∋ (x,λ) ⊥ F(x,λ) ∈ K⁺, (25)
with the data
F(x,λ) = (∇_x l(x,λ), −c(x)) and K = E × (R^{m_E} × R^{m_I}_+), (26)
N_K(x,λ) = {0_E} × ({0_{R^{m_E}}} × N_{R^{m_I}_+}(λ_I)), K⁺ = {0_E} × ({0_{R^{m_E}}} × R^{m_I}_+). 76 / 110
35 Linearization methods The SQP algorithm (definition - the SQP algorithm viewed as a JN method) [17] [WP.fr]
The SQP algorithm (SQP for Sequential Quadratic Programming) for solving (P_EI) is the JN algorithm (23) on the inclusion (25)-(26):
∇_x l(x_k,λ_k) + M_k(x − x_k) + c'(x_k)*(λ − λ_k) = 0, (27a)
c_E(x_k) + c'_E(x_k)(x − x_k) = 0, (27b)
0 ≤ λ_I ⊥ (c_I(x_k) + c'_I(x_k)(x − x_k)) ≤ 0, (27c)
where M_k = L_k := ∇²_xx l(x_k,λ_k) or an approximation to it (M_k ≠ ∇²f(x_k)!).
(27) is formed of the KKT conditions of the osculating quadratic problem
(OQP) min_x ⟨∇f(x_k), x − x_k⟩ + ½ ⟨M_k(x − x_k), x − x_k⟩, (28)
c_E(x_k) + c'_E(x_k)(x − x_k) = 0,
c_I(x_k) + c'_I(x_k)(x − x_k) ≤ 0.
The primal-dual solution (x_{k+1}, λ_{k+1}) to the OQP is the new iterate. 77 / 110
Linearization methods The SQP algorithm (definition - the local SQP algorithm) [10, 11] [WP.fr]
Algorithm (local SQP) From (x_k,λ_k) to (x_{k+1},λ_{k+1}):
1 If (24) holds at (x,λ) = (x_k,λ_k), stop.
2 Compute a primal-dual stationary point (x_{k+1},λ_{k+1}) of the OQP (28).
Remarks
The OQPs are still hard to solve (not just a linear system, expensive iteration):
If M_k = L_k is not positive semidefinite, the OQP is NP-hard.
One of the good reasons for taking M_k ≈ L_k with M_k ≻ 0, updated by a quasi-Newton method; the OQP is then polynomial, but still difficult.
Other (non local) difficulties to overcome:
What if the linearized constraints are incompatible?
What if the OQP is unbounded?
Nothing is done for forcing the convergence from remote starting points. 78 / 110
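In the purely equality-constrained case (I = ∅), condition (27c) disappears and one SQP iteration amounts to solving the linear system (27a)-(27b) in the step and the new multiplier, i.e., Newton's method on the KKT system. Below is a minimal Python sketch on a hypothetical instance, min x_1 + x_2 subject to x_1² + x_2² = 2, whose solution is x* = (−1,−1) with multiplier λ* = 1/2 (the function names are mine, not from the course).

```python
import numpy as np

# Hypothetical instance: min x1 + x2  s.t.  x1^2 + x2^2 - 2 = 0,
# with solution x* = (-1, -1) and multiplier lambda* = 1/2.
def f_grad(x):  return np.array([1.0, 1.0])
def c(x):       return np.array([x[0] ** 2 + x[1] ** 2 - 2.0])
def c_jac(x):   return np.array([[2.0 * x[0], 2.0 * x[1]]])
def hess_lag(x, lam):   # Hessian in x of l = f + lam' c; here = 2*lam*I
    return 2.0 * lam[0] * np.eye(2)

def local_sqp(x, lam, iters=15):
    """Local SQP with equality constraints only: each OQP (28) reduces to
    the linear KKT system  [L_k A'; A 0] [dx; lam_new] = -[grad f; c]."""
    n, m = len(x), len(lam)
    for _ in range(iters):
        Lk, A = hess_lag(x, lam), c_jac(x)
        K = np.block([[Lk, A.T], [A, np.zeros((m, m))]])
        rhs = -np.concatenate([f_grad(x), c(x)])
        sol = np.linalg.solve(K, rhs)
        x, lam = x + sol[:n], sol[n:]   # (x_{k+1}, lambda_{k+1})
    return x, lam

x_sol, lam_sol = local_sqp(np.array([-1.1, -0.9]), np.array([0.45]))
```

Started close enough to (x*, λ*), the iterates exhibit the quadratic convergence of the local theory; handling inequalities via (27c) would instead require solving a genuine QP at each iteration, as noted in the remarks above.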
36 Linearization methods The SQP algorithm (stability result for a polyhedral (P_K) I) [17]
Let P be a vector space. For p ∈ P, consider the perturbed problem
(P^p_K) min f(x,p), c(x,p) ∈ K,
where f : E × P → R, c : E × P → F, and K ⊆ F is a convex polyhedral cone.
Optimality system at x ∈ E: ∃λ ∈ F such that
∇_x f(x,p) + c'_x(x,p)*λ = 0, K⁻ ∋ λ ⊥ c(x,p) ∈ K. (29)
The multiplier multifunction Λ : E × P ⇉ F is defined at (x,p) ∈ E × P by
Λ(x,p) := {λ ∈ F : (x,p,λ) satisfies (29)}.
The stationary multifunction St : P ⇉ E is defined at p ∈ P by
St(p) := {x ∈ E : Λ(x,p) ≠ ∅}. 79 / 110
Linearization methods The SQP algorithm (stability result for a polyhedral (P_K) II)
Assume the framework defined above.
Proposition (stability of (P_K))
If
f'_x, c and c'_x are Lipschitz on N(x_0) × N(p_0),
(x_0,λ_0) satisfies the strong SC2 for (P^{p_0}_K),
(CQ-R) holds at x_0, meaning that 0 ∈ int(c(x_0,p_0) + c'_x(x_0,p_0)E − K),
then ∃W ∈ N(p_0), ∃L ≥ 0, ∀p ∈ W:
1 St(p) ≠ ∅,
2 ∀x ∈ St(p) near x_0, ∀λ ∈ Λ(x,p): dist((x,λ), {x_0} × Λ(x_0,p_0)) ≤ L ‖p − p_0‖. 80 / 110
37 Linearization methods The SQP algorithm (local convergence - semistability and hemistability of a KKT pair) [3]
Proposition (semistability and SC2)
If (x*,λ*) satisfies KKT, then the following properties are equivalent:
1 x* is a local minimizer and (x*,λ*) is semistable,
2 Λ* = {λ*} and x* satisfies SC2.
At a local solution, semistability implies hemistability:
Proposition (SC for hemistability)
If
x* is a local minimizer of (P_EI),
(x*,λ*) satisfies the KKT conditions,
(x*,λ*) is semistable,
then (x*,λ*) is hemistable. 81 / 110
Linearization methods The SQP algorithm (local convergence - the result) [3]
Theorem (local convergence of SQP)
If
f and c are C^{2,1} near a local minimizer x* of (P_EI),
there is a unique multiplier λ* associated with x*,
SC2 is satisfied,
then there exists a neighborhood V of (x*,λ*) such that if the first iterate (x_1,λ_1) ∈ V, then
1 the SQP algorithm can generate {(x_k,λ_k)} in V,
2 any sequence {(x_k,λ_k)} generated in V by the SQP algorithm converges quadratically to (x*,λ*). 82 / 110
38 References
[1] M.F. Anjos, J.B. Lasserre (2012). Handbook on Semidefinite, Conic and Polynomial Optimization. Springer. [doi]
[2] A. Ben-Tal, A. Nemirovski (2001). Lectures on Modern Convex Optimization – Analysis, Algorithms, and Engineering Applications. MPS-SIAM Series on Optimization 2. SIAM.
[3] J.F. Bonnans (1994). Local analysis of Newton-type methods for variational inequalities and nonlinear programming. Applied Mathematics and Optimization, 29. [doi]
[4] J.F. Bonnans, J.Ch. Gilbert, C. Lemaréchal, C. Sagastizábal (2006). Numerical Optimization – Theoretical and Practical Aspects (second edition). Universitext. Springer Verlag, Berlin. [authors] [editor] [doi]
[5] J.F. Bonnans, A. Shapiro (2000). Perturbation Analysis of Optimization Problems. Springer Verlag, New York.
[6] R. Cominetti (1990). Metric regularity, tangent sets, and second-order optimality conditions. Applied Mathematics and Optimization, 21(1). [doi]
[7] A.L. Dontchev, R.T. Rockafellar (2009). Implicit Functions and Solution Mappings – A View from Variational Analysis. Springer Monographs in Mathematics. Springer.
[8] J. Gauvin (1977). A necessary and sufficient regularity condition to have bounded multipliers in nonconvex programming. Mathematical Programming, 12.
[9] J.Ch. Gilbert (2015). Éléments d'Optimisation Différentiable – Théorie et Algorithmes. Syllabus de cours à l'ENSTA, Paris. [internet]
[10] S.-P. Han (1976). Superlinearly convergent variable metric algorithms for general nonlinear programming problems. Mathematical Programming, 11.
[11] S.-P. Han (1977). A globally convergent method for nonlinear programming. Journal of Optimization Theory and Applications, 22.
[12] A.F. Izmailov, M.V. Solodov (2014). Newton-Type Methods for Optimization and Variational Problems. Springer Series in Operations Research and Financial Engineering. Springer.
[13] N.H. Josephy (1979). Newton's method for generalized equations. Technical Summary Report 1965, Mathematics Research Center, University of Wisconsin, Madison, WI, USA.
[14] N.H. Josephy (1979). Quasi-Newton's method for generalized equations. Summary Report 1966, Mathematics Research Center, University of Wisconsin, Madison, WI, USA.
[15] L. Lyusternik (1934). Conditional extrema of functionals. Math. USSR-Sb., 41.
[16] S.M. Robinson (1976). Stability theory for systems of inequalities, part II: differentiable nonlinear systems. SIAM Journal on Numerical Analysis, 13.
[17] S.M. Robinson (1982). Generalized equations and their solutions, part II: applications to nonlinear programming. Mathematical Programming Study, 19.
[18] R.T. Rockafellar, R. Wets (1998). Variational Analysis. Grundlehren der mathematischen Wissenschaften 317. Springer.
[19] H. Wolkowicz, R. Saigal, L. Vandenberghe (editors) (2000). Handbook of Semidefinite Programming – Theory, Algorithms, and Applications. International Series in Operations Research & Management Science 27. Kluwer Academic Publishers.
[20] J. Zowe, S. Kurcyusz (1979). Regularity and stability for the mathematical programming problem in Banach spaces. Applied Mathematics and Optimization, 5(1). 110 / 110
More informationOptimisation in Higher Dimensions
CHAPTER 6 Optimisation in Higher Dimensions Beyond optimisation in 1D, we will study two directions. First, the equivalent in nth dimension, x R n such that f(x ) f(x) for all x R n. Second, constrained
More informationConvex Functions. Pontus Giselsson
Convex Functions Pontus Giselsson 1 Today s lecture lower semicontinuity, closure, convex hull convexity preserving operations precomposition with affine mapping infimal convolution image function supremum
More informationc 1998 Society for Industrial and Applied Mathematics
SIAM J. OPTIM. Vol. 9, No. 1, pp. 179 189 c 1998 Society for Industrial and Applied Mathematics WEAK SHARP SOLUTIONS OF VARIATIONAL INEQUALITIES PATRICE MARCOTTE AND DAOLI ZHU Abstract. In this work we
More informationOn constraint qualifications with generalized convexity and optimality conditions
On constraint qualifications with generalized convexity and optimality conditions Manh-Hung Nguyen, Do Van Luu To cite this version: Manh-Hung Nguyen, Do Van Luu. On constraint qualifications with generalized
More informationAN ABADIE-TYPE CONSTRAINT QUALIFICATION FOR MATHEMATICAL PROGRAMS WITH EQUILIBRIUM CONSTRAINTS. Michael L. Flegel and Christian Kanzow
AN ABADIE-TYPE CONSTRAINT QUALIFICATION FOR MATHEMATICAL PROGRAMS WITH EQUILIBRIUM CONSTRAINTS Michael L. Flegel and Christian Kanzow University of Würzburg Institute of Applied Mathematics and Statistics
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationUnconstrained Optimization
1 / 36 Unconstrained Optimization ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University February 2, 2015 2 / 36 3 / 36 4 / 36 5 / 36 1. preliminaries 1.1 local approximation
More informationDownloaded 09/27/13 to Redistribution subject to SIAM license or copyright; see
SIAM J. OPTIM. Vol. 23, No., pp. 256 267 c 203 Society for Industrial and Applied Mathematics TILT STABILITY, UNIFORM QUADRATIC GROWTH, AND STRONG METRIC REGULARITY OF THE SUBDIFFERENTIAL D. DRUSVYATSKIY
More informationInequality Constraints
Chapter 2 Inequality Constraints 2.1 Optimality Conditions Early in multivariate calculus we learn the significance of differentiability in finding minimizers. In this section we begin our study of the
More informationSECTION C: CONTINUOUS OPTIMISATION LECTURE 9: FIRST ORDER OPTIMALITY CONDITIONS FOR CONSTRAINED NONLINEAR PROGRAMMING
Nf SECTION C: CONTINUOUS OPTIMISATION LECTURE 9: FIRST ORDER OPTIMALITY CONDITIONS FOR CONSTRAINED NONLINEAR PROGRAMMING f(x R m g HONOUR SCHOOL OF MATHEMATICS, OXFORD UNIVERSITY HILARY TERM 5, DR RAPHAEL
More information