MULTIPLE GRADIENT DESCENT ALGORITHM (MGDA) FOR MULTIOBJECTIVE OPTIMIZATION
1 Multiple Gradient Descent Algorithm (MGDA) for multiobjective optimization — description and numerical illustrations. Jean-Antoine Désidéri, INRIA Project-Team OPALE, Sophia Antipolis Méditerranée Centre. ISMP, International Symposium on Mathematical Programming, Berlin, August 19-24, 2012, session "PDE-Constrained Optimization and Multi-Level/Multi-Grid Methods"; ECCOMAS 2012, European Congress on Computational Methods in Applied Sciences and Engineering, Vienna (Austria), September 10-14, 2012, TO16: Computational inverse problems and optimization. 1 / 102
2 Outline
3 Pareto optimality. Design vector Y ∈ Υ ⊆ R^N; objective functions to be minimized (reduced): J_i(Y) (i = 1,…,n) (see e.g. K. Miettinen, Nonlinear Multiobjective Optimization, Kluwer Academic Publishers). Dominance in efficiency: design-point Y¹ dominates design-point Y² in efficiency, Y¹ ⪯ Y², iff J_i(Y¹) ≤ J_i(Y²) (∀i = 1,…,n) and at least one inequality is strict (a relationship of partial order). The designer's Holy Grail: the Pareto set, the set of non-dominated design-points, i.e. the Pareto-optimal solutions. The Pareto front: the image of the Pareto set in objective-function space; it bounds the domain of attainable performance.
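The dominance relation above translates directly into code; a minimal sketch in Python (function names are illustrative, not from the talk):

```python
def dominates(J1, J2):
    """True iff the design-point with objectives J1 dominates the one with J2:
    J1[i] <= J2[i] for all i, with at least one strict inequality."""
    return all(a <= b for a, b in zip(J1, J2)) and any(a < b for a, b in zip(J1, J2))

def pareto_set(points):
    """Non-dominated subset of a finite list of objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, `pareto_set([(1, 2), (2, 1), (2, 2)])` keeps only the first two points, since (2, 2) is dominated by (1, 2).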
4 Pareto front identification. Favorable situation: continuous and convex Pareto front; gradient-based optimization can then be used efficiently, if the number n of objective functions is not too large. Non-convex or discontinuous Pareto fronts do exist; evolutionary strategies or GAs are most commonly used for robustness: NSGA-II (Deb), PAES. From: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II, K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, IEEE Transactions on Evolutionary Computation, Vol. 6, No. 2, April 2002. Nondominated solutions with NSGA-II on KUR (test case by F. Kursawe, 1990), n = 3:
f₁(x) = Σ_{i=1}^{n−1} [ −10 exp( −0.2 √(x_i² + x_{i+1}²) ) ]
f₂(x) = Σ_{i=1}^{n} [ |x_i|^0.8 + 5 sin(x_i³) ]
5 Our general objective: construct a robust gradient-based method for Pareto-front identification. Note: in shape optimization, N is often the dimension of a parameterization, meant to be large as the dimension of a discretization usually is; thus, often, N ≫ n. However, the following theoretical results also apply to cases where n > N, which can occur in other engineering-optimization situations.
6 Outline
7 The basic question: constructing a gradient-based cooperative algorithm. Knowing the gradients ∇J_i(Y⁰) of n criteria, assumed to be smooth functions of the design vector Y ∈ R^N, at a given starting design-point Y⁰, can we define a nonzero vector ω ∈ R^N, in the direction of which the Fréchet derivatives of all criteria have the same sign: (∇J_i(Y⁰), ω) ≥ 0 (∀i = 1,2,…,n)? Answer: yes, if Y⁰ is not Pareto-optimal! Then ω locally defines a descent direction common to all criteria.
8 Notion of Pareto stationarity. Classically, for smooth real-valued functions of N variables: optimality (+ regularity) ⇒ stationarity. Now, for smooth R^n-valued functions of N variables, which stationarity requirement is to be enforced for Pareto-optimality? Proposition 1 (Pareto stationarity): the objective functions are Pareto-stationary at a design-point Y⁰, in an open domain B in which they are smooth and convex, if there exists a convex combination of the gradients equal to 0: α = {α_i} (i=1,…,n) s.t. Σ_{i=1}^n α_i ∇J_i(Y⁰) = 0, α_i ≥ 0 (∀i), Σ_{i=1}^n α_i = 1. Then (to be established subsequently): Pareto-optimality (+ regularity) ⇒ Pareto stationarity.
9 General principle. Proposition 2 (minimum-norm element in the convex hull): let {u_i} (i=1,…,n) be a family of n vectors in R^N, and U the so-called convex hull of this family, i.e. the following set of vectors: U = { u ∈ R^N : u = Σ_{i=1}^n α_i u_i ; α_i ≥ 0 (∀i) ; Σ_{i=1}^n α_i = 1 }. Then U admits a unique element of minimal norm, ω, and the following holds: ∀u ∈ U: (u, ω) ≥ ‖ω‖², in which (u, v) denotes the usual scalar product of the vectors u and v.
10 Existence and uniqueness of ω. Existence: U is closed. Uniqueness: U is convex. To establish uniqueness, let ω₁ and ω₂ be two realisations of the minimum: ω₁, ω₂ ∈ Argmin_{u∈U} ‖u‖. Then, since U is convex, ∀ε ∈ [0,1]: ω₁ + ε r₁₂ ∈ U (r₁₂ = ω₂ − ω₁) and ‖ω₁ + ε r₁₂‖ ≥ ‖ω₁‖. Square both sides and replace by scalar products: ∀ε ∈ [0,1]: (ω₁ + ε r₁₂, ω₁ + ε r₁₂) − (ω₁, ω₁) ≥ 0, i.e. 2ε(r₁₂, ω₁) + ε²(r₁₂, r₁₂) ≥ 0. As ε → 0⁺, this condition requires that (r₁₂, ω₁) ≥ 0; but then, for ε = 1: ‖ω₂‖² − ‖ω₁‖² = 2(r₁₂, ω₁) + (r₁₂, r₁₂) > 0 unless r₁₂ = 0, i.e. ω₂ = ω₁. Remark: the identification of the element ω is equivalent to the constrained minimization of a quadratic form in R^n, and not R^N.
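The minimum-norm element of Proposition 2 can be approximated numerically in several ways; the sketch below uses a Frank-Wolfe iteration on the simplex of weights α (an illustrative choice of method, not the parameterization used later in the talk):

```python
def min_norm_in_hull(vectors, iters=20000):
    """Approximate omega = argmin_{u in U} ||u|| over the convex hull U of
    `vectors`, via Frank-Wolfe on f(alpha) = ||sum_i alpha_i u_i||^2."""
    n, d = len(vectors), len(vectors[0])
    alpha = [1.0 / n] * n
    for t in range(iters):
        w = [sum(alpha[i] * vectors[i][k] for i in range(n)) for k in range(d)]
        # gradient of f w.r.t. alpha_i is 2*(u_i, w); the linear oracle picks
        # the hull vertex (a single u_i) with the smallest scalar product
        j = min(range(n), key=lambda i: sum(vectors[i][k] * w[k] for k in range(d)))
        gamma = 2.0 / (t + 2)
        alpha = [(1 - gamma) * a for a in alpha]
        alpha[j] += gamma
    return [sum(alpha[i] * vectors[i][k] for i in range(n)) for k in range(d)]
```

For u₁ = (1, 0) and u₂ = (0, 1) this converges to ω = (½, ½), and one can check the stated property (u_i, ω) ≥ ‖ω‖² on the result.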
11 First consequence: ∀u ∈ U: (u, ω) ≥ ‖ω‖². Proof: let u ∈ U, and r = u − ω. ∀ε ∈ [0,1]: ω + εr ∈ U (convexity); hence ‖ω + εr‖² − ‖ω‖² = 2ε(r, ω) + ε²(r, r) ≥ 0, and as ε → 0⁺ this requires that (r, ω) = (u − ω, ω) ≥ 0.
12 Proof of Pareto stationarity at a Pareto-optimal design-point. Y⁰ ∈ B (open ball) ⊆ R^N. Hypothesis: {J_i(Y)} (∀i ≤ n) smooth and convex in B; u_i = ∇J_i(Y⁰) (∀i ≤ n). Without loss of generality, assume J_i(Y⁰) = 0 (∀i). Then: Y⁰ = Argmin_Y J_n(Y) subject to J_i(Y) ≤ 0 (∀i ≤ n−1). (1) Let U_{n−1} be the convex hull of the gradients {u₁, u₂, …, u_{n−1}} and ω_{n−1} = Argmin_{u ∈ U_{n−1}} ‖u‖. The vector ω_{n−1} exists, is unique, and such that (u_i, ω_{n−1}) ≥ ‖ω_{n−1}‖² (∀i ≤ n−1). Two possible situations: 1. ω_{n−1} = 0, and the objective functions {J₁, J₂, …, J_{n−1}} satisfy the Pareto-stationarity condition at Y = Y⁰; a fortiori, the condition is also satisfied by the whole set of objective functions. 2. Otherwise ω_{n−1} ≠ 0. Let j_i(ε) = J_i(Y⁰ − ε ω_{n−1}) (i = 1,…,n−1), so that j_i(0) = 0 and j_i′(0) = −(u_i, ω_{n−1}) ≤ −‖ω_{n−1}‖² < 0; then for sufficiently small strictly-positive ε: j_i(ε) = J_i(Y⁰ − ε ω_{n−1}) < 0 (∀i ≤ n−1), and Slater's qualification condition is satisfied for (1). Thus the Lagrangian L = J_n(Y) + Σ_{i=1}^{n−1} λ_i J_i(Y) is stationary w.r.t. Y at Y = Y⁰: u_n + Σ_{i=1}^{n−1} λ_i u_i = 0, in which λ_i ≥ 0 (∀i ≤ n−1) since the constraints hold with equality, J_i(Y⁰) = 0 (KKT conditions). Normalizing this equation results in the Pareto-stationarity condition.
13 R^N: affine and vector structures — notation. The two are usually not distinguished. When a distinction is necessary, a dotted symbol is used for an affine subset/subspace; in particular, for the convex hull, U̇ denotes the set of points that the vectors u ∈ U, drawn from a given origin O, point to. Rigorously speaking, the term "convex hull" applies to U̇.
14 R^N, affine and vector structures: a case where O ∉ U̇. [Figure: convex hulls — the affine hull U̇ in Ȧ₂ versus the vector hull U, with vectors u₁, u₂, u₃ drawn from the origin O.]
15 Second consequence. Note that ∀u ∈ U: u − u_n = Σ_{i=1}^n α_i u_i − (Σ_{i=1}^n α_i) u_n = Σ_{i=1}^{n−1} α_i u_{n,i} (u_{n,i} = u_i − u_n). Thus U ⊆ A_{n−1} (or, in affine space, U̇ ⊆ Ȧ_{n−1}), where A_{n−1} is a set of vectors pointing to an affine subspace Ȧ_{n−1} of dimension at most n−1. Define the orthogonal projection O′ of O onto Ȧ_{n−1}; then OO′ ⊥ Ȧ_{n−1} ⊇ U̇. If O′ ∈ U̇, then ω = OO′, and in this case: ∀u ∈ U: (u, ω) = const. = ‖ω‖².
16 Second consequence — proof. Since O′ is the orthogonal projection of O onto Ȧ_{n−1}, OO′ ⊥ Ȧ_{n−1}, and in particular OO′ ⊥ U. If additionally O′ ∈ U̇, the vector OO′ is the minimum-norm element of U: ω = OO′. In this case, ω can be calculated by ignoring the inequality constraints α_i ≥ 0 (∀i), which are automatically satisfied by the solution. Hence ω realizes the minimum of the quadratic form ‖Σ_{i=1}^n α_i u_i‖² subject only to the equality constraint Σ_{i=1}^n α_i = 1. The Lagrangian is formed with a unique multiplier λ: L(α, λ) = ½ (Σ_{i=1}^n α_i u_i, Σ_{i=1}^n α_i u_i) + λ (Σ_{i=1}^n α_i − 1). The stationarity condition requires, ∀i: ∂L/∂α_i = (u_i, ω) + λ = 0 ⇒ (u_i, ω) = −λ = const. = ‖ω‖². By convex combination, the result extends straightforwardly to the whole of U.
17 Application to the case of gradients. Suppose u_i = ∇J_i(Y⁰) (∀i); then: either ω ≠ 0, and all criteria admit positive Fréchet derivatives in the direction of ω (all equal if ω belongs to the interior of U); or ω = 0, and the current design-point Y⁰ is Pareto-stationary: ∃{α_i} (i=1,…,n), α_i ≥ 0 (∀i), Σ_{i=1}^n α_i = 1, such that Σ_{i=1}^n α_i u_i = 0.
18 Proposition 3 (common descent direction). By virtue of Propositions 1 and 2, in the particular case where u_i = J_i′ = ∇J_i(Y⁰)/S_i (S_i: user-supplied scaling constant; S_i > 0), two situations are possible at Y = Y⁰: either ω = 0, and the design-point Y⁰ is Pareto-stationary (or Pareto-optimal); or ω ≠ 0, and ω defines a descent direction common to all criteria {J_i} (i=1,…,n); additionally, if ω lies in the interior of U, the scalar products (u, ω) (u ∈ U) and the Fréchet derivatives (u_i, ω) are all equal to ‖ω‖². MGDA: substitute the vector ω for the single-criterion gradient in the steepest-descent method. Proposition 4 (convergence): certain normalization provisions being made, in R^N, the Multiple Gradient Descent Algorithm (MGDA) converges to a Pareto-stationary point.
19 Convergence proof. Define the criteria to be positive (possibly apply an exp-transform) and coercive (possibly add an exploding term outside of the working ball): assume these criteria continuous, with J_i(Y) → ∞ as ‖Y‖ → ∞ (∀i). If the iteration stops in a finite number of steps, a Pareto-stationary point has been reached. Otherwise, the iteration continues indefinitely, generating an infinite sequence of design-points {Y_k}. The corresponding sequences of criteria are infinite, positive and monotone decreasing; hence bounded. Hence the sequence of iterates {Y_k} is itself bounded and admits a subsequence converging to, say, Y̅. Necessarily, Y̅ is a Pareto-stationary point. To establish this, assume instead that the ω corresponding to Y̅ is nonzero. Then for each criterion there exists a stepsize ρ for which the variation is finite. These criteria are in finite number; hence the smallest such ρ will cause a finite variation of all criteria, in contradiction with the fact that only infinitely-small variations of the criteria are realized from Y̅.
20 Remark 1: practical determination of vector ω. ω = Argmin_{u∈U} ‖u‖, U = { u ∈ R^N : u = Σ_{i=1}^n α_i u_i ; α_i ≥ 0 (∀i) ; Σ_{i=1}^n α_i = 1 }, to be solved in U ⊆ R^N. Usually, but not necessarily, N ≫ n. Parameterization of the convex hull: let α_i = σ_i² (i = 1,…,n) to satisfy trivially the inequality constraints (α_i ≥ 0, ∀i), and transform the equality constraint Σ_{i=1}^n α_i = 1 into Σ_{i=1}^n σ_i² = 1, i.e. σ = (σ₁, σ₂, …, σ_n) ∈ S_n, where S_n is the unit sphere of R^n — and precisely not of R^N.
21 Determining vector ω (cont'd). The sphere is easily parameterized using trigonometric functions of n−1 independent arcs φ₁, φ₂, …, φ_{n−1}:
σ₁ = cos φ₁ · cos φ₂ · cos φ₃ ⋯ cos φ_{n−1}
σ₂ = sin φ₁ · cos φ₂ · cos φ₃ ⋯ cos φ_{n−1}
σ₃ = 1 · sin φ₂ · cos φ₃ ⋯ cos φ_{n−1}
…
σ_{n−1} = sin φ_{n−2} · cos φ_{n−1}
σ_n = sin φ_{n−1}
(consider only φ_i ∈ [0, π/2], ∀i, and set φ₀ = π/2). Let c_i = cos² φ_i and get (i = 1,…,n):
α₁ = c₁ · c₂ · c₃ ⋯ c_{n−1}
α₂ = (1−c₁) · c₂ · c₃ ⋯ c_{n−1}
α₃ = 1 · (1−c₂) · c₃ ⋯ c_{n−1}
…
α_{n−1} = (1−c_{n−2}) · c_{n−1}
α_n = (1−c_{n−1})
(c₀ = 0, and c_i ∈ [0,1] for all i ≥ 1)
22 Determining vector ω (end). The convex hull is thus parameterized in [0,1]^{n−1} (n: number of criteria), independently of N (dimension of design space). For example, with n = 4 criteria: α₁ = c₁c₂c₃, α₂ = (1−c₁)c₂c₃, α₃ = (1−c₂)c₃, α₄ = (1−c₃), with (c₁, c₂, c₃) ∈ [0,1]³.
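The c-parameterization above, together with a brute-force grid search over [0,1]^{n−1} (the procedure the talk later applies with four criteria), can be sketched as follows; names and the grid resolution are illustrative:

```python
import itertools

def alphas_from_c(c):
    """Map (c_1, ..., c_{n-1}) in [0,1]^{n-1} to simplex weights:
    alpha_i = (1 - c_{i-1}) * c_i * c_{i+1} * ... * c_{n-1}, with c_0 := 0."""
    n = len(c) + 1
    alphas = []
    for i in range(1, n + 1):
        a = 1.0 if i == 1 else (1.0 - c[i - 2])
        for k in range(i - 1, n - 1):   # factors c_i ... c_{n-1}
            a *= c[k]
        alphas.append(a)
    return alphas

def omega_by_grid(vectors, steps=20):
    """Brute-force search for the minimum-norm hull element on a uniform
    grid over (c_1, ..., c_{n-1}); the best set of coefficients is retained."""
    n, d = len(vectors), len(vectors[0])
    grid = [j / steps for j in range(steps + 1)]
    best, best_norm2 = None, float("inf")
    for c in itertools.product(grid, repeat=n - 1):
        al = alphas_from_c(c)
        w = [sum(al[i] * vectors[i][k] for i in range(n)) for k in range(d)]
        norm2 = sum(x * x for x in w)
        if norm2 < best_norm2:
            best, best_norm2 = w, norm2
    return best
```

Note the cost grows as (steps+1)^{n−1}, i.e. with the number of criteria n, but is independent of the design-space dimension N.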
23 Remark 2: appropriate normalization of the gradients may reveal essential: u_i = ∇J_i(Y⁰)/S_i. Case n = 2, without gradient normalization: ω = OO′, unless the angle between u₁ and u₂ is acute and the norms ‖u₁‖, ‖u₂‖ are very different; in that case, ω equals the vector of smaller norm. [Figure: configurations with ω = OO′ versus ω = u₂.] With normalization: the equality ω = OO′ holds automatically ⇒ equal Fréchet derivatives. General case: if the vectors {u_i} are NOT normalized, those of smaller norm are more influential in determining the direction of the vector ω.
24 Normalizations being examined. Recall: u_i = J_i′ = ∇J_i(Y⁰)/S_i (S_i > 0) ⇒ for δY = −ρω: δJ_i = (∇J_i, δY) = −ρ S_i (u_i, ω), with (u_i, ω) = const. if ω does not lie on the edge of the convex hull. Standard: u_i = ∇J_i(Y⁰)/‖∇J_i(Y⁰)‖. Equal logarithmic variations (whenever ω is not on an edge): u_i = ∇J_i(Y⁰)/J_i(Y⁰). Newton-inspired, when lim J_i = 0: u_i = J_i(Y⁰) ∇J_i(Y⁰)/‖∇J_i(Y⁰)‖². Newton-inspired, when lim J_i ≠ 0: u_i = max(J_i^{(k−1)} − J_i^{(k)}, δ) ∇J_i(Y⁰)/‖∇J_i(Y⁰)‖² (k: iteration no., δ > 0 small).
25 Is standard normalization sufficient to guarantee that ω = OO′ (yielding equal Fréchet derivatives)? No, if n > 2. [Figure: unit sphere with u₁, u₂, u₃; the projection O′ of O onto Ȧ₂ lies outside U̇, OO′ ∉ U.]
26 Experimenting (from Adrien Zerbinati, on-going doctoral thesis). Basic testcase: minimize the functions f(x,y) = 4x² + y² + xy and g(x,y) = (x−1)² + 3(y−1)².
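On this two-criterion testcase the whole MGDA loop fits in a few lines, since for n = 2 the minimum-norm element of the segment [u₁, u₂] has a closed form. A minimal sketch (the fixed stepsize is an assumption for illustration, not the talk's stepsize rule):

```python
def mgda_2criteria(grad_f, grad_g, y0, rho=0.02, iters=2000):
    """MGDA for two criteria: omega is the minimum-norm element of the
    segment [u1, u2]; the design-point steps along -rho * omega."""
    y = list(y0)
    for _ in range(iters):
        u1, u2 = grad_f(y), grad_g(y)
        d = [a - b for a, b in zip(u1, u2)]
        dd = sum(x * x for x in d)
        # argmin over t in [0,1] of ||(1-t) u1 + t u2||
        t = 0.0 if dd == 0 else max(0.0, min(1.0, sum(di * a for di, a in zip(d, u1)) / dd))
        omega = [(1 - t) * a + t * b for a, b in zip(u1, u2)]
        y = [yi - rho * wi for yi, wi in zip(y, omega)]
    return y

# the talk's basic testcase
f = lambda x, y: 4 * x**2 + y**2 + x * y
g = lambda x, y: (x - 1)**2 + 3 * (y - 1)**2
grad_f = lambda p: [8 * p[0] + p[1], 2 * p[1] + p[0]]
grad_g = lambda p: [2 * (p[0] - 1), 6 * (p[1] - 1)]
```

Starting e.g. from (0.5, 2), one of the initial points on the next slide, both criteria decrease monotonically and the iterates approach a point where ω → 0, i.e. a Pareto-stationary point.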
27 Basic testcase: convergence from different initial design-points — (0.5, 2), (1.5, 2.5), (−1.5, −2.5). [Figures: trajectories in design space and in function space.]
28 Fonseca testcase. Minimize two functions of three variables: f_{1,2}(x₁, x₂, x₃) = 1 − exp( −Σ_{i=1}^{3} (x_i ± 1/√3)² ).
29 Fonseca testcase (cont'd): convergence from initial design-points over a sphere. [Figures: design space and function space — continuous but nonconvex front.]
30 Fonseca testcase (end): comparison with the Pareto Archived Evolution Strategy (PAES). [Figures: design space and function space; MGDA iterates versus PAES generations (PAES 2×50 and 6×12).]
31 Outline
32 Wing-shape optimization: wave-drag minimization in conjunction with lift maximization; metamodel-assisted MGDA (Adrien Zerbinati). Wing-shape geometry extruded from an airfoil geometry (initial airfoil: NACA0012), 10 parameters. Two-point/two-objective optimization: 1. subsonic Eulerian flow (M = 0.3, α = 8°) for C_L maximization; 2. transonic Eulerian flow (M = 0.83, α = 2°) for C_D minimization. An initial database of 40 design-points evolves as follows: 1. evaluate each new point by two 3D Eulerian flow computations; 2. construct surrogate models for C_L and C_D (using the entire dataset) and apply MGDA to convergence; 3. enrich the database with new points, if appropriate. After 6 loops, the database comprises 227 design-points (554 flow computations).
33 Wing-shape optimization 2: convergence of the dataset over 6 loops. [Figure: −lift coefficient (subsonic) versus drag coefficient (transonic); initial database and steps 1-6.]
34 Wing-shape optimization 3: low-drag quasi-Pareto-optimal shape. [Figures: subsonic, M = 0.3, α = 8°; transonic, M = 0.83, α = 2°.]
35 Wing-shape optimization 4: high-lift quasi-Pareto-optimal shape. [Figures: subsonic, M = 0.3, α = 8°; transonic, M = 0.83, α = 2°.]
36 Wing-shape optimization 5: intermediate quasi-Pareto-optimal shape. [Figures: subsonic, M = 0.3, α = 8°; transonic, M = 0.83, α = 2°.]
37 Outline
38 Description of the test case. In-house CFD solver (Num3sis): compressible Navier-Stokes (FV/FE), parallel computing (MPI); CPU cost: 4 h on 32 cores; CAD and mesh by GMSH. Test case: Mach number 0.1; Reynolds number Re ≈ 2800; two-criterion shape optimization; fine mesh (some 700,000 points).
39 First objective function: velocity variance. U_mean = (1/V_tot) Σ_{cel : d(cel, P_o) ≤ ε} u(cel) Vol(cel); σ²_{vel,x} = (1/V_tot) Σ_{cel : d(cel, P_o) ≤ ε} (U_{mean,x} − u_x(cel))² Vol(cel); σ² = σ²_{vel,x} + σ²_{vel,y} + σ²_{vel,z}.
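The volume-weighted variance above can be sketched in a few lines; the cell data layout (a list of (velocity, volume, distance-to-probe) tuples) is a hypothetical stand-in for the solver's mesh structures:

```python
def velocity_variance(cells, eps):
    """Volume-weighted variance of the velocity over the cells lying within
    distance eps of the probe point P_o. Each cell is a hypothetical tuple
    (u, vol, dist) with u a 3-component velocity vector."""
    sel = [(u, vol) for (u, vol, dist) in cells if dist <= eps]
    vtot = sum(vol for _, vol in sel)
    umean = [sum(u[k] * vol for u, vol in sel) / vtot for k in range(3)]
    # sigma^2 = sigma_x^2 + sigma_y^2 + sigma_z^2, each volume-weighted
    return sum(
        sum((umean[k] - u[k]) ** 2 * vol for u, vol in sel) / vtot
        for k in range(3)
    )
```

For instance, two equal-volume cells with x-velocities 1 and 3 near the probe give U_mean,x = 2 and σ² = 1.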
40 Second objective function: pressure loss. With p_k = p_mean and u_k = u_mean for the inlet (k = i) or outlet (k = o): Δp = p_i − p_o + ρ_i u_i²/2 − ρ_o u_o²/2.
41 Design parameters. [Figure.]
42 MGDA on the surrogate model. [Figure.]
44 Identifying the Pareto set. [Figure: pressure loss versus velocity variance; nominal point, initial database, 3rd/6th/9th steps, initial and final Pareto fronts.] Figure caption: metamodel-assisted MGDA, step-by-step evolution; initial and final non-dominated sets; 223 solver calls.
45 Velocity-magnitude streamlines — shape comparison. 1: best velocity variance; 2: nominal shape; 3: best pressure loss. [Figures.]
46 Velocity magnitude in a section just after the bend. 1: best velocity variance; 2: nominal shape; 3: best pressure loss. [Figures; section position shown.]
47 Velocity magnitude in the output section. 1: best velocity variance; 2: nominal shape; 3: best pressure loss. [Figures; section position shown.]
48 Nominal versus lowest-velocity-variance shape. [Figure.]
49 Nominal versus lowest-pressure-loss shape. [Figure.]
50 Lowest-velocity-variance versus lowest-pressure-loss shape. [Figure.]
51 Outline
52 Dirichlet problem: −Δu = f over Ω = [−1,1] × [−1,1], u = 0 on Γ = ∂Ω. Discrete solution by a standard 2nd-order finite-difference method over a quadrangular mesh. [Figures: u_h surface and contour lines.]
53 Domain partitioning: function-value controls at 4 interfaces, yielding 4 sub-domain Dirichlet problems with 2 controlled interfaces each. [Figure.]
54 Interface jumps — formal expression. Over γ₁ (0 ≤ x ≤ 1; y = 0): s₁(x) = ∂u/∂y(x, 0⁺) − ∂u/∂y(x, 0⁻) = [∂u₁/∂y − ∂u₄/∂y](x, 0); over γ₂ (x = 0; 0 ≤ y ≤ 1): s₂(y) = ∂u/∂x(0⁺, y) − ∂u/∂x(0⁻, y) = [∂u₁/∂x − ∂u₂/∂x](0, y); over γ₃ (−1 ≤ x ≤ 0; y = 0): s₃(x) = ∂u/∂y(x, 0⁺) − ∂u/∂y(x, 0⁻) = [∂u₂/∂y − ∂u₃/∂y](x, 0); over γ₄ (x = 0; −1 ≤ y ≤ 0): s₄(y) = ∂u/∂x(0⁺, y) − ∂u/∂x(0⁻, y) = [∂u₄/∂x − ∂u₃/∂x](0, y). Approximation by 2nd-order one-sided finite differences.
55 Interface functionals and matching condition. Interface functionals: J_i = ½ ∫_{γ_i} s_i² w dγ_i := J_i(v); that is, explicitly: J₁ = ½ ∫₀¹ s₁(x)² w(x) dx, J₂ = ½ ∫₀¹ s₂(y)² w(y) dy, J₃ = ½ ∫₋₁⁰ s₃(x)² w(x) dx, J₄ = ½ ∫₋₁⁰ s₄(y)² w(y) dy (w(t), t ∈ [0,1], is an optional weighting function, and w(−t) = w(t)). Matching condition: J₁ = J₂ = J₃ = J₄ = 0.
56 Adjoint problems: eight Dirichlet sub-problems. Δp_i = 0 (Ω_i), p_i = 0 (∂Ω_i \ γ_i), p_i = s_i w (γ_i); Δq_i = 0 (Ω_i), q_i = 0 (∂Ω_i \ γ_{i+1}), q_i = s_{i+1} w (γ_{i+1}). Green's formula: ∫_{γ_i} s_i (∂u_i/∂n) w = ∫_{γ_i} (∂p_i/∂n) v_i + ∫_{γ_{i+1}} (∂p_i/∂n) v_{i+1}, and ∫_{γ_{i+1}} s_{i+1} (∂u_i/∂n) w = ∫_{γ_i} (∂q_i/∂n) v_i + ∫_{γ_{i+1}} (∂q_i/∂n) v_{i+1}.
57 Functional gradients. Differentiating, J₁′ = ∫₀¹ s₁(x) s₁′(x) w(x) dx, and similarly for J₂′, J₃′, J₄′; via the adjoint states, each gradient is expressed through the adjoint normal derivatives at the interfaces, e.g. ∂(p₁ − q₄)/∂y (x, 0), ∂p₁/∂x (0, y), ∂q₄/∂x (0, y), etc. (the detailed four-term expressions are garbled in this transcription). Conclusion: J_i′ = Σ_{j=1}^{4} ∫_{γ_j} G_{i,j} v_j′ dγ_j (i = 1,…,4); {G_{i,j}}: partial gradients.
58 Other technical details. Dirichlet sub-problems: all 12 sub-problems (4 direct, 4×2 adjoint) are solved by direct inversion (discrete sine-transform): u_h = (Ω_X ⊗ Ω_Y)(Λ_X ⊕ Λ_Y)⁻¹(Ω_X ⊗ Ω_Y) f_h. Integrals are approximated by the trapezoidal rule.
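The trapezoidal-rule evaluation of the interface functionals J_i = ½ ∫ s_i² w dγ of slide 55 can be sketched as follows, assuming nodal values on a uniform grid of spacing h:

```python
def interface_functional(s, w, h):
    """Trapezoidal-rule approximation of J = (1/2) * integral of s(x)^2 w(x) dx
    over a uniform grid of spacing h; s and w are nodal value lists."""
    vals = [0.5 * si * si * wi for si, wi in zip(s, w)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

With s(x) = x, w ≡ 1 on [0,1] and 11 nodes, this returns ≈ 1/6 = ½ ∫₀¹ x² dx, with the O(h²) error expected of the rule.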
59 Reference method, applied to the agglomerated criterion J = Σ_{i=1}^4 J_i, with gradient J′ = Σ_{i=1}^4 J_i′. Iteration: v^(l+1) = v^(l) − ρ_l J′^(l). The stepsize is set so that the predicted decrease δJ = J′^(l) · δv^(l) = ρ_l ‖J′^(l)‖² equals ε J^(l), by fixing ρ_l = ε J^(l) / ‖J′^(l)‖².
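This adaptive stepsize rule can be sketched on a toy criterion (an illustrative quadratic, not the Dirichlet functional of the talk):

```python
def descend(J, gradJ, v0, eps=1.0, iters=20):
    """Steepest descent with the stepsize rule rho = eps * J / ||J'||^2,
    which targets a predicted decrease of eps * J per iteration."""
    v = list(v0)
    for _ in range(iters):
        g = gradJ(v)
        gg = sum(x * x for x in g)
        if gg == 0.0:
            break
        rho = eps * J(v) / gg
        v = [vi - rho * gi for vi, gi in zip(v, g)]
    return v

J = lambda v: v[0] ** 2 + v[1] ** 2
gradJ = lambda v: [2 * v[0], 2 * v[1]]
```

For this quadratic the rule gives ρ = 1/4 of the distance scale, each step halves v, and J is divided by 4 per iteration; the linear model predicts a full decrease ε J but the curvature term recovers half of it.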
60 Quasi-Newton steepest descent, ε = 1: convergence history, asymptotic and global. [Figures: J₁, J₂, J₃, J₄ and J = Σ J_i versus iteration, log scale.]
61 Quasi-Newton steepest descent, ε = 1: history of the gradients ∂J/∂v₁ and ∂J/∂v₂ over 200 iterations. [Figures: discretized functional gradients of J = Σ_i J_i w.r.t. v₁ and v₂.]
62 Quasi-Newton steepest descent, ε = 1: history of the gradients ∂J/∂v₃ and ∂J/∂v₄ over 200 iterations. [Figures: discretized functional gradients of J = Σ_i J_i w.r.t. v₃ and v₄.]
63 Quasi-Newton steepest descent, ε = 1: discrete solution. [Figures: u_h(x, y) surface and contour lines.]
64 Practical determination of the minimum-norm element: ω = Σ_{i=1}^4 α_i u_i, u_i = ∇J_i(Y⁰), using the following parameterization of the convex hull: α₁ = c₁c₂c₃, α₂ = (1−c₁)c₂c₃, α₃ = (1−c₂)c₃, α₄ = (1−c₃), where (c₁, c₂, c₃) ∈ [0,1]³ are discretized by a uniform step. The best set of coefficients is retained.
65 Stepsize. Iteration: v^(l+1) = v^(l) − ρ_l ω^(l). The stepsize is set so that δJ = J′^(l) · δv^(l) = ρ_l (J′^(l) · ω) equals ε J^(l), by fixing ρ_l = ε J^(l) / (J′^(l) · ω). In practice: ε = 1.
66 MGDA, ε = 1: asymptotic convergence. [Figures: unscaled, u_i = ∇J_i, versus scaled, u_i = ∇J_i / J_i.]
67 MGDA, ε = 1: global convergence. [Figures: unscaled, u_i = ∇J_i, versus scaled, u_i = ∇J_i / J_i.]
68 First conclusions. Scaling is essential; convergence is somewhat deceiving. Who is to blame? — the insufficiently accurate determination of ω; the non-optimality of the scaling of the gradients; the non-optimality of the step-size, the parameter ε being maintained equal to 1 throughout; the large dimension of the design space, here 76 (4 interfaces, each associated with 19 d.o.f.'s).
69 MGDA II: a direct algorithm, valid when the gradients are linearly independent. User-supplied scaling factors {S_i} (i=1,…,n), and scaled gradients J_i′ = ∇J_i(Y⁰)/S_i (S_i > 0; e.g. S_i = J_i for logarithmic gradients). Perform Gram-Schmidt orthogonalization with a special normalization: set u₁ = J₁′; for i = 2,…,n, set u_i = (J_i′ − Σ_{k<i} c_{i,k} u_k) / A_i, where for k < i: c_{i,k} = (J_i′, u_k)/(u_k, u_k), and A_i = 1 − Σ_{k<i} c_{i,k} if nonzero, ε_i otherwise (ε_i arbitrary, but small).
70 Consequences: the element ω always lies in the interior of the convex hull: ω = Σ_{i=1}^n α_i u_i, with α_i = (1/‖u_i‖²) / (Σ_{j=1}^n 1/‖u_j‖²) = 1 / (‖u_i‖² Σ_{j=1}^n ‖u_j‖⁻²). This implies equal projections of ω onto the u_k: ∀k: (u_k, ω) = α_k ‖u_k‖² = ‖ω‖², and finally (J_i′, ω) = ‖ω‖² (∀i) (with a possibly-modified scale S_i′ = (1 + ε_i) S_i), that is, the same positive constant: EQUAL FRÉCHET DERIVATIVES FORMED WITH SCALED GRADIENTS.
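The direct computation of slides 69-70 can be sketched compactly (Python; assumes linearly independent scaled gradients, with the ε fallback for a vanishing A_i):

```python
def mgda2_omega(grads, eps=1e-3):
    """Direct MGDA II: Gram-Schmidt with the special normalization
    u_i = (J_i' - sum_{k<i} c_ik u_k) / A_i, A_i = 1 - sum_{k<i} c_ik
    (or eps if that vanishes), then omega = sum_i alpha_i u_i with
    alpha_i proportional to 1/||u_i||^2."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    us = [list(grads[0])]
    for gi in grads[1:]:
        cs = [dot(gi, u) / dot(u, u) for u in us]
        A = 1.0 - sum(cs)
        if A == 0.0:
            A = eps
        ui = [(gik - sum(c * u[k] for c, u in zip(cs, us))) / A
              for k, gik in enumerate(gi)]
        us.append(ui)
    inv = [1.0 / dot(u, u) for u in us]
    alphas = [x / sum(inv) for x in inv]
    return [sum(a * u[k] for a, u in zip(alphas, us)) for k in range(len(us[0]))]
```

With already-orthogonal scaled gradients u₁ = (1, 0), u₂ = (0, 2): α = (0.8, 0.2), ω = (0.8, 0.4), and both Fréchet derivatives (u_i, ω) equal ‖ω‖² = 0.8, as Proposition 2 of slide 70 predicts.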
71 Geometrical demonstration. Given J_i′ = ∇J_i(Y⁰)/S_i, perform the G.-S. process to compute u_i = [J_i′ − Σ_{k<i} c_{i,k} u_k]/A_i; then O′ ∈ U̇ ⇒ ω = OO′ ⇒ (u₁, ω) = (u₂, ω) = (u₃, ω) = ‖ω‖². Since J_i′ = A_i u_i + Σ_{k<i} c_{i,k} u_k: (J_i′, ω) = (A_i + Σ_{k<i} c_{i,k}) ‖ω‖² = ‖ω‖² = const., the factor A_i + Σ_{k<i} c_{i,k} being equal to 1 by the normalization through A_i. [Figure: u₁, u₂, u₃, O, O′.]
72 MGDA II b: automatic rescale. Do not let the constant A_i become negative; instead, define A_i = 1 − Σ_{k<i} c_{i,k} only if this number is strictly positive. Otherwise, use the modified scale S_i′ = (Σ_{k<i} c_{i,k}) S_i ("automatic rescale"), so that c_{i,k}′ = c_{i,k} / (Σ_{k<i} c_{i,k}), whence Σ_{k<i} c_{i,k}′ = 1, and set A_i = ε_i, for some small ε_i. Same formal conclusion: the Fréchet derivatives are equal; but the value is much larger, and (at least) one criterion has been rescaled.
73 Some open questions. Is ω_II = ω? Yes, if n = 2 and the angle is obtuse; no, in general. Is the implication lim ω_II = 0 ⇒ lim ω = 0 true (assuming the limit is the result of the iteration)? In which order should the Gram-Schmidt process be performed? Etc.
74 MGDA II, ε = 1: global convergence. [Figures: unscaled, u_i = ∇J_i, versus scaled, u_i = ∇J_i / J_i.]
75 MGDA II b, ε = 1: global convergence. [Figures: unscaled, u_i = ∇J_i, automatic rescale on, versus scaled, u_i = ∇J_i / J_i, automatic rescale on.]
76 MGDA II: recap. Convergence over 500 iterations. [Figures: unscaled, u_i = ∇J_i, automatic rescale off, versus scaled, u_i = ∇J_i / J_i, automatic rescale on.]
77 Outline
78 MGDA III (Strelitzia reginae). [Illustration.]
79 MGDA III (Strelitzia reginae). [Illustration.]
80 MGDA III (Strelitzia reginae). [Illustration: vectors u₁, ω, u₂.]
81 MGDA III (Strelitzia reginae). [Illustration: vector ω.]
82 MGDA III — definition 1. Start from the scaled gradients {J_i′} and duplicates {g_i}: J_i′ = ∇J_i(Y⁰)/S_i (S_i: user-supplied scale), and proceed in 3 steps: A, B and C. A: Initialization. Set (justified afterwards) u₁ := g₁ = J_k′, where k = Argmax_i min_j (J_j′, J_i′)/(J_i′, J_i′). Set the n × n lower-triangular array c = {c_{i,j}} (i ≥ j) to 0. Set, conservatively, I := n (expected number of computed orthogonal basis vectors). Assign some appropriate value to a cut-off constant a, 0 ≤ a < 1. (Note: the main diagonal of array c is to contain cumulative row-sums.)
83 MGDA III — definition 2. B: Main G.-S. loop; for i = 2, 3, …, (at most) n, do: 1. Calculate the (i−1)-st column of coefficients, c_{j,i−1} = (g_j, u_{i−1})/(u_{i−1}, u_{i−1}) (∀j = i,…,n), and update the cumulative row-sums: c_{j,j} := c_{j,j} + c_{j,i−1} = Σ_{k<i} c_{j,k} (∀j = i,…,n). 2. Test: if the condition c_{j,j} > a (∀j = i,…,n) is satisfied, set I := i−1 and interrupt the Gram-Schmidt process (go to 3.). Otherwise, compute the next orthogonal vector u_i as follows:
84 MGDA III — definition 3. - Identify the index l = Argmin_j {c_{j,j} : i ≤ j ≤ n} (note that c_{l,l} ≤ a < 1). - Permute the information associated with i and l: g-vectors g_i ↔ g_l, rows i and l of array c, and cumulative row-sums c_{i,i} ↔ c_{l,l}. - Set A_i = 1 − c_{i,i} ≥ 1 − a > 0 (c_{i,i} = former c_{l,l} ≤ a), and calculate u_i = (g_i − Σ_{k<i} c_{i,k} u_k)/A_i (in which g_i = former g_l, c_{i,k} = former c_{l,k}). - If u_i ≠ 0, return to 1. with incremented i; otherwise g_i = Σ_{k<i} c_{i,k} u_k = Σ_{k<i} c_{i,k}′ g_k, where the {c_{i,k}′} are calculated by backward substitution. Then, if c_{i,k}′ ≥ 0 (∀k < i): Pareto stationarity detected — STOP the iteration; otherwise (ambiguous exceptional case): STOP the G.-S. process, compute the original ω and go to C.
85 MGDA III — definition 4. 3. Calculate ω as the minimum-norm element in the convex hull of {u₁, u₂, …, u_I}: ω = Σ_{i=1}^I α_i u_i, α_i = (1/‖u_i‖²) / (Σ_{j=1}^I 1/‖u_j‖²). (Note that here all computed u_i ≠ 0, and 0 < α_i < 1.) C: If ‖ω‖ < TOL, STOP the iteration; otherwise, perform a descent step and return to B.
86 MGDA III — consequences. If I = n: MGDA III = MGDA II b + automatic ordering in the Gram-Schmidt process, making the rescale unnecessary since, by construction, ∀i: A_i ≥ 1 − a > 0. Otherwise, I < n (incomplete Gram-Schmidt process). First I Fréchet derivatives: (g_i, ω) = (u_i, ω) = ‖ω‖² > 0 (∀i = 1,…,I). Subsequent ones (i > I): with ω = Σ_{j=1}^I α_j u_j and g_i = Σ_{k=1}^I c_{i,k} u_k + v_i, v_i ⊥ {u₁, u₂, …, u_I}: (g_i, ω) = Σ_{k=1}^I c_{i,k} (u_k, ω) = Σ_{k=1}^I c_{i,k} ‖ω‖² = c_{i,i} ‖ω‖² > a ‖ω‖² > 0.
87 Choice of g₁ — justification. At initialization, we have set u₁ := g₁ = J_k′, where k = Argmax_i min_j (J_j′, J_i′)/(J_i′, J_i′). This is equivalent to maximizing c_{2,1} = c_{2,2}, that is, maximizing the least cumulative row-sum, at first estimation. (Recall that the favorable situation is when all cumulative row-sums are positive, or > a.)
88 Expected benefits. Automatic ordering and rescale. When the gradients exhibit a general trend, ω is found in fewer steps and has a larger norm. A larger ‖ω‖ implies larger directional derivatives, (g_i, ω) = ‖ω‖² or > a ‖ω‖², and hence greater efficiency of the descent step.
89 Outline
90 Addressing the question of scaling when Hessians are known. For the objective function J_i(Y) alone, Newton's iteration writes Y₁ = Y⁰ − p_i, with H_i p_i = ∇J_i(Y⁰). Split p_i into orthogonal components p_i = q_i + r_i, where q_i = [(p_i, ∇J_i(Y⁰)) / ‖∇J_i(Y⁰)‖²] ∇J_i(Y⁰) and r_i ⊥ ∇J_i(Y⁰). Define the scaled gradient J_i′ = q_i and apply MGDA III. Equivalently, set the scaling factor as follows: S_i = ‖∇J_i(Y⁰)‖² / (p_i, ∇J_i(Y⁰)).
91 Variant MGDA III b: use BFGS iterative estimates in place of exact Hessians (k: iteration index; i = 1,…,n): H_i^(k+1) = H_i^(k) − (H_i^(k) s^(k) s^(k)T H_i^(k)) / (s^(k)T H_i^(k) s^(k)) + (z_i^(k) z_i^(k)T) / (z_i^(k)T s^(k)), in which H_i^(0) = Id, s^(k) = Y^(k+1) − Y^(k), z_i^(k) = ∇J_i(Y^(k+1)) − ∇J_i(Y^(k)).
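One step of this standard BFGS Hessian update can be sketched as follows (plain list-of-lists matrices for self-containment):

```python
def bfgs_update(H, s, z):
    """One BFGS update of the Hessian estimate H (symmetric, list of rows):
    H+ = H - (H s s^T H)/(s^T H s) + (z z^T)/(z^T s)."""
    n = len(H)
    Hs = [sum(H[i][j] * s[j] for j in range(n)) for i in range(n)]
    sHs = sum(s[i] * Hs[i] for i in range(n))
    zs = sum(z[i] * s[i] for i in range(n))
    return [[H[i][j] - Hs[i] * Hs[j] / sHs + z[i] * z[j] / zs
             for j in range(n)] for i in range(n)]
```

For the quadratic J(Y) = Y₁² + 2Y₂² (Hessian diag(2, 4)), a step s = (1, 0) gives z = ∇J(Y+s) − ∇J(Y) = (2, 0); starting from H⁰ = Id, the update reproduces the first Hessian column exactly and satisfies the secant condition H⁺s = z.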
92 Recommended step-size. J_i(Y⁰ − ρω) ≈ J_i(Y⁰) − ρ (∇J_i(Y⁰), ω) + ½ ρ² (H_i ω, ω), so δJ_i := J_i(Y⁰) − J_i(Y⁰ − ρω) ≈ A_i ρ − ½ B_i ρ², where A_i = S_i a_i, B_i = S_i b_i, and a_i = (J_i′, ω), b_i = (H_i ω, ω)/S_i. If the scales {S_i} are truly relevant for the objective functions, choose ρ* = Argmax_ρ min_i δ_i, where δ_i = δJ_i / S_i = a_i ρ − ½ b_i ρ².
93 Recommended step-size 2. If I = n: ρ* = ρ_I := ‖ω‖² / b_I, where b_I = max_{i≤I} b_i is assumed to be positive. If I < n: ρ_II := a ‖ω‖² / b_II, where b_II = max_{i>I} b_i is assumed to be positive; also ρ*** = 2(1−a) ‖ω‖² / (b_I − b_II), at which point the two bounding parabolas intersect: ‖ω‖² ρ − ½ b_I ρ² = a ‖ω‖² ρ − ½ b_II ρ².
94 Recommended step-size 3. [Figure: δ versus ρ, four cases: a) I = n; b) I < n, ρ*** > max(ρ_I, ρ_II); c) I < n, ρ*** ∈ (ρ_I, ρ_II); d) I < n, ρ*** < min(ρ_I, ρ_II); bounding parabolas at levels ‖ω‖² and a ‖ω‖².]
95 Recommended step-size — end. If I = n, ρ = ρ_I. If I < n, ρ is the element of the triplet {ρ_I, ρ_II, ρ***} which separates the other two.
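The selection rule of slides 93-95 reduces to a few lines (a sketch; the positivity of b_I, b_II and of b_I − b_II is assumed, as on the slides):

```python
def recommended_step(omega_norm2, a, b_I, b_II=None):
    """Step-size selection: rho_I = ||omega||^2 / b_I when the Gram-Schmidt
    process is complete (I = n); otherwise the median of {rho_I, rho_II,
    rho_star}, where rho_star is the parabola-intersection point."""
    rho_I = omega_norm2 / b_I
    if b_II is None:               # complete process: I = n
        return rho_I
    rho_II = a * omega_norm2 / b_II
    rho_star = 2.0 * (1.0 - a) * omega_norm2 / (b_I - b_II)
    return sorted([rho_I, rho_II, rho_star])[1]   # the separating element
```

For instance, with ‖ω‖² = 1, a = 0.5, b_I = 4, b_II = 1: ρ_I = 0.25, ρ_II = 0.5, ρ_star = 1/3, and the rule returns 1/3.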
96 Outline
97 Multiple-Gradient Descent Algorithm — conclusions: theory. The notion of Pareto stationarity was introduced to characterize "standard" Pareto-optimal points. MGDA permits the identification of a descent direction common to an arbitrary number of criteria, knowing the local gradients, in number at most equal to the number of design parameters. Proof of convergence of MGDA to Pareto-stationary points. The capability to identify the Pareto front was demonstrated on mathematical test cases; non-convexity of the Pareto front is not a problem. MGDA II: a direct procedure, faster and more accurate, when the gradients are linearly independent; variant II b (with automatic rescale) recommended. On-going research: scaling, preconditioning, step-size adjustment, extension to cases of linearly-dependent gradients, …
98 CONCLUSIONS: Applications

- Implementation and first experiments
- Validation on analytical test cases and comparison with a Pareto-capturing method (PAES)
- Extension to meta-model-assisted MGDA and an automobile cooling-system application
- Successful design experiments with 3D compressible flow (Euler and Navier-Stokes)
- On-going development of meta-model-assisted MGDA and testing on a 3D external-aerodynamics case (to be presented at ECCOMAS 2012, Vienna, Austria)

98 / 102
99 CONCLUSIONS: Applications (domain-partitioning model problem)

The method works, but it is not as cost-efficient as the standard quasi-Newton method applied to the agglomerated criterion; the test-case, however, was peculiar in several ways:
- Pareto front and Pareto set reduced to a singleton
- large dimension of the design space (4 criteria, 76 design variables)
- a robust procedure to adjust the step-size is needed
- the determination of ω in the basic method should be made more accurate, perhaps using iterative refinement
- scaling of the gradients was found important but never completely analyzed (logarithmic gradients appropriate for vanishing criteria)
- general preconditioning not yet clear (on-going)

The direct variant was found faster and more robust; with the automatic rescaling procedure on, variant b seems to exhibit quadratic convergence.

99 / 102
100 Outline

100 / 102
101 Flirting with the Pareto front: Cooperation AND Competition

MGDA is combined with a locally-adapted Nash game based on a territory splitting that preserves, to second order, the Pareto-stationarity condition.

COOPERATION (MGDA): the design point is steered to the Pareto front.

COMPETITION (Nash game): at the start (from the Pareto front): α_1 ∇J_1 + α_2 ∇J_2 = 0; then let J_A = α_1 J_1 + α_2 J_2 and J_B = J_2, and adapt the territory splitting to best preserve J_A; then the trajectory remains tangent to the Pareto front.

101 / 102
102 Some references

Reports from the INRIA Open Archive:
- M. G. D. A., INRIA Research Report No. (2009), revised version, October 2012 (Open Archive: ...)
- Related INRIA Research Reports: Nos. (2011), 7922 (2012), 7968 (2012), 8068 (2012)

Other:
- C. R. Acad. Sci. Paris, Ser. I 350 (2012)

102 / 102