A multistart multisplit direct search methodology for global optimization
Ismael Vaz (Univ. Minho), Luis Nunes Vicente (Univ. Coimbra)
IPAM, Optimization and Optimal Control for Complex Energy and Property Landscapes, October 2017
Outline
1. LOCAL: Probabilistic direct search. Deterministic case first. Motivation, performance, complexity.
2. GLOBAL: By multistarting and multisplitting probabilistic DS. How to split and merge runs. Use of non-convex modelling.
Direct-search methods
Definition: Sample the objective function at a finite number of points at each iteration. Achieve descent by moving in the direction of potentially better points.
In the smooth and deterministic case, these points are defined by directions in positive spanning sets (PSS) D.
Coordinate search
(Sequence of figures illustrating iterations of coordinate search.)
A class of DS methods
Choose x_0 and α_0. For k = 0, 1, 2, ... (until α_k is sufficiently small):
Search step (optional).
Poll step: select a PSS D_k and try to find x_k + α_k d_k (d_k ∈ D_k) such that
    f(x_k + α_k d_k) < f(x_k) − ρ(α_k),  e.g. ρ(α) = α²/2.
SUCCESS: move, x_{k+1} = x_k + α_k d_k, and possibly increase the step size, α_{k+1} = γ α_k (γ = 1 or 2).
UNSUCCESS: stay, x_{k+1} = x_k, and decrease the step size, α_{k+1} = θ α_k (θ = 1/2).
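The poll/step-size loop above can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the function name, defaults, and stopping rule are ours), here polling the coordinate PSS [I −I]:

```python
import numpy as np

def direct_search(f, x0, alpha0=1.0, theta=0.5, gamma=2.0,
                  alpha_tol=1e-8, max_iter=10_000):
    """Directional direct search with sufficient decrease rho(alpha) = alpha^2/2."""
    n = len(x0)
    D = np.vstack([np.eye(n), -np.eye(n)])   # maximal positive basis [I -I]
    x, alpha = np.asarray(x0, dtype=float), alpha0
    for _ in range(max_iter):
        if alpha < alpha_tol:                # stop when the step size is small
            break
        rho = 0.5 * alpha**2                 # forcing function rho(alpha)
        for d in D:                          # poll step (no search step here)
            if f(x + alpha * d) < f(x) - rho:
                x = x + alpha * d            # SUCCESS: move and expand
                alpha *= gamma
                break
        else:
            alpha *= theta                   # UNSUCCESS: stay and contract
    return x

x_star = direct_search(lambda x: np.sum((x - 1.0)**2), np.zeros(4))
```

On this smooth convex quadratic the iterates approach the minimizer (1, ..., 1) before the step-size tolerance is reached.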
Deterministic approach
Positive spanning set (PSS): a set of directions whose nonnegative linear combinations span R^n. For the coordinate set D⊕ = [I −I], cm(D⊕) = 1/√n.
Cosine measure of a PSS D:
    cm(D) = min_{0 ≠ v ∈ R^n} max_{d ∈ D} (dᵀv)/(‖d‖‖v‖) > 0.
Thus there exists a descent direction d ∈ D when ∇f(x_k) ≠ 0, and so α_k small leads to success!
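The value cm(D⊕) = 1/√n can be checked numerically (a Monte Carlo sketch; sampling unit vectors v only gives an upper-bound estimate of the minimum, which approaches 1/√2 for n = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2
D = np.vstack([np.eye(n), -np.eye(n)])   # D_plus = [I -I]

# cm(D) = min over unit v of max_{d in D} d^T v (directions already normalized);
# sampling v yields an upper-bound estimate of this minimum.
V = rng.standard_normal((200_000, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)
estimate = (V @ D.T).max(axis=1).min()
```

The estimate never falls below 1/√2 and gets close to it as the sample grows.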
WCC of DS (smooth case)
Insight:
    (decrease in f) ≥ O(α_k²)   [successful iterations]
    O(α_{k_u}²) ≥ O(‖∇f(x_{k_u})‖²)   [unsuccessful iterations; Kolda, Lewis, Torczon, 2003 SIREV]
Theorem (LNV, 2013 EURO J. Comput. Optim.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    min_{0 ≤ j ≤ k} ‖∇f(x_j)‖ = O(1/√k)
and takes at most k_ε ≤ O(n ε⁻²) iterations to reduce the gradient below ε ∈ (0, 1).
The number of fevals must be multiplied by n: O(n² ε⁻²).
The bounds depend on L_f² (instead of L_f as in the gradient method).
WCC of DS (smooth, convex case)
Ruling out cases where the supremum of the distance from the initial level set L_f(x_0) to the solution set X_f* is infinite...
Theorem (M. Dodangeh and LNV, 2016 Math. Program.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    f(x_k) − f* = O(1/k),   k_ε ≤ O(n ε⁻¹).
Again, the number of fevals must be multiplied by n: O(n² ε⁻¹).
The n² factor comes from |D| cm(D)⁻². For D = D⊕ one obtains 2n (1/√n)⁻² = 2n². Is this optimal?
Optimality of the n² factor
Theorem (M. Dodangeh, LNV, and Z. Zhang, 2016 Optim. Lett.). The factor n² is optimal, since any PSS D in R^n satisfies
    |D| cm(D)⁻² ≥ ζ² n²
for a universal constant ζ > 0.
(Figure: plot of |D| cm(D)⁻² against n² = 4 for the case n = 2 and PSSs D with uniform angles, m = |D|; for D⊕ one has |D⊕| cm(D⊕)⁻² = 8.)
Global rate of DS (smooth, strongly convex case)
Theorem (M. Dodangeh and LNV, 2014 Math. Program.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    f(x_k) − f* < C r^k,  where r ∈ (0, 1) and C > 0.
When f is strongly convex (constant µ > 0), one has (k_0 the first unsuccessful iteration)
    ‖x_k − x*‖ ≤ √(L_f/µ) ‖x_{k_0} − x*‖.
A linear rate for the iterates can be derived from (1/2) µ ‖x − x*‖² ≤ f(x) − f*.
Difficulties in the nonsmooth case
(Figure: the cone of descent directions at the poll center is shaded.)
One possible fix: combine DS with smoothing
Essentially the WCC cost increases from O(ε⁻²) to O(ε⁻³) [R. Garmanjani and LNV, 2013 IMA J. Numer. Anal.].
The number of function evaluations increases from O(n² ε⁻²) to O(n^{5/2} ε⁻³).
Summary of DS global rates
Imposing sufficient decrease to accept new iterates:
    O(log(1/ε)) — strongly convex: linear global rate for f, ∇f, and the absolute error in the iterates.
    O(ε⁻¹) — convex: global rate 1/k for f and ∇f.
    O(ε⁻²) — non-convex: global rate 1/√k for ∇f.
In terms of function evaluations: O(n² ε⁻¹), O(n² ε⁻²). The factor n² is proved approximately optimal.
    O(ε⁻³) — non-smooth, non-convex (using smoothing techniques).
Descent condition and successful iterations
Assume the polling directions are normalized.
Lemma. If cm(D_k, −∇f(x_k)) ≥ κ and
    α_k < 2κ ‖∇f(x_k)‖ / (L_f + 1),
then the k-th iteration is successful.
Here cm(D, v) is the cosine measure of D given v, defined by
    cm(D, v) = max_{d ∈ D} (dᵀv)/(‖d‖‖v‖),
and L_f is a Lipschitz constant of ∇f.
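The lemma's threshold can be checked numerically on a simple quadratic (an illustrative sketch; the test function, point, and constants are ours). With normalized coordinate polling directions, a step just below 2κ‖∇f(x_k)‖/(L_f + 1) yields a successful poll, while for this exactly quadratic f a step just above it does not:

```python
import numpy as np

L = 4.0                                      # gradient Lipschitz constant L_f
f = lambda x: 0.5 * L * np.dot(x, x)
grad = lambda x: L * x

x = np.array([0.3, -1.2, 0.7])
D = np.vstack([np.eye(3), -np.eye(3)])       # normalized polling directions

g = grad(x)
kappa = np.max(D @ -g) / np.linalg.norm(g)   # cm(D, -grad f(x)) at this x
threshold = 2 * kappa * np.linalg.norm(g) / (L + 1)

def poll_succeeds(alpha):
    # success means sufficient decrease with rho(alpha) = alpha^2 / 2
    return any(f(x + alpha * d) < f(x) - 0.5 * alpha**2 for d in D)

ok_below = poll_succeeds(0.99 * threshold)
fails_above = not poll_succeeds(1.01 * threshold)
```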
Randomly generating positive spanning sets...
(Figures: n + 1 random polling directions — in this case not a PSS; n random polling directions — certainly not a PSS...)
Still, cm(D_k, −∇f(x_k)) ≥ κ can be satisfied probabilistically...
Numerical illustration
(Figure: relative performance for different sets of polling directions — [I −I], [Q −Q], 2n, n + 1, n/4, 2, 1 — on the problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, vardim, with n = 40. Solution accuracy was ... ; averages were taken over 10 independent runs.)
Probabilistic descent
From the definition of probabilistic models (Bandeira, LNV, Scheinberg, 2014 SIOPT):
Definition. The sequence {D_k} is (p, κ)-probabilistically descent if, for each k ≥ 0,
    Pr( cm(D_k, −∇f(x_k)) ≥ κ | D_0, ..., D_{k−1} ) ≥ p.
Let Z_k be the indicator function of {cm(D_k, −∇f(x_k)) ≥ κ}, and
    p_0 = ln θ / ln(γ⁻¹θ) = 1/2 for θ = 1/2, γ = 2.
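The threshold p_0 depends only on the contraction/expansion factors; a one-line check (the helper name is ours):

```python
import math

def p0(theta, gamma):
    """p0 = ln(theta) / ln(theta / gamma): the descent probability at which the
    expected change in ln(alpha_k) is zero, so p > p0 keeps the step size from
    collapsing too quickly."""
    return math.log(theta) / math.log(theta / gamma)
```

For θ = 1/2, γ = 2 this gives exactly 1/2, and p_0 grows toward 1 as γ approaches 1.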
Global rate: counting descent
For each realization of the DS algorithm, define
    g̃_k: the gradient with minimum norm among ∇f(x_0), ..., ∇f(x_k),
    k_ε: the smallest integer k such that ‖∇f(x_k)‖ ≤ ε.
Denote the corresponding random variables by G̃_k and K_ε.
Let z_l denote the realization of the indicator Z_l of {cm(D_l, −∇f(x_l)) ≥ κ} (l ≥ 0).
Intuition: if ‖g̃_k‖ is big, then Σ_{l=0}^{k−1} z_l is probably small.
Global rate: counting descent
In fact, one can prove
    Σ_{l=0}^{k−1} z_l ≤ O(1/‖g̃_k‖²) + p_0 k.
It then results that
    {G̃_k > ε} ⊆ { Σ_{l=0}^{k−1} Z_l ≤ λ k },  with λ = O(1/(k ε²)) + p_0.
Hence
    Pr(G̃_k ≤ ε) = 1 − Pr(G̃_k > ε) ≥ 1 − Pr( Σ_{l=0}^{k−1} Z_l ≤ λ k ),
and the last probability is bounded by applying a Chernoff bound and submartingale theory.
Global rate and WCC bound
Theorem (Gratton, Royer, LNV, and Zhang, 2015 SIOPT). Suppose that {D_k} is (p, κ)-probabilistically descent with p > p_0. Then
    Pr( G̃_k ≤ O(1/(κ √k)) ) ≥ 1 − exp[−O(k)].
O(1/√k) sublinear rate with overwhelmingly high probability.
Since Pr(K_ε ≤ k) = Pr(G̃_k ≤ ε), we also get:
Theorem (Gratton, Royer, LNV, and Zhang, 2015 SIOPT). Under the same assumption,
    Pr( K_ε ≤ O(ε⁻²/κ²) ) ≥ 1 − exp[−O(ε⁻²)].
O(ε⁻²) bound for the number of iterations with overwhelmingly high probability.
Two uniform directions are enough, one is not
d_1 ~ U(S¹): for all κ ∈ (0, 1),
    Pr( cm({d_1}, g) = d_1ᵀg ≥ κ ) < 1/2.
d_1, d_2 ~ U(S¹): there exists κ ∈ (0, 1) such that
    Pr( cm({d_1, d_2}, g) ≥ κ ) > 1/2.
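Both claims are easy to observe by Monte Carlo on the unit circle (an illustrative sketch; κ and the sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.array([1.0, 0.0])      # any fixed unit "gradient" direction
kappa = 0.1
N = 200_000

# two independent batches of directions uniform on the unit circle S^1
d1 = rng.standard_normal((N, 2)); d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
d2 = rng.standard_normal((N, 2)); d2 /= np.linalg.norm(d2, axis=1, keepdims=True)

p_one = np.mean(d1 @ g >= kappa)                       # a single uniform direction
p_two = np.mean(np.maximum(d1 @ g, d2 @ g) >= kappa)   # two independent directions
```

For this small κ, one direction succeeds with probability below 1/2 while two succeed with probability well above it.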
Worst case complexity: dependence on the dimension
Then, when the r = |D_k| > 1 polling directions are drawn independently and uniformly on the unit sphere, {D_k} is p-probabilistically (1/√n)-descent for some p > p_0 = 1/2 independent of n.
Plugging κ = 1/√n into the WCC bound, one obtains, in terms of the number of function evaluations,
    Pr( K_ε^f ≤ O(n ε⁻²) r ) ≥ 1 − exp[−O(ε⁻²)].
The WCC bound is O(r n ε⁻²), better than O(n² ε⁻²) when r ≪ n.
The theory admits r = 2, leading to O(n ε⁻²)!!!
A more detailed look at the numerical experiments
(Figure: relative performance for different sets of polling directions — [I −I], [Q −Q], 2 (γ = 2), 4 (γ = 1.1) — on the problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, vardim, with n = 40. Now γ = 1 for [I −I] and [Q −Q].)
A more detailed look at the numerical experiments
(Figure: the same comparison — [I −I], [Q −Q], 2 (γ = 2), 4 (γ = 1.1) — on the same problems, with n = 100. Again γ = 1 for [I −I] and [Q −Q].)
Even better than 2 directions
Two random vectors, D_k = {d, −d} with d uniformly distributed on the unit sphere of R^n, work even better, and are an optimal choice:
Given v ∈ R^n with ‖v‖ = 1 and κ ∈ [0, 1],
    Pr( cm(D_k, v) ≥ κ ) = Pr( dᵀv ≥ κ ) + Pr( −dᵀv ≥ κ ) = 2ϱ,
with ϱ = Pr( dᵀv ≥ κ ).
Given any γ, θ satisfying 0 < θ < 1 < γ, we can pick κ > 0 sufficiently small so that 2ϱ > p_0 = (ln θ)/[ln(γ⁻¹θ)].
In addition, for any D = {d_1, d_2} uniformly distributed on the unit sphere,
    Pr( cm(D, v) ≥ κ ) = 2ϱ − Pr( {d_1ᵀv ≥ κ} ∩ {d_2ᵀv ≥ κ} ) ≤ 2ϱ.
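The identity Pr(cm({d, −d}, v) ≥ κ) = 2ϱ and the inequality for an independent pair can both be seen numerically (a sketch; the dimension, κ, and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, kappa = 10, 400_000, 0.05
v = np.zeros(n); v[0] = 1.0     # any fixed unit vector

d = rng.standard_normal((N, n)); d /= np.linalg.norm(d, axis=1, keepdims=True)
e = rng.standard_normal((N, n)); e /= np.linalg.norm(e, axis=1, keepdims=True)

rho = np.mean(d @ v >= kappa)                               # varrho = Pr(d^T v >= kappa)
p_mirror = np.mean(np.abs(d @ v) >= kappa)                  # D = {d, -d}: equals 2*varrho
p_indep = np.mean(np.maximum(d @ v, e @ v) >= kappa)        # independent pair: <= 2*varrho
```

The mirrored pair spends both chances on disjoint events, so it attains the upper bound 2ϱ that an independent pair can only approach.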
So, a new approach and proof technique
A new proof technique for establishing global rates and WCC bounds for randomized algorithms for which:
    the new iterate depends on some object (directions, models);
    the quality of the object is favorable with a certain probability.
The technique is based on:
    counting the number of iterations for which the quality is favorable;
    examining the probabilistic behavior of this number.
What we obtain is a global rate of O(1/√k), with overwhelmingly high probability.
Examples of application
This technique has also been applied to:
Trust-region methods based on probabilistic models [Gratton, Royer, LNV, and Zhang, 2017 IMAJNA] — when models are built iteratively in some random fashion and exhibit good accuracy with sufficiently high probability.
Problems with linear constraints [Gratton, Royer, LNV, and Zhang, 2017] — random generation can be efficiently done in subspaces identified within approximate tangent cones.
Global Optimization part of the talk
Our goal is now to develop a solver for a global optimization problem of the type
    min_{x ∈ Ω} f(x),
where Ω ⊂ R^n is a box. We want to do this:
    relying on probabilistic direct search, which is an efficient, simple, and rigorous local solver;
    taking advantage of parallel computing;
    having an approach capable of identifying several global or local minimizers of interest.
REMEMBER: probabilistic DS (polling with 2 directions)
Choose x_0 and α_0. For k = 0, 1, 2, ... (until α_k is sufficiently small):
Search step (optional).
Poll step: select D_k = [d, −d] RANDOMLY and try to find x_k + α_k d_k (d_k ∈ D_k) such that
    f(x_k + α_k d_k) < f(x_k) − ρ(α_k),  e.g. ρ(α) = α²/2.
SUCCESS: move, x_{k+1} = x_k + α_k d_k, and increase the step size, α_{k+1} = γ α_k (γ = 2).
UNSUCCESS: stay, x_{k+1} = x_k, and decrease the step size, α_{k+1} = θ α_k (θ = 1/2).
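The recalled local solver fits in a few lines of Python (a minimal sketch, not the authors' implementation; names and defaults are ours):

```python
import numpy as np

def prob_direct_search(f, x0, alpha0=1.0, theta=0.5, gamma=2.0,
                       alpha_tol=1e-8, max_iter=20_000, seed=0):
    """Direct search polling D_k = [d, -d], with d uniform on the unit sphere."""
    rng = np.random.default_rng(seed)
    x, alpha = np.asarray(x0, dtype=float), alpha0
    for _ in range(max_iter):
        if alpha < alpha_tol:
            break
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)                     # uniform direction on the sphere
        rho = 0.5 * alpha**2
        for dk in (d, -d):                         # poll the mirrored pair
            if f(x + alpha * dk) < f(x) - rho:
                x, alpha = x + alpha * dk, gamma * alpha   # success
                break
        else:
            alpha *= theta                         # unsuccess
    return x

x_star = prob_direct_search(lambda x: np.sum(x**2), np.full(5, 2.0))
```

Each iteration costs at most 2 evaluations regardless of n, which is what makes the multistart framework below cheap to parallelize.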
A multistart multisplit DS framework (main idea)
MAIN IDEA:
    Consider a certain number R of DS runs guided by the poll centers (in parallel).
    Create R clusters C_j, j = 1, ..., R, of points, out of all the points where f has been evaluated.
    Split or merge the DS runs based on how the poll centers fit into the clusters.
IN THIS WAY:
    We parallelize DS runs (where poll steps are typically limited to 2n evaluations).
    We allow MULTISTART runs and SPLITTING/MERGING of runs.
A multistart multisplit DS framework (scheme)
MULTISTART: select initial points for R_0 DS runs, x_j^0, j = 1, ..., R_0 (and initial step sizes α_j^0, j = 1, ..., R_0).
At each OUTER iteration l:
    Create R_l clusters C_j^l, j = 1, ..., R_l (using all the history of points).
    Decide whether any DS run is split or merged. Let x_j^{l+1}, j = 1, ..., R_{l+1}, be the new poll centers.
INNER ITERATIONS: perform (in parallel) a certain number of iterations for all the R_{l+1} DS runs, starting from x_j^{l+1}, j = 1, ..., R_{l+1}.
Requirements and approaches
To align with typical global convergence requirements:
    The clustering step must be finite.
    The DS runs must comply with the local rules.
    When merging two DS runs, the new poll center must be the one with the lowest objective value.
We consider two approaches for the OUTER iterations:
    Using space clustering.
    Using nonconvex models built from previous function values.
The space clustering approach
    Create R_l clusters using kmeans clustering (using all previously evaluated points).
    If a cluster contains more than one poll center, remove the DS run corresponding to the poll center with the worst objective value.
    If a cluster contains no poll center, then attempt to start a new DS run from the centroid of the cluster.
POSSIBLE DRAWBACK: the kmeans approach does not take into account the previous objective function values (clustering is done in the variables space).
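The two clustering rules can be sketched as follows (an illustrative simplification; the from-scratch k-means, toy data, and the nearest-centroid assignment of poll centers are all ours):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means, standing in for the 'kmeans clustering' of the slide."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

def merge_and_spawn(poll_centers, f_vals, history, k):
    """Keep only the best poll center per cluster; spawn a run from the
    centroid of any cluster containing no poll center."""
    centroids = kmeans(history, k)
    c_lab = np.argmin(np.linalg.norm(poll_centers[:, None] - centroids[None], axis=2), axis=1)
    keep = []
    for j in range(k):
        idx = [i for i in range(len(poll_centers)) if c_lab[i] == j]
        if idx:                                    # merge: best center survives
            keep.append(min(idx, key=lambda i: f_vals[i]))
    new_centers = [poll_centers[i] for i in keep]
    for j in set(range(k)) - set(c_lab.tolist()):  # empty cluster: spawn from centroid
        new_centers.append(centroids[j])
    return np.array(new_centers)

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.5, size=(20, 2))        # history points in one basin
blob_b = rng.normal(10.0, 0.5, size=(20, 2))       # history points in another basin
history = np.vstack([blob_a, blob_b])
poll_centers = np.array([[0.1, 0.0], [0.4, 0.3]])  # both runs stuck near the same basin
f_vals = np.array([1.0, 2.0])
new_centers = merge_and_spawn(poll_centers, f_vals, history, k=2)
```

In this toy setting the run with the worse objective value is dropped, and the explored-but-unattended basin receives a fresh run started from its centroid.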
The space clustering approach (example)
(Figure: 2-D Ackley problem with kmeans, two clusters; two poll centers fall in the same cluster, and the centroid of the other cluster is marked.)
A nonconvex piecewise quadratic model
Modeling can instead identify valleys of convexity. (Nonconvex) piecewise quadratic models were suggested by Mangasarian et al., JOGO, 2006.
Given a sample set Y = {y_1, ..., y_{n_p}}, a nonconvex piecewise quadratic underestimate of f (with n_q quadratics) can be obtained by solving
    max_p Σ_{j=1}^{n_p} q(p; y_j)  s.t.  q(p; y_j) ≤ f(y_j), j = 1, ..., n_p,
with p containing all the quadratics' coefficients p_i = (a_i, c_i, H_i) in
    q(p; y) = min_{1 ≤ i ≤ n_q} { a_i + c_iᵀy + (1/2) yᵀH_i y } = min_{1 ≤ i ≤ n_q} q_i(y),
and where the H_i's must be assured symmetric and positive definite.
A nonconvex piecewise quadratic model (example)
(Figure: a nonconvex piecewise quadratic underestimate of a sine-type function, built from quadratics q_1, ..., q_5.)
A nonconvex piecewise quadratic model (drawbacks)
One can introduce auxiliary variables (linear objective):
    max_{p,γ} Σ_{j=1}^{n_p} γ_j
    s.t.  q(p; y_j) ≤ f(y_j), j = 1, ..., n_p,
          γ_j ≤ q_i(y_j), i = 1, ..., n_q, j = 1, ..., n_p.
Still, this optimization problem has several drawbacks:
    It is nonconvex, and has to be solved itself for global optimality to ensure effective underestimation.
    The number of variables is excessive: n_p (number of auxiliary variables) + (n + 1)(n + 2)/2 × n_q (quadratic coefficients × number of quadratics).
    The number of constraints is also high when imposing positive definiteness on all quadratics.
133 A nonconvex piecewise quadratic model (simplified) 38/69 But still, one could think of computing just a feasible point for the model. Or one can take a subset of $\bar{n}_p \le n_p$ points and consider $n_q = 1$, i.e., fit only one quadratic $q_i$, resulting in the following LP:
$$\max_{p_i,\gamma} \; \sum_{j=1}^{\bar{n}_p} \gamma_j \quad \text{s.t.} \quad q_i(y_j) \le f(y_j), \quad \gamma_j \le q_i(y_j), \;\; j = 1, \ldots, \bar{n}_p.$$
The number of variables reduces to $\bar{n}_p$ (auxiliary variables) $+ \; \frac{(n+1)(n+2)}{2}$ (one quadratic).
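Assuming a standard LP solver is at hand (here `scipy.optimize.linprog`; the 1D setup, sample points, and tolerances are illustrative and not from the talk), the one-quadratic LP can be sketched as:

```python
# 1D sketch (illustrative, not the authors' implementation) of the
# single-quadratic LP: fit q(y) = a + c*y + 0.5*h*y^2 as a tight underestimator
# of sampled values f(y_j), maximizing sum_j gamma_j subject to
# gamma_j <= q(y_j) and q(y_j) <= f(y_j).
import numpy as np
from scipy.optimize import linprog

def fit_quadratic_underestimator(ys, fs, h_min=1e-6):
    m = len(ys)
    # Variables x = (a, c, h, gamma_1, ..., gamma_m); linprog minimizes,
    # so put cost -1 on the gammas to maximize their sum.
    cost = np.zeros(3 + m)
    cost[3:] = -1.0
    A_ub, b_ub = [], []
    for j, (y, f) in enumerate(zip(ys, fs)):
        row_q = np.zeros(3 + m)
        row_q[:3] = [1.0, y, 0.5 * y * y]
        A_ub.append(row_q); b_ub.append(f)        # q(y_j) <= f(y_j)
        row_g = -row_q.copy()
        row_g[3 + j] = 1.0
        A_ub.append(row_g); b_ub.append(0.0)      # gamma_j - q(y_j) <= 0
    # In 1D, h >= h_min is a simplified stand-in for H being PD.
    bounds = [(None, None), (None, None), (h_min, None)] + [(None, None)] * m
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[0], res.x[1], res.x[2], res

ys = [-2.0, -1.0, 0.0, 1.0, 2.0]
fs = [y ** 4 for y in ys]
a, c, h, res = fit_quadratic_underestimator(ys, fs)
```

The `h_min` bound is a 1D simplification of the positive-definiteness requirement on $H$; in $n$ dimensions, as the slides note, imposing PD requires extra (and more expensive) constraints.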
137 The nonconvex modeling approach 39/69
- Consider again R runs of DS and all previously evaluated points.
- Form R clusters of points around the R poll centers (using distances to them).
- Compute the quadratic underestimating model for each cluster of points.
- Assess how well the quadratic models fit the whole data:
$$\theta_{i,j} = \sum_{l=1}^{(n_p)_j} \left( \frac{f(y_l^j) - q_i(y_l^j)}{f(y_l^j) + 1} \right)^2, \quad i, j = 1, \ldots, R.$$
- If $\theta_{i,i}$ is large, then split DS run $i$ (quadratic $i$ does not fit cluster $i$ well).
- In addition, if $\theta_{i,j}$, $i \ne j$, is small, then merge DS runs $i$ and $j$, using the poll center with the lowest objective value.
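The split/merge logic above can be sketched as follows (the thresholds `split_tol` and `merge_tol` are hypothetical, not from the talk):

```python
# Illustrative sketch of the fit measure theta_{i,j} and the split/merge rules.

def theta(f_vals, q_vals):
    """theta_{i,j} = sum_l ((f(y_l^j) - q_i(y_l^j)) / (f(y_l^j) + 1))^2."""
    return sum(((f - q) / (f + 1.0)) ** 2 for f, q in zip(f_vals, q_vals))

def split_and_merge(theta_mat, split_tol, merge_tol):
    """Split run i when its own model fits badly; merge runs i and j when
    model i also fits cluster j well (hypothetical threshold rule)."""
    R = len(theta_mat)
    splits = [i for i in range(R) if theta_mat[i][i] > split_tol]
    merges = [(i, j) for i in range(R) for j in range(R)
              if i != j and theta_mat[i][j] < merge_tol]
    return splits, merges

# Three runs: run 1's own model fits poorly (split it); model 0 fits
# cluster 2 well and vice versa (merge runs 0 and 2).
theta_mat = [[0.2, 4.0, 0.03],
             [3.0, 7.0, 5.0],
             [0.04, 6.0, 0.1]]
splits, merges = split_and_merge(theta_mat, split_tol=1.0, merge_tol=0.05)
```

A perfect fit gives $\theta_{i,j} = 0$; the relative error $(f - q)/(f + 1)$ keeps the measure comparable across clusters with different function scales.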
143 The nonconvex modeling approach (example) 40/69 [Figure: the nonconvex piecewise underestimate of the sine-type example (quadratics q_1, ..., q_5), together with a table of the resulting θ_{i,j} values across clusters/runs.]
144 Numerical results 41/69 We provide some numerical results on a set of 102 global optimization problems. Our goals here are:
- To understand if our approach is sound, and to see if it is capable of identifying global minima.
- To show that it performs efficiently on a wide and well representative set of problems.
- To develop a solver/code reasonably tested and tuned.
The results are presented using performance profiles. The solver starts with 10 poll centers in serial, and a maximum of 10 simultaneous runs is imposed. A comparison is made with PSwarm (a direct search solver that uses a particle swarm heuristic as a search step).
151 Performance profiles (# function evaluations) 42/69 [Figure: performance profiles for the number of function evaluations, comparing MMDS (2n dirs, kmeans), MMDS (2n dirs, nonconvex), MMDS (2 opposite, kmeans), MMDS (2 opposite, nonconvex), and PSwarm.]
152 Performance profiles (quality of final f) 43/69 [Figure: performance profiles for the final objective function values, comparing MMDS (2n dirs, kmeans), MMDS (2n dirs, nonconvex), MMDS (2 opposite, kmeans), MMDS (2 opposite, nonconvex), and PSwarm.]
153 Numerical results (application problem) 44/69 Is this approach capable of computing different global minima for an application problem? Consider optimizing the orientation of a part to be built by additive manufacturing (3D printing), where the staircase effect is to be minimized:
$$\min_{(\theta_x, \theta_y)} \; \sum_j \begin{cases} h^2 A_j \, |d \cdot n_j|^2 & \text{if } |d \cdot n_j| \ne 1, \\ 0 & \text{otherwise,} \end{cases}$$
where $h$ is the slicing thickness, $A_j$ is the area of the $j$-th triangle, $d$ is the slicing direction, and $n_j$ is the $j$-th triangle's normal. This is a 2D problem where $(\theta_x, \theta_y) \in [0, 180]^2$ are the rotation angles (in degrees) about the x-axis and y-axis, respectively, in three-dimensional space.
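The objective above can be sketched as follows; the rotation convention (rotate the z-axis about x, then about y) and the triangle data are assumptions for illustration, and the formula itself is reconstructed from the slide:

```python
# Hedged sketch of the staircase-effect objective: rotate the slicing
# direction by (theta_x, theta_y) and accumulate h^2 * A_j * (d . n_j)^2
# over triangles with |d . n_j| != 1.
import math

def rotate_z_axis(theta_x_deg, theta_y_deg):
    # Slicing direction d: the z-axis rotated about x, then about y
    # (assumed rotation order).
    tx, ty = math.radians(theta_x_deg), math.radians(theta_y_deg)
    # Rx applied to (0, 0, 1) gives (0, -sin tx, cos tx); then apply Ry.
    x, y, z = 0.0, -math.sin(tx), math.cos(tx)
    return (math.cos(ty) * x + math.sin(ty) * z,
            y,
            -math.sin(ty) * x + math.cos(ty) * z)

def staircase(theta_x, theta_y, triangles, h):
    """triangles: list of (area A_j, unit normal n_j as a 3-tuple)."""
    d = rotate_z_axis(theta_x, theta_y)
    total = 0.0
    for area, n in triangles:
        dn = sum(di * ni for di, ni in zip(d, n))
        if abs(abs(dn) - 1.0) > 1e-12:   # faces parallel to slices contribute 0
            total += h * h * area * dn * dn
    return total
```

Triangles whose normals are aligned with the slicing direction ($|d \cdot n_j| = 1$) are sliced cleanly and add nothing; tilted faces accumulate staircase error proportional to their area.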
156 A used part 45/69 We consider a complex aerospace part for numerical testing. Nearly triangles are needed to define the part.
157 The part/object to be considered 46/69 Slices 5mm thick are taken along the z-axis, resulting in 64 slices. [Figure: the part before optimization.]
158 Objective function landscape 47/69 [Figure: objective function landscape over (θ_x, θ_y).]
159 Numerical results (space clustering) 48/69 to 52/69 [Figures: progression of the space clustering runs over the objective landscape.]
164 Numerical results (space clustering) 53/69 The solver stops at (θ_x, θ_y) = (90, 180), with f = e+05, taking 124 function evaluations. The space clustering approach determined only one of the global minimizers!
166 Numerical results (nonconvex modeling) 54/69 to 67/69 [Figures: the fitted quadratics q_1, q_2, ... and the minimizers of the q_i's over successive iterations of the nonconvex modeling runs.]
180 Numerical results (nonconvex modeling) 68/69 The solver now stops at (θ_x, θ_y) = (90, 180) and (θ_x, θ_y) = (90, 0), both with f = e+05, taking 2152 function evaluations. The nonconvex modeling approach determined both global minimizers of the problem!
182 The result of optimization 69/69 Slices 5mm thick are taken along the z-axis. [Figure: the part after optimization.]
6.841 Advanced Complexity Theory March 9, 2009 Lecture 10 Lecturer: Madhu Sudan Scribe: Asilata Bapat Meeting to talk about final projects on Wednesday, 11 March 2009, from 5pm to 7pm. Location: TBA. Includes
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationLAGRANGE MULTIPLIERS
LAGRANGE MULTIPLIERS MATH 195, SECTION 59 (VIPUL NAIK) Corresponding material in the book: Section 14.8 What students should definitely get: The Lagrange multiplier condition (one constraint, two constraints
More informationA glimpse into convex geometry. A glimpse into convex geometry
A glimpse into convex geometry 5 \ þ ÏŒÆ Two basis reference: 1. Keith Ball, An elementary introduction to modern convex geometry 2. Chuanming Zong, What is known about unit cubes Convex geometry lies
More informationThe simplex algorithm
The simplex algorithm The simplex algorithm is the classical method for solving linear programs. Its running time is not polynomial in the worst case. It does yield insight into linear programs, however,
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationConvex Optimization in Classification Problems
New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1
More informationSemidefinite and Second Order Cone Programming Seminar Fall 2012 Project: Robust Optimization and its Application of Robust Portfolio Optimization
Semidefinite and Second Order Cone Programming Seminar Fall 2012 Project: Robust Optimization and its Application of Robust Portfolio Optimization Instructor: Farid Alizadeh Author: Ai Kagawa 12/12/2012
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationSTATIONARITY RESULTS FOR GENERATING SET SEARCH FOR LINEARLY CONSTRAINED OPTIMIZATION
STATIONARITY RESULTS FOR GENERATING SET SEARCH FOR LINEARLY CONSTRAINED OPTIMIZATION TAMARA G. KOLDA, ROBERT MICHAEL LEWIS, AND VIRGINIA TORCZON Abstract. We present a new generating set search (GSS) approach
More informationNonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control
Nonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control 19/4/2012 Lecture content Problem formulation and sample examples (ch 13.1) Theoretical background Graphical
More informationLecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods
Stat260/CS294: Spectral Graph Methods Lecture 20-04/07/205 Lecture: Local Spectral Methods (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide
More informationIntroduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems
New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems Z. Akbari 1, R. Yousefpour 2, M. R. Peyghami 3 1 Department of Mathematics, K.N. Toosi University of Technology,
More informationRandomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming
Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationOverparametrization for Landscape Design in Non-convex Optimization
Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,
More information