A multistart multisplit direct search methodology for global optimization
Ismael Vaz (Univ. Minho), Luis Nunes Vicente (Univ. Coimbra)
IPAM, Optimization and Optimal Control for Complex Energy and Property Landscapes, October 2017
Outline
1. LOCAL: Probabilistic direct search. Deterministic case first. Motivation, performance, complexity.
2. GLOBAL: By multistarting and multisplitting probabilistic DS. How to split and merge runs. Use of non-convex modelling.
Direct-search methods
Definition: Sample the objective function at a finite number of points at each iteration. Achieve descent by moving in the direction of potentially better points.
In the smooth and deterministic case, these points are defined by directions in positive spanning sets (PSS) D.
Coordinate search
(Sequence of figures illustrating iterations of coordinate search.)
A class of DS methods
Choose x_0 and α_0. For k = 0, 1, 2, ... (until α_k is sufficiently small):
Search step (optional).
Poll step: select a PSS D_k and try to find x_k + α_k d_k (d_k ∈ D_k) such that
    f(x_k + α_k d_k) < f(x_k) − ρ(α_k),  e.g. ρ(α) = α²/2.
SUCCESS: move, x_{k+1} = x_k + α_k d_k, and possibly increase the step size, α_{k+1} = γ α_k (γ = 1 or 2).
UNSUCCESS: stay, x_{k+1} = x_k, and decrease the step size, α_{k+1} = θ α_k (θ = 1/2).
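The poll/step-size loop above can be sketched in a few lines of Python (a minimal illustration, not the authors' code; the function name, defaults, and stopping rule are ours), here polling the coordinate PSS [I −I]:

```python
import numpy as np

def direct_search(f, x0, alpha0=1.0, theta=0.5, gamma=2.0,
                  alpha_tol=1e-8, max_iter=10_000):
    """Directional direct search with sufficient decrease rho(alpha) = alpha^2/2."""
    n = len(x0)
    D = np.vstack([np.eye(n), -np.eye(n)])   # maximal positive basis [I -I]
    x, alpha = np.asarray(x0, dtype=float), alpha0
    for _ in range(max_iter):
        if alpha < alpha_tol:                # stop when the step size is small
            break
        rho = 0.5 * alpha**2                 # forcing function rho(alpha)
        for d in D:                          # poll step (no search step here)
            if f(x + alpha * d) < f(x) - rho:
                x = x + alpha * d            # SUCCESS: move and expand
                alpha *= gamma
                break
        else:
            alpha *= theta                   # UNSUCCESS: stay and contract
    return x

x_star = direct_search(lambda x: np.sum((x - 1.0)**2), np.zeros(4))
```

On this smooth convex quadratic the iterates approach the minimizer (1, ..., 1) before the step-size tolerance is reached.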
Deterministic approach
Positive spanning set (PSS): a set of directions whose nonnegative linear combinations span R^n. For the coordinate set D⊕ = [I −I], cm(D⊕) = 1/√n.
Cosine measure of a PSS D:
    cm(D) = min_{0 ≠ v ∈ R^n} max_{d ∈ D} (dᵀv)/(‖d‖‖v‖) > 0.
Thus there exists a descent direction d ∈ D when ∇f(x_k) ≠ 0, and so α_k small leads to success!
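The value cm(D⊕) = 1/√n can be checked numerically (a Monte Carlo sketch; sampling unit vectors v only gives an upper-bound estimate of the minimum, which approaches 1/√2 for n = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2
D = np.vstack([np.eye(n), -np.eye(n)])   # D_plus = [I -I]

# cm(D) = min over unit v of max_{d in D} d^T v (directions already normalized);
# sampling v yields an upper-bound estimate of this minimum.
V = rng.standard_normal((200_000, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)
estimate = (V @ D.T).max(axis=1).min()
```

The estimate never falls below 1/√2 and gets close to it as the sample grows.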
WCC of DS (smooth case)
Insight:
    (decrease in f) ≥ O(α_k²)   [successful iterations]
    O(α_{k_u}²) ≥ O(‖∇f(x_{k_u})‖²)   [unsuccessful iterations; Kolda, Lewis, Torczon, 2003 SIREV]
Theorem (LNV, 2013 EURO J. Comput. Optim.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    min_{0 ≤ j ≤ k} ‖∇f(x_j)‖ = O(1/√k)
and takes at most k_ε ≤ O(n ε⁻²) iterations to reduce the gradient below ε ∈ (0, 1).
The number of fevals must be multiplied by n: O(n² ε⁻²).
The bounds depend on L_f² (instead of L_f as in the gradient method).
WCC of DS (smooth, convex case)
Ruling out cases where the supremum of the distance from the initial level set L_f(x_0) to the solution set X_f* is infinite...
Theorem (M. Dodangeh and LNV, 2016 Math. Program.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    f(x_k) − f* = O(1/k),   k_ε ≤ O(n ε⁻¹).
Again, the number of fevals must be multiplied by n: O(n² ε⁻¹).
The n² factor comes from |D| cm(D)⁻². For D = D⊕ one obtains 2n (1/√n)⁻² = 2n². Is this optimal?
Optimality of the n² factor
Theorem (M. Dodangeh, LNV, and Z. Zhang, 2016 Optim. Lett.). The factor n² is optimal, since any PSS D in R^n satisfies
    |D| cm(D)⁻² ≥ ζ² n²
for a universal constant ζ > 0.
(Figure: plot of |D| cm(D)⁻² against n² = 4 for the case n = 2 and PSSs D with uniform angles, m = |D|; for D⊕ one has |D⊕| cm(D⊕)⁻² = 8.)
Global rate of DS (smooth, strongly convex case)
Theorem (M. Dodangeh and LNV, 2014 Math. Program.). Any such DS method generates a sequence {x_k}_{k ≥ 0} such that
    f(x_k) − f* < C r^k,  where r ∈ (0, 1) and C > 0.
When f is strongly convex (constant µ > 0), one has (k_0 the first unsuccessful iteration)
    ‖x_k − x*‖ ≤ √(L_f/µ) ‖x_{k_0} − x*‖.
A linear rate for the iterates can be derived from (1/2) µ ‖x − x*‖² ≤ f(x) − f*.
Difficulties in the nonsmooth case
(Figure: the cone of descent directions at the poll center is shaded.)
One possible fix: combine DS with smoothing
Essentially the WCC cost increases from O(ε⁻²) to O(ε⁻³) [R. Garmanjani and LNV, 2013 IMA J. Numer. Anal.].
The number of function evaluations increases from O(n² ε⁻²) to O(n^{5/2} ε⁻³).
Summary of DS global rates
Imposing sufficient decrease to accept new iterates:
    O(log(1/ε)) — strongly convex: linear global rate for f, ∇f, and the absolute error in the iterates.
    O(ε⁻¹) — convex: global rate 1/k for f and ∇f.
    O(ε⁻²) — non-convex: global rate 1/√k for ∇f.
In terms of function evaluations: O(n² ε⁻¹), O(n² ε⁻²). The factor n² is proved approximately optimal.
    O(ε⁻³) — non-smooth, non-convex (using smoothing techniques).
Descent condition and successful iterations
Assume the polling directions are normalized.
Lemma. If cm(D_k, −∇f(x_k)) ≥ κ and
    α_k < 2κ ‖∇f(x_k)‖ / (L_f + 1),
then the k-th iteration is successful.
Here cm(D, v) is the cosine measure of D given v, defined by
    cm(D, v) = max_{d ∈ D} (dᵀv)/(‖d‖‖v‖),
and L_f is a Lipschitz constant of ∇f.
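The lemma's threshold can be checked numerically on a simple quadratic (an illustrative sketch; the test function, point, and constants are ours). With normalized coordinate polling directions, a step just below 2κ‖∇f(x_k)‖/(L_f + 1) yields a successful poll, while for this exactly quadratic f a step just above it does not:

```python
import numpy as np

L = 4.0                                      # gradient Lipschitz constant L_f
f = lambda x: 0.5 * L * np.dot(x, x)
grad = lambda x: L * x

x = np.array([0.3, -1.2, 0.7])
D = np.vstack([np.eye(3), -np.eye(3)])       # normalized polling directions

g = grad(x)
kappa = np.max(D @ -g) / np.linalg.norm(g)   # cm(D, -grad f(x)) at this x
threshold = 2 * kappa * np.linalg.norm(g) / (L + 1)

def poll_succeeds(alpha):
    # success means sufficient decrease with rho(alpha) = alpha^2 / 2
    return any(f(x + alpha * d) < f(x) - 0.5 * alpha**2 for d in D)

ok_below = poll_succeeds(0.99 * threshold)
fails_above = not poll_succeeds(1.01 * threshold)
```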
Randomly generating positive spanning sets...
(Figures: n + 1 random polling directions — in this case not a PSS; n random polling directions — certainly not a PSS...)
Still, cm(D_k, −∇f(x_k)) ≥ κ can be satisfied probabilistically...
Numerical illustration
(Figure: relative performance for different sets of polling directions — [I −I], [Q −Q], 2n, n + 1, n/4, 2, 1 — on the problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, vardim, with n = 40. Solution accuracy was ... ; averages were taken over 10 independent runs.)
Probabilistic descent
From the definition of probabilistic models (Bandeira, LNV, Scheinberg, 2014 SIOPT):
Definition. The sequence {D_k} is (p, κ)-probabilistically descent if, for each k ≥ 0,
    Pr( cm(D_k, −∇f(x_k)) ≥ κ | D_0, ..., D_{k−1} ) ≥ p.
Let Z_k be the indicator function of {cm(D_k, −∇f(x_k)) ≥ κ}, and
    p_0 = ln θ / ln(γ⁻¹θ) = 1/2 for θ = 1/2, γ = 2.
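The threshold p_0 depends only on the contraction/expansion factors; a one-line check (the helper name is ours):

```python
import math

def p0(theta, gamma):
    """p0 = ln(theta) / ln(theta / gamma): the descent probability at which the
    expected change in ln(alpha_k) is zero, so p > p0 keeps the step size from
    collapsing too quickly."""
    return math.log(theta) / math.log(theta / gamma)
```

For θ = 1/2, γ = 2 this gives exactly 1/2, and p_0 grows toward 1 as γ approaches 1.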
Global rate: counting descent
For each realization of the DS algorithm, define
    g̃_k: the gradient with minimum norm among ∇f(x_0), ..., ∇f(x_k),
    k_ε: the smallest integer k such that ‖∇f(x_k)‖ ≤ ε.
Denote the corresponding random variables by G̃_k and K_ε.
Let z_l denote the realization of the indicator Z_l of {cm(D_l, −∇f(x_l)) ≥ κ} (l ≥ 0).
Intuition: if ‖g̃_k‖ is big, then Σ_{l=0}^{k−1} z_l is probably small.
Global rate: counting descent
In fact, one can prove
    Σ_{l=0}^{k−1} z_l ≤ O(1/‖g̃_k‖²) + p_0 k.
It then results that
    {G̃_k > ε} ⊆ { Σ_{l=0}^{k−1} Z_l ≤ λ k },  with λ = O(1/(k ε²)) + p_0.
Hence
    Pr(G̃_k ≤ ε) = 1 − Pr(G̃_k > ε) ≥ 1 − Pr( Σ_{l=0}^{k−1} Z_l ≤ λ k ),
and the last probability is bounded by applying a Chernoff bound and submartingale theory.
Global rate and WCC bound
Theorem (Gratton, Royer, LNV, and Zhang, 2015 SIOPT). Suppose that {D_k} is (p, κ)-probabilistically descent with p > p_0. Then
    Pr( G̃_k ≤ O(1/(κ √k)) ) ≥ 1 − exp[−O(k)].
O(1/√k) sublinear rate with overwhelmingly high probability.
Since Pr(K_ε ≤ k) = Pr(G̃_k ≤ ε), we also get:
Theorem (Gratton, Royer, LNV, and Zhang, 2015 SIOPT). Under the same assumption,
    Pr( K_ε ≤ O(ε⁻²/κ²) ) ≥ 1 − exp[−O(ε⁻²)].
O(ε⁻²) bound for the number of iterations with overwhelmingly high probability.
Two uniform directions are enough, one is not
d_1 ~ U(S¹): for all κ ∈ (0, 1),
    Pr( cm({d_1}, g) = d_1ᵀg ≥ κ ) < 1/2.
d_1, d_2 ~ U(S¹): there exists κ ∈ (0, 1) such that
    Pr( cm({d_1, d_2}, g) ≥ κ ) > 1/2.
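Both claims are easy to observe by Monte Carlo on the unit circle (an illustrative sketch; κ and the sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.array([1.0, 0.0])      # any fixed unit "gradient" direction
kappa = 0.1
N = 200_000

# two independent batches of directions uniform on the unit circle S^1
d1 = rng.standard_normal((N, 2)); d1 /= np.linalg.norm(d1, axis=1, keepdims=True)
d2 = rng.standard_normal((N, 2)); d2 /= np.linalg.norm(d2, axis=1, keepdims=True)

p_one = np.mean(d1 @ g >= kappa)                       # a single uniform direction
p_two = np.mean(np.maximum(d1 @ g, d2 @ g) >= kappa)   # two independent directions
```

For this small κ, one direction succeeds with probability below 1/2 while two succeed with probability well above it.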
Worst case complexity: dependence on the dimension
Then, when the r = |D_k| > 1 polling directions are drawn independently and uniformly on the unit sphere, {D_k} is p-probabilistically (1/√n)-descent for some p > p_0 = 1/2 independent of n.
Plugging κ = 1/√n into the WCC bound, one obtains, in terms of the number of function evaluations,
    Pr( K_ε^f ≤ O(n ε⁻²) r ) ≥ 1 − exp[−O(ε⁻²)].
The WCC bound is O(r n ε⁻²), better than O(n² ε⁻²) when r ≪ n.
The theory admits r = 2, leading to O(n ε⁻²)!!!
A more detailed look at the numerical experiments
(Figure: relative performance for different sets of polling directions — [I −I], [Q −Q], 2 (γ = 2), 4 (γ = 1.1) — on the problems arglina, arglinb, broydn3d, dqrtic, engval, freuroth, integreq, nondquar, sinquad, vardim, with n = 40. Now γ = 1 for [I −I] and [Q −Q].)
A more detailed look at the numerical experiments
(Figure: the same comparison — [I −I], [Q −Q], 2 (γ = 2), 4 (γ = 1.1) — on the same problems, with n = 100. Again γ = 1 for [I −I] and [Q −Q].)
Even better than 2 directions
Two random vectors, D_k = {d, −d} with d uniformly distributed on the unit sphere of R^n, work even better, and are an optimal choice:
Given v ∈ R^n with ‖v‖ = 1 and κ ∈ [0, 1],
    Pr( cm(D_k, v) ≥ κ ) = Pr( dᵀv ≥ κ ) + Pr( −dᵀv ≥ κ ) = 2ϱ,
with ϱ = Pr( dᵀv ≥ κ ).
Given any γ, θ satisfying 0 < θ < 1 < γ, we can pick κ > 0 sufficiently small so that 2ϱ > p_0 = (ln θ)/[ln(γ⁻¹θ)].
In addition, for any D = {d_1, d_2} uniformly distributed on the unit sphere,
    Pr( cm(D, v) ≥ κ ) = 2ϱ − Pr( {d_1ᵀv ≥ κ} ∩ {d_2ᵀv ≥ κ} ) ≤ 2ϱ.
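The identity Pr(cm({d, −d}, v) ≥ κ) = 2ϱ and the inequality for an independent pair can both be seen numerically (a sketch; the dimension, κ, and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, kappa = 10, 400_000, 0.05
v = np.zeros(n); v[0] = 1.0     # any fixed unit vector

d = rng.standard_normal((N, n)); d /= np.linalg.norm(d, axis=1, keepdims=True)
e = rng.standard_normal((N, n)); e /= np.linalg.norm(e, axis=1, keepdims=True)

rho = np.mean(d @ v >= kappa)                               # varrho = Pr(d^T v >= kappa)
p_mirror = np.mean(np.abs(d @ v) >= kappa)                  # D = {d, -d}: equals 2*varrho
p_indep = np.mean(np.maximum(d @ v, e @ v) >= kappa)        # independent pair: <= 2*varrho
```

The mirrored pair spends both chances on disjoint events, so it attains the upper bound 2ϱ that an independent pair can only approach.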
So, a new approach and proof technique
A new proof technique for establishing global rates and WCC bounds for randomized algorithms for which:
    the new iterate depends on some object (directions, models);
    the quality of the object is favorable with a certain probability.
The technique is based on:
    counting the number of iterations for which the quality is favorable;
    examining the probabilistic behavior of this number.
What we obtain is a global rate of O(1/√k), with overwhelmingly high probability.
Examples of application
This technique has also been applied to:
Trust-region methods based on probabilistic models [Gratton, Royer, LNV, and Zhang, 2017 IMAJNA] — when models are built iteratively in some random fashion and exhibit good accuracy with sufficiently high probability.
Problems with linear constraints [Gratton, Royer, LNV, and Zhang, 2017] — random generation can be efficiently done in subspaces identified within approximate tangent cones.
Global Optimization part of the talk
Our goal is now to develop a solver for a global optimization problem of the type
    min_{x ∈ Ω} f(x),
where Ω ⊂ R^n is a box. We want to do this:
    relying on probabilistic direct search, which is an efficient, simple, and rigorous local solver;
    taking advantage of parallel computing;
    having an approach capable of identifying several global or local minimizers of interest.
REMEMBER: probabilistic DS (polling with 2 directions)
Choose x_0 and α_0. For k = 0, 1, 2, ... (until α_k is sufficiently small):
Search step (optional).
Poll step: select D_k = [d, −d] RANDOMLY and try to find x_k + α_k d_k (d_k ∈ D_k) such that
    f(x_k + α_k d_k) < f(x_k) − ρ(α_k),  e.g. ρ(α) = α²/2.
SUCCESS: move, x_{k+1} = x_k + α_k d_k, and increase the step size, α_{k+1} = γ α_k (γ = 2).
UNSUCCESS: stay, x_{k+1} = x_k, and decrease the step size, α_{k+1} = θ α_k (θ = 1/2).
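The recalled local solver fits in a few lines of Python (a minimal sketch, not the authors' implementation; names and defaults are ours):

```python
import numpy as np

def prob_direct_search(f, x0, alpha0=1.0, theta=0.5, gamma=2.0,
                       alpha_tol=1e-8, max_iter=20_000, seed=0):
    """Direct search polling D_k = [d, -d], with d uniform on the unit sphere."""
    rng = np.random.default_rng(seed)
    x, alpha = np.asarray(x0, dtype=float), alpha0
    for _ in range(max_iter):
        if alpha < alpha_tol:
            break
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)                     # uniform direction on the sphere
        rho = 0.5 * alpha**2
        for dk in (d, -d):                         # poll the mirrored pair
            if f(x + alpha * dk) < f(x) - rho:
                x, alpha = x + alpha * dk, gamma * alpha   # success
                break
        else:
            alpha *= theta                         # unsuccess
    return x

x_star = prob_direct_search(lambda x: np.sum(x**2), np.full(5, 2.0))
```

Each iteration costs at most 2 evaluations regardless of n, which is what makes the multistart framework below cheap to parallelize.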
A multistart multisplit DS framework (main idea)
MAIN IDEA:
    Consider a certain number R of DS runs guided by the poll centers (in parallel).
    Create R clusters C_j, j = 1, ..., R, of points, out of all the points where f has been evaluated.
    Split or merge the DS runs based on how the poll centers fit into the clusters.
IN THIS WAY:
    We parallelize DS runs (where poll steps are typically limited to 2n evaluations).
    We allow MULTISTART runs and SPLITTING/MERGING of runs.
A multistart multisplit DS framework (scheme)
MULTISTART: select initial points for R_0 DS runs, x_j^0, j = 1, ..., R_0 (and initial step sizes α_j^0, j = 1, ..., R_0).
At each OUTER iteration l:
    Create R_l clusters C_j^l, j = 1, ..., R_l (using all the history of points).
    Decide whether any DS run is split or merged. Let x_j^{l+1}, j = 1, ..., R_{l+1}, be the new poll centers.
INNER ITERATIONS: perform (in parallel) a certain number of iterations for all the R_{l+1} DS runs, starting from x_j^{l+1}, j = 1, ..., R_{l+1}.
Requirements and approaches
To align with typical global convergence requirements:
    The clustering step must be finite.
    The DS runs must comply with the local rules.
    When merging two DS runs, the new poll center must be the one with the lowest objective value.
We consider two approaches for the OUTER iterations:
    Using space clustering.
    Using nonconvex models built from previous function values.
The space clustering approach
    Create R_l clusters using kmeans clustering (using all previously evaluated points).
    If a cluster contains more than one poll center, remove the DS run corresponding to the poll center with the worst objective value.
    If a cluster contains no poll center, then attempt to start a new DS run from the centroid of the cluster.
POSSIBLE DRAWBACK: the kmeans approach does not take into account the previous objective function values (clustering is done in the variables space).
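The two clustering rules can be sketched as follows (an illustrative simplification; the from-scratch k-means, toy data, and the nearest-centroid assignment of poll centers are all ours):

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means, standing in for the 'kmeans clustering' of the slide."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids

def merge_and_spawn(poll_centers, f_vals, history, k):
    """Keep only the best poll center per cluster; spawn a run from the
    centroid of any cluster containing no poll center."""
    centroids = kmeans(history, k)
    c_lab = np.argmin(np.linalg.norm(poll_centers[:, None] - centroids[None], axis=2), axis=1)
    keep = []
    for j in range(k):
        idx = [i for i in range(len(poll_centers)) if c_lab[i] == j]
        if idx:                                    # merge: best center survives
            keep.append(min(idx, key=lambda i: f_vals[i]))
    new_centers = [poll_centers[i] for i in keep]
    for j in set(range(k)) - set(c_lab.tolist()):  # empty cluster: spawn from centroid
        new_centers.append(centroids[j])
    return np.array(new_centers)

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.5, size=(20, 2))        # history points in one basin
blob_b = rng.normal(10.0, 0.5, size=(20, 2))       # history points in another basin
history = np.vstack([blob_a, blob_b])
poll_centers = np.array([[0.1, 0.0], [0.4, 0.3]])  # both runs stuck near the same basin
f_vals = np.array([1.0, 2.0])
new_centers = merge_and_spawn(poll_centers, f_vals, history, k=2)
```

In this toy setting the run with the worse objective value is dropped, and the explored-but-unattended basin receives a fresh run started from its centroid.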
The space clustering approach (example)
(Figure: 2-D Ackley problem with kmeans, two clusters; two poll centers fall in the same cluster, and the centroid of the other cluster is marked.)
A nonconvex piecewise quadratic model
Modeling can instead identify valleys of convexity. (Nonconvex) piecewise quadratic models were suggested by Mangasarian et al., JOGO, 2006.
Given a sample set Y = {y_1, ..., y_{n_p}}, a nonconvex piecewise quadratic underestimate of f (with n_q quadratics) can be obtained by solving
    max_p Σ_{j=1}^{n_p} q(p; y_j)  s.t.  q(p; y_j) ≤ f(y_j), j = 1, ..., n_p,
with p containing all the quadratics' coefficients p_i = (a_i, c_i, H_i) in
    q(p; y) = min_{1 ≤ i ≤ n_q} { a_i + c_iᵀy + (1/2) yᵀH_i y } = min_{1 ≤ i ≤ n_q} q_i(y),
and where the H_i's must be assured symmetric and positive definite.
A nonconvex piecewise quadratic model (example)
(Figure: a nonconvex piecewise quadratic underestimate of a sine-type function, built from quadratics q_1, ..., q_5.)
A nonconvex piecewise quadratic model (drawbacks)
One can introduce auxiliary variables (linear objective):
    max_{p,γ} Σ_{j=1}^{n_p} γ_j
    s.t.  q(p; y_j) ≤ f(y_j), j = 1, ..., n_p,
          γ_j ≤ q_i(y_j), i = 1, ..., n_q, j = 1, ..., n_p.
Still, this optimization problem has several drawbacks:
    It is nonconvex, and has to be solved itself for global optimality to ensure effective underestimation.
    The number of variables is excessive: n_p (number of auxiliary variables) + (n + 1)(n + 2)/2 × n_q (quadratic coefficients × number of quadratics).
    The number of constraints is also high when imposing positive definiteness on all quadratics.
133 A nonconvex piecewise quadratic model (simplified) 38/69 But still, one could think of computing just a feasible point for the model. Or one can take a subset of $\bar{n}_p \le n_p$ points and consider $n_q = 1$, i.e., fit only one quadratic $q_i$, resulting in the following LP:
$$\max_{p_i,\gamma} \; \sum_{j=1}^{\bar{n}_p} \gamma_j \quad \text{s.t.} \quad q_i(y_j) \le f(y_j), \quad \gamma_j \le q_i(y_j), \;\; j = 1, \ldots, \bar{n}_p.$$
The number of variables reduces to $\bar{n}_p$ (auxiliary variables) $+ \; \frac{(n+1)(n+2)}{2}$ (one quadratic).
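Assuming a standard LP solver is at hand (here `scipy.optimize.linprog`; the 1D setup, sample points, and tolerances are illustrative and not from the talk), the one-quadratic LP can be sketched as:

```python
# 1D sketch (illustrative, not the authors' implementation) of the
# single-quadratic LP: fit q(y) = a + c*y + 0.5*h*y^2 as a tight underestimator
# of sampled values f(y_j), maximizing sum_j gamma_j subject to
# gamma_j <= q(y_j) and q(y_j) <= f(y_j).
import numpy as np
from scipy.optimize import linprog

def fit_quadratic_underestimator(ys, fs, h_min=1e-6):
    m = len(ys)
    # Variables x = (a, c, h, gamma_1, ..., gamma_m); linprog minimizes,
    # so put cost -1 on the gammas to maximize their sum.
    cost = np.zeros(3 + m)
    cost[3:] = -1.0
    A_ub, b_ub = [], []
    for j, (y, f) in enumerate(zip(ys, fs)):
        row_q = np.zeros(3 + m)
        row_q[:3] = [1.0, y, 0.5 * y * y]
        A_ub.append(row_q); b_ub.append(f)        # q(y_j) <= f(y_j)
        row_g = -row_q.copy()
        row_g[3 + j] = 1.0
        A_ub.append(row_g); b_ub.append(0.0)      # gamma_j - q(y_j) <= 0
    # In 1D, h >= h_min is a simplified stand-in for H being PD.
    bounds = [(None, None), (None, None), (h_min, None)] + [(None, None)] * m
    res = linprog(cost, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[0], res.x[1], res.x[2], res

ys = [-2.0, -1.0, 0.0, 1.0, 2.0]
fs = [y ** 4 for y in ys]
a, c, h, res = fit_quadratic_underestimator(ys, fs)
```

The `h_min` bound is a 1D simplification of the positive-definiteness requirement on $H$; in $n$ dimensions, as the slides note, imposing PD requires extra (and more expensive) constraints.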
137 The nonconvex modeling approach 39/69
- Consider again R runs of DS and all previously evaluated points.
- Form R clusters of points around the R poll centers (using distances to them).
- Compute the quadratic underestimating model for each cluster of points.
- Assess how well the quadratic models fit the whole data:
$$\theta_{i,j} = \sum_{l=1}^{(n_p)_j} \left( \frac{f(y_l^j) - q_i(y_l^j)}{f(y_l^j) + 1} \right)^2, \quad i, j = 1, \ldots, R.$$
- If $\theta_{i,i}$ is large, then split DS run $i$ (quadratic $i$ does not fit cluster $i$ well).
- In addition, if $\theta_{i,j}$, $i \ne j$, is small, then merge DS runs $i$ and $j$, using the poll center with the lowest objective value.
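The split/merge logic above can be sketched as follows (the thresholds `split_tol` and `merge_tol` are hypothetical, not from the talk):

```python
# Illustrative sketch of the fit measure theta_{i,j} and the split/merge rules.

def theta(f_vals, q_vals):
    """theta_{i,j} = sum_l ((f(y_l^j) - q_i(y_l^j)) / (f(y_l^j) + 1))^2."""
    return sum(((f - q) / (f + 1.0)) ** 2 for f, q in zip(f_vals, q_vals))

def split_and_merge(theta_mat, split_tol, merge_tol):
    """Split run i when its own model fits badly; merge runs i and j when
    model i also fits cluster j well (hypothetical threshold rule)."""
    R = len(theta_mat)
    splits = [i for i in range(R) if theta_mat[i][i] > split_tol]
    merges = [(i, j) for i in range(R) for j in range(R)
              if i != j and theta_mat[i][j] < merge_tol]
    return splits, merges

# Three runs: run 1's own model fits poorly (split it); model 0 fits
# cluster 2 well and vice versa (merge runs 0 and 2).
theta_mat = [[0.2, 4.0, 0.03],
             [3.0, 7.0, 5.0],
             [0.04, 6.0, 0.1]]
splits, merges = split_and_merge(theta_mat, split_tol=1.0, merge_tol=0.05)
```

A perfect fit gives $\theta_{i,j} = 0$; the relative error $(f - q)/(f + 1)$ keeps the measure comparable across clusters with different function scales.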
143 The nonconvex modeling approach (example) 40/69 [Figure: the nonconvex piecewise underestimate of the sine-type example (quadratics q_1, ..., q_5), together with a table of the resulting θ_{i,j} values across clusters/runs.]
144 Numerical results 41/69 We provide some numerical results on a set of 102 global optimization problems. Our goals here are:
- To understand if our approach is sound, and to see if it is capable of identifying global minima.
- To show that it performs efficiently on a wide and well representative set of problems.
- To develop a solver/code reasonably tested and tuned.
The results are presented using performance profiles. The solver starts with 10 poll centers in serial, and a maximum of 10 simultaneous runs is imposed. A comparison is made with PSwarm (a direct search solver that uses a particle swarm heuristic as a search step).
151 Performance profiles (# function evaluations) 42/69 [Figure: performance profiles for the number of function evaluations, comparing MMDS (2n dirs, kmeans), MMDS (2n dirs, nonconvex), MMDS (2 opposite, kmeans), MMDS (2 opposite, nonconvex), and PSwarm.]
152 Performance profiles (quality of final f) 43/69 [Figure: performance profiles for the final objective function values, comparing MMDS (2n dirs, kmeans), MMDS (2n dirs, nonconvex), MMDS (2 opposite, kmeans), MMDS (2 opposite, nonconvex), and PSwarm.]
153 Numerical results (application problem) 44/69 Is this approach capable of computing different global minima for an application problem? Consider optimizing the orientation of a part to be built by additive manufacturing (3D printing), where the staircase effect is to be minimized:
$$\min_{(\theta_x, \theta_y)} \; \sum_j \begin{cases} h^2 A_j \, |d \cdot n_j|^2 & \text{if } |d \cdot n_j| \ne 1, \\ 0 & \text{otherwise,} \end{cases}$$
where $h$ is the slicing thickness, $A_j$ is the area of the $j$-th triangle, $d$ is the slicing direction, and $n_j$ is the $j$-th triangle's normal. This is a 2D problem where $(\theta_x, \theta_y) \in [0, 180]^2$ are the rotation angles (in degrees) about the x-axis and y-axis, respectively, in three-dimensional space.
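The objective above can be sketched as follows; the rotation convention (rotate the z-axis about x, then about y) and the triangle data are assumptions for illustration, and the formula itself is reconstructed from the slide:

```python
# Hedged sketch of the staircase-effect objective: rotate the slicing
# direction by (theta_x, theta_y) and accumulate h^2 * A_j * (d . n_j)^2
# over triangles with |d . n_j| != 1.
import math

def rotate_z_axis(theta_x_deg, theta_y_deg):
    # Slicing direction d: the z-axis rotated about x, then about y
    # (assumed rotation order).
    tx, ty = math.radians(theta_x_deg), math.radians(theta_y_deg)
    # Rx applied to (0, 0, 1) gives (0, -sin tx, cos tx); then apply Ry.
    x, y, z = 0.0, -math.sin(tx), math.cos(tx)
    return (math.cos(ty) * x + math.sin(ty) * z,
            y,
            -math.sin(ty) * x + math.cos(ty) * z)

def staircase(theta_x, theta_y, triangles, h):
    """triangles: list of (area A_j, unit normal n_j as a 3-tuple)."""
    d = rotate_z_axis(theta_x, theta_y)
    total = 0.0
    for area, n in triangles:
        dn = sum(di * ni for di, ni in zip(d, n))
        if abs(abs(dn) - 1.0) > 1e-12:   # faces parallel to slices contribute 0
            total += h * h * area * dn * dn
    return total
```

Triangles whose normals are aligned with the slicing direction ($|d \cdot n_j| = 1$) are sliced cleanly and add nothing; tilted faces accumulate staircase error proportional to their area.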
156 A used part 45/69 We consider a complex aerospace part for numerical testing. Nearly triangles are needed to define the part.
157 The part/object to be considered 46/69 Slices 5mm thick are taken along the z-axis, resulting in 64 slices. [Figure: the part before optimization.]
158 Objective function landscape 47/69 [Figure: objective function landscape over (θ_x, θ_y).]
159 Numerical results (space clustering) 48/69 to 52/69 [Figures: progression of the space clustering runs over the objective landscape.]
164 Numerical results (space clustering) 53/69 The solver stops at (θ_x, θ_y) = (90, 180), with f = e+05, taking 124 function evaluations. The space clustering approach determined only one of the global minimizers!
166 Numerical results (nonconvex modeling) 54/69 to 67/69 [Figures: the fitted quadratics q_1, q_2, ... and the minimizers of the q_i's over successive iterations of the nonconvex modeling runs.]
180 Numerical results (nonconvex modeling) 68/69 The solver now stops at (θ_x, θ_y) = (90, 180) and (θ_x, θ_y) = (90, 0), both with f = e+05, taking 2152 function evaluations. The nonconvex modeling approach determined both global minimizers of the problem!
182 The result of optimization 69/69 Slices 5mm thick are taken along the z-axis. [Figure: the part after optimization.]
6.841 Advanced Complexity Theory March 9, 2009 Lecture 10 Lecturer: Madhu Sudan Scribe: Asilata Bapat Meeting to talk about final projects on Wednesday, 11 March 2009, from 5pm to 7pm. Location: TBA. Includes
More informationShiqian Ma, MAT-258A: Numerical Optimization 1. Chapter 3. Gradient Method
Shiqian Ma, MAT-258A: Numerical Optimization 1 Chapter 3 Gradient Method Shiqian Ma, MAT-258A: Numerical Optimization 2 3.1. Gradient method Classical gradient method: to minimize a differentiable convex
More informationLAGRANGE MULTIPLIERS
LAGRANGE MULTIPLIERS MATH 195, SECTION 59 (VIPUL NAIK) Corresponding material in the book: Section 14.8 What students should definitely get: The Lagrange multiplier condition (one constraint, two constraints
More informationA glimpse into convex geometry. A glimpse into convex geometry
A glimpse into convex geometry 5 \ þ ÏŒÆ Two basis reference: 1. Keith Ball, An elementary introduction to modern convex geometry 2. Chuanming Zong, What is known about unit cubes Convex geometry lies
More informationThe simplex algorithm
The simplex algorithm The simplex algorithm is the classical method for solving linear programs. Its running time is not polynomial in the worst case. It does yield insight into linear programs, however,
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationConvex Optimization in Classification Problems
New Trends in Optimization and Computational Algorithms December 9 13, 2001 Convex Optimization in Classification Problems Laurent El Ghaoui Department of EECS, UC Berkeley elghaoui@eecs.berkeley.edu 1
More informationSemidefinite and Second Order Cone Programming Seminar Fall 2012 Project: Robust Optimization and its Application of Robust Portfolio Optimization
Semidefinite and Second Order Cone Programming Seminar Fall 2012 Project: Robust Optimization and its Application of Robust Portfolio Optimization Instructor: Farid Alizadeh Author: Ai Kagawa 12/12/2012
More informationTowards stability and optimality in stochastic gradient descent
Towards stability and optimality in stochastic gradient descent Panos Toulis, Dustin Tran and Edoardo M. Airoldi August 26, 2016 Discussion by Ikenna Odinaka Duke University Outline Introduction 1 Introduction
More informationSTATIONARITY RESULTS FOR GENERATING SET SEARCH FOR LINEARLY CONSTRAINED OPTIMIZATION
STATIONARITY RESULTS FOR GENERATING SET SEARCH FOR LINEARLY CONSTRAINED OPTIMIZATION TAMARA G. KOLDA, ROBERT MICHAEL LEWIS, AND VIRGINIA TORCZON Abstract. We present a new generating set search (GSS) approach
More informationNonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control
Nonlinear Programming (Hillier, Lieberman Chapter 13) CHEM-E7155 Production Planning and Control 19/4/2012 Lecture content Problem formulation and sample examples (ch 13.1) Theoretical background Graphical
More informationLecture: Local Spectral Methods (3 of 4) 20 An optimization perspective on local spectral methods
Stat260/CS294: Spectral Graph Methods Lecture 20-04/07/205 Lecture: Local Spectral Methods (3 of 4) Lecturer: Michael Mahoney Scribe: Michael Mahoney Warning: these notes are still very rough. They provide
More informationIntroduction. New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems
New Nonsmooth Trust Region Method for Unconstraint Locally Lipschitz Optimization Problems Z. Akbari 1, R. Yousefpour 2, M. R. Peyghami 3 1 Department of Mathematics, K.N. Toosi University of Technology,
More informationRandomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming
Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationOverparametrization for Landscape Design in Non-convex Optimization
Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,
More information