The optimistic principle applied to function optimization


1 The optimistic principle applied to function optimization
Rémi Munos (Google DeepMind; INRIA Lille, SequeL team)
LION 9, 2015

2 The optimistic principle applied to function optimization
Optimistic principle: play optimally in the most favorable world compatible with observations.
Function optimization: Lipschitz optimization.
Extensions:
- Relax the Lipschitz assumption
- Design tractable algorithms
- Deal with an unknown metric
- Handle noise
Numerical experiments.

3 Optimization of a deterministic Lipschitz function
Problem: find online the maximum of $f : X \to \mathbb{R}$, assumed to be Lipschitz:
$$|f(x) - f(y)| \le L \|x - y\|_2 =: \ell(x, y).$$
Protocol: for each time step $t = 1, 2, \dots, n$:
- Select a point $x_t \in X$
- Observe $f(x_t)$
After $n$ evaluations, return a recommendation $x(n)$.
Loss: $r_n = f^* - f(x(n))$, where $f^* = \sup_{x \in X} f(x)$.

4 Example in 1d
(Figure: function $f$, its maximum $f^*$, and the evaluation $f(x_t)$ at $x_t$.)
By the Lipschitz property, the evaluation of $f$ at $x_t$ provides a first upper bound on $f$:
$$B(x) = f(x_t) + \ell(x, x_t).$$

5 Example in 1d (continued)
Each new point refines the upper bound:
$$B(x) = \min_t \, [f(x_t) + \ell(x, x_t)].$$

6 Example in 1d (continued)
Question: where should one sample the next point?
Answer: select the point with the highest upper bound!
Optimism in the face of (partial-observation) uncertainty.
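This selection rule is easy to sketch in Python for the 1-d case; the grid-based maximization of $B$ and all names below are illustrative assumptions, not the talk's implementation:

    import numpy as np

    def lipschitz_optimize(f, L, n, grid_size=1000):
        """Optimistic optimization of an L-Lipschitz f on [0, 1]."""
        grid = np.linspace(0.0, 1.0, grid_size)
        xs, ys = [0.5], [f(0.5)]                 # arbitrary first query
        for _ in range(n - 1):
            # Upper bound B(x) = min_t [f(x_t) + L |x - x_t|], on a finite grid
            B = np.min([y + L * np.abs(grid - x) for x, y in zip(xs, ys)], axis=0)
            x_next = float(grid[np.argmax(B)])   # optimism: query the highest bound
            xs.append(x_next)
            ys.append(f(x_next))
        return xs[int(np.argmax(ys))]            # x(n): best evaluated point

    # Example with the test function used later in the talk and L = 14:
    # x_best = lipschitz_optimize(lambda x: (np.sin(13*x)*np.sin(27*x)+1)/2, 14, 150)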

7 Several issues
1. The Lipschitz assumption is too strong.
2. Finding the maximum of the upper-bounding function is computationally difficult!
3. What if we don't know the metric $\ell$?
4. How do we handle noise?

8 Local smoothness property
Assumption: $f$ is locally smooth around its maximum w.r.t. a semi-metric $\ell$:
$$f(x^*) - f(x) \le \ell(x, x^*) \quad \text{for all } x \in X.$$
(Figure: $f$ lies above the envelope $f(x^*) - \ell(x, x^*)$.)

9 Local smoothness is enough!
Upper-bounding function: $B(x) = \min_t \, [f(x_t) + \ell(x, x_t)]$.
Property: $B(x^*) \ge f(x^*)$.
The optimistic principle only requires a true bound at the maximum.

10 Efficient implementation
Use a hierarchical partitioning of the space; the partitioning can be represented by a tree. (Example figure not transcribed.)

11 Deterministic Optimistic Optimization (DOO)
For $t = 1$ to $n$:
- Evaluate the function at the center $x_i$ of each cell $i$.
- Split the cell with the highest bound:
$$I_t = \arg\max_i \Big[ f(x_i) + \underbrace{\max_{x \in X_i} \ell(x, x_i)}_{\operatorname{diam}_\ell(X_i)} \Big].$$
Return $x(n) \stackrel{\mathrm{def}}{=} \arg\max_{x_t,\, 1 \le t \le n} f(x_t)$.
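A compact Python rendering of DOO on $X = [0, 1]$ with binary splits; this is a sketch under these assumptions, with a user-supplied cell-diameter bound delta(h), not the authors' reference code:

    import heapq

    def doo(f, delta, n):
        """DOO on X = [0, 1]: cells are intervals; the cell with the highest
        b-value f(center) + delta(depth) is split into two children."""
        c = 0.5
        fc = f(c)                                   # evaluate the root's center
        heap = [(-(fc + delta(0)), 0, (0.0, 1.0))]  # max-heap via negated keys
        best_x, best_f = c, fc
        t = 1
        while t < n:
            _, h, (lo, hi) = heapq.heappop(heap)    # cell with the highest bound
            mid = (lo + hi) / 2
            for child in ((lo, mid), (mid, hi)):    # split it, evaluate child centers
                if t >= n:
                    break
                c = (child[0] + child[1]) / 2
                fc = f(c)
                t += 1
                if fc > best_f:
                    best_x, best_f = c, fc
                heapq.heappush(heap, (-(fc + delta(h + 1)), h + 1, child))
        return best_x                               # x(n): best evaluated point

    # E.g. DOO fitted to l_1(x, y) = 14 |x - y|: delta = lambda h: 14 * 0.5 ** h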

12 Analysis of DOO
Theorem 1. The loss $r_n \stackrel{\mathrm{def}}{=} f(x^*) - f(x(n))$ of DOO satisfies
$$r_n = O(n^{-1/d}) \text{ for } d > 0, \qquad r_n = O(e^{-cn}) \text{ for } d = 0,$$
where $d$ is the near-optimality dimension of $f$ w.r.t. $\ell$: there exists $C$ such that, for all $\epsilon$, the set of $\epsilon$-optimal states $X_\epsilon \stackrel{\mathrm{def}}{=} \{x \in X : f(x) \ge f^* - \epsilon\}$ can be covered by $C\epsilon^{-d}$ $\ell$-balls of radius $\epsilon$.

13 Example 1
Assume the function is piecewise linear at its maximum: $f(x^*) - f(x) = \Theta(\|x - x^*\|)$.
Using $\ell(x, y) = \|x - y\|$, it takes $O(\epsilon^0) = O(1)$ balls of radius $\epsilon$ to cover $X_\epsilon$. Thus $d = 0$.

14 Example 2
Assume the function is locally quadratic around its maximum: $f(x^*) - f(x) = \Theta(\|x - x^*\|^2)$.
For $\ell(x, y) = \|x - y\|$, it takes $O(\epsilon^{-D/2})$ balls of radius $\epsilon$ to cover $X_\epsilon$ (of volume $O(\epsilon^{D/2})$). Thus $d = D/2$.

15 Example 2 (continued)
Same locally quadratic function: $f(x^*) - f(x) = \Theta(\|x - x^*\|^2)$.
For $\ell(x, y) = \|x - y\|^2$, it takes $O(\epsilon^0)$ $\ell$-balls of radius $\epsilon$ to cover $X_\epsilon$. Thus $d = 0$.
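These covering numbers can be checked with a few lines of Python; an illustrative computation (the helper name is hypothetical) for $D = 1$, $f(x) = -|x|^\alpha$ with maximum at $x^* = 0$:

    import math

    def n_balls(alpha, beta, eps):
        """Number of l-balls of radius eps needed to cover
        X_eps = {x in [0, 1] : |x|**alpha <= eps}, where l(x, y) = |x - y|**beta.
        An l-ball of radius eps is a Euclidean interval of radius eps**(1/beta)."""
        radius_X = eps ** (1.0 / alpha)        # X_eps = [0, eps^(1/alpha)]
        radius_ball = eps ** (1.0 / beta)
        return math.ceil(radius_X / radius_ball)

    print(n_balls(2, 1, 1e-6))   # ~ eps^(-1/2) = 1000 balls -> d = 1/2 (= D/2, D = 1)
    print(n_balls(2, 2, 1e-6))   # 1 ball -> d = 0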

16 Example 3
Assume the function has square-root behavior around its maximum: $f(x^*) - f(x) = \Theta(\|x - x^*\|^{1/2})$.
For $\ell(x, y) = \|x - y\|^{1/2}$ we have $d = 0$.

17 About the local smoothness assumption
Assume $X = [0, 1]^D$ and $f$ is locally equivalent to a polynomial of degree $\alpha > 0$ around its maximum (i.e., $f$ is $\alpha$-smooth): $f(x^*) - f(x) = \Theta(\|x - x^*\|^\alpha)$.
Consider the semi-metric $\ell(x, y) = \|x - y\|^\beta$ for some $\beta > 0$.
- If $\alpha = \beta$, then $d = 0$.
- If $\alpha > \beta$, then $d = D(\frac{1}{\beta} - \frac{1}{\alpha}) > 0$.
- If $\alpha < \beta$, then the function is not locally smooth w.r.t. $\ell$.
DOO heavily depends on our knowledge of the true local smoothness.
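For $\alpha > \beta$ the value of $d$ follows from a covering argument; a sketch of the computation, under the $\Theta(\|x - x^*\|^\alpha)$ assumption above:

    % X_eps is (up to constants) a Euclidean ball of radius eps^{1/alpha};
    % an l-ball of radius eps is a Euclidean ball of radius eps^{1/beta}.
    N(\varepsilon) \asymp \left(\frac{\varepsilon^{1/\alpha}}{\varepsilon^{1/\beta}}\right)^{D}
      = \varepsilon^{-D\left(\frac{1}{\beta}-\frac{1}{\alpha}\right)},
    \qquad \text{hence } d = D\Big(\frac{1}{\beta}-\frac{1}{\alpha}\Big).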

18 Several possible metrics
A function may be locally smooth w.r.t. several metrics:
- $\ell_1(x, y) = 14 \|x - y\|$ (i.e., $f$ is globally Lipschitz)
- $\ell_2(x, y) = 222 \|x - y\|^2$ (i.e., $f$ is locally quadratic)

19 Experiments
Using $\ell_1(x, y) = 14 \|x - y\|$ (i.e., $f$ is globally Lipschitz), $n = 150$.

20 Experiments (continued)
Using $\ell_2(x, y) = 222 \|x - y\|^2$ (i.e., $f$ is locally quadratic), $n = 150$.

21 Experiments (continued)
Loss $r_n$ for different values of $n$, for a uniform grid and for DOO with the two semi-metrics $\ell_1$ ($d = 1/2$) and $\ell_2$ ($d = 0$). (Table values not transcribed.)

22 What if the smoothness is unknown?
Assume $f$ is locally smooth w.r.t. some unknown metric $\ell$.
Question: is it possible to implement the optimistic principle?
Answer: be optimistic for all possible metrics!
Simultaneous Optimistic Optimization (SOO) [Munos, 2011]:
- Expand several cells simultaneously.
- SOO expands at most one cell per depth (cell size).
- SOO expands a cell only if its value is larger than the value of all cells of the same or larger size.
(Inspired by the DIRECT algorithm [Jones et al., 1993].)

23 SOO algorithm
Input: the maximum depth function $t \mapsto h_{\max}(t)$.
Initialization: $\mathcal{T}_1 = \{(0, 0)\}$ (root node). Set $t = 1$.
while true do
    Set $v_{\max} = -\infty$.
    for $h = 0$ to $\min(\operatorname{depth}(\mathcal{T}_t), h_{\max}(t))$ do
        Select the leaf $(h, i) \in \mathcal{L}_t$ of depth $h$ with maximal value $f(x_{h,i})$.
        if $f(x_{h,i}) > v_{\max}$ then
            Expand the node $(h, i)$. Set $v_{\max} = f(x_{h,i})$. Set $t = t + 1$.
            if $t = n$ then return $x(n) = \arg\max_{(h,i) \in \mathcal{T}_n} f(x_{h,i})$.
        end if
    end for
end while
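An illustrative Python version for $X = [0, 1]$ with ternary splits; the split rule, re-evaluation of child centers, and the choice $h_{\max}(t) = \sqrt{t}$ are assumptions consistent with the slides, not the reference implementation:

    import math

    def soo(f, n):
        """SOO on X = [0, 1], ternary splits, h_max(t) = sqrt(t) (a common choice)."""
        leaves = {0: [(f(0.5), 0.0, 1.0)]}          # depth -> list of (value, lo, hi)
        t = 1
        best_f, best_x = leaves[0][0][0], 0.5
        while t < n:
            v_max = -math.inf
            for h in range(min(max(leaves), int(math.sqrt(t))) + 1):
                cells = leaves.get(h)
                if not cells:
                    continue
                i = max(range(len(cells)), key=lambda j: cells[j][0])
                v, lo, hi = cells[i]
                if v > v_max:                       # expand at most one cell per depth
                    cells.pop(i)
                    w = (hi - lo) / 3
                    for k in range(3):              # evaluate the three children's centers
                        if t >= n:
                            break
                        clo, chi = lo + k * w, lo + (k + 1) * w
                        fc = f((clo + chi) / 2)
                        t += 1
                        if fc > best_f:
                            best_f, best_x = fc, (clo + chi) / 2
                        leaves.setdefault(h + 1, []).append((fc, clo, chi))
                    v_max = v
        return best_x                               # x(n): best evaluated point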

(Slides 24-47: figures only; no text transcribed.)

48 Performance of SOO
Theorem 2 (Munos, 2014). For any semi-metric $\ell$ such that $f$ is locally smooth w.r.t. $\ell$, let $d$ be the near-optimality dimension of $f$ w.r.t. $\ell$. Then
$$r_n = \tilde{O}(n^{-1/d}) \text{ if } d > 0, \qquad r_n = O(e^{-c\sqrt{n}}) \text{ if } d = 0.$$
Remarks:
- The algorithm does not use the knowledge of $\ell$, so the analysis holds for the best possible choice of $\ell$.
- SOO does almost as well as DOO optimally fitted.

49 Numerical experiments
Again for the function $f(x) = (\sin(13x) \sin(27x) + 1)/2$: loss for different values of $n$, for a uniform grid, DOO with $\ell_1$, DOO with $\ell_2$, and SOO. (Table values not transcribed.)

50 The case d = 0 is non-trivial!
Example: $f$ is locally $\alpha$-smooth around its maximum, for some $\alpha > 0$: $f(x^*) - f(x) = \Theta(\|x - x^*\|^\alpha)$.
The SOO algorithm does not require the knowledge of $\ell$. Using $\ell(x, y) = \|x - y\|^\alpha$ in the analysis, all assumptions are satisfied (with $\gamma = 3^{-\alpha/D}$ and $d = 0$), thus the loss of SOO is $r_n = O(3^{-\sqrt{n}\,\alpha/(CD)})$.
This is almost as good as DOO optimally fitted!
(Extends to the case $f(x^*) - f(x) = \sum_{i=1}^{D} c_i |x_i - x_i^*|^{\alpha_i}$.)

51 The case d = 0
More generally, any function whose upper and lower envelopes around $x^*$ have the same shape, i.e., there exist $c > 0$ and $\eta > 0$ such that
$$\min(\eta, c\,\ell(x, x^*)) \le f(x^*) - f(x) \le \ell(x, x^*) \quad \text{for all } x \in X,$$
has near-optimality dimension $d = 0$ (w.r.t. the metric $\ell$).
(Figure: $f$ squeezed between the envelopes $f(x^*) - c\,\ell(x, x^*)$ and $f(x^*) - \ell(x, x^*)$.)

52 Example of functions for which d = 0
$\ell(x, y) = c \|x - y\|^2$. Thus $r_n = O(e^{-c\sqrt{n}})$.

53 Example of functions with d = 0
$\ell(x, y) = c \|x - y\|^{1/2}$. Thus $r_n = O(e^{-c\sqrt{n}})$.

54 d = 0?
$\ell(x, y) = c \|x - y\|^{1/2}$.

55 d > 0
$$f(x) = 1 - \sqrt{x} + (-x^2 + \sqrt{x})\,(\sin(1/x^2) + 1)/2$$
The lower envelope is of order 1/2 whereas the upper one is of order 2. We deduce that $d = 3/2$ and $r_n = O(n^{-2/3})$.

56 How to handle noise?
The evaluation of $f$ at $x_t$ is perturbed by noise: $y_t = f(x_t) + \epsilon_t$, with $\mathbb{E}[\epsilon_t \mid x_t] = 0$.
(Figure: noisy observations of $f$.)

57 Stochastic SOO (StoSOO)
Extends SOO to stochastic evaluations:
- Select the cells $X_i$ (at most one per depth) according to SOO, based on the UCBs
$$\hat{\mu}_{i,t} + c \sqrt{\frac{\log n}{T_i(t)}},$$
and get one more value $y_t = f(x_i) + \epsilon_t$ of $f$ at $x_i$.
- If $T_i(t) \ge k$, then split the cell $X_i$.
Remark: this really looks like UCT [Kocsis, Szepesvári, 2006], except that several cells are selected at each round. It implements the optimistic principle (here UCB) at several scales.
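The per-cell statistics behind these UCBs can be sketched as follows; a minimal illustration in which the class, the constant $c$, and the split threshold $k$ are assumptions matching the slide, not the paper's code:

    import math

    class Cell:
        """Statistics StoSOO keeps per cell X_i: running mean of noisy
        evaluations at the cell's center, and the pull count T_i(t)."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
            self.total, self.pulls = 0.0, 0

        def update(self, y):               # one more noisy value y_t = f(x_i) + eps_t
            self.total += y
            self.pulls += 1

        def b_value(self, n, c=1.0):       # UCB: mu_hat + c * sqrt(log n / T_i(t))
            if self.pulls == 0:
                return math.inf            # unvisited cells are maximally optimistic
            return self.total / self.pulls + c * math.sqrt(math.log(n) / self.pulls)

        def ready_to_split(self, k):       # split X_i once T_i(t) >= k
            return self.pulls >= k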

58 Performance of StoSOO
Theorem 3 (Valko, Carpentier, Munos, 2013). For any semi-metric $\ell$ such that $f$ is locally smooth w.r.t. $\ell$, assume the near-optimality dimension of $f$ w.r.t. $\ell$ is $d = 0$. By choosing $k = \frac{n}{(\log n)^3}$ and $h_{\max}(n) = (\log n)^{3/2}$, the expected loss of StoSOO is
$$\mathbb{E}\, r_n = O\Big(\frac{(\log n)^2}{\sqrt{n}}\Big).$$
This is almost as good as the best known bandit algorithms in metric spaces optimally fitted (HOO [Bubeck, Munos, Szepesvári, Stoltz, 2011], Zooming [Kleinberg, Slivkins, Upfal, 2008]). See also [Bull, 2013].

59 Numerical evaluation of SOO
We consider the CEC 2014 Competition on Single-Objective Real-Parameter Numerical Optimization: several test functions defined on $[-100, 100]^D$ in dimension $D \in \{10, \dots, 100\}$, using a finite budget of function evaluations.
Joint work with Philippe Preux and Michal Valko.
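To get the flavor of these benchmarks with the soo sketch shown earlier, one can plug in, e.g., a one-dimensional (unshifted, unrotated) Rastrigin function rescaled to $[0, 1]$; this is an illustration only, not the CEC 2014 setup:

    import math

    def neg_rastrigin_1d(x):
        """Rastrigin (normally minimized) negated for maximization;
        [0, 1] is mapped to the CEC domain [-100, 100]."""
        z = 200.0 * x - 100.0
        return -(10.0 + z * z - 10.0 * math.cos(2.0 * math.pi * z))

    # x_best = soo(neg_rastrigin_1d, n=10_000)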

60 Shifted and Rotated Rosenbrock's Function

61 Shifted and Rotated Ackley's Function

62 Shifted and Rotated Weierstrass Function

63 Shifted and Rotated Griewank's Function

64 Shifted Rastrigin's Function

65 Shifted Schwefel's Function

66 Shifted and Rotated HappyCat Function

67 Shifted and Rotated Expanded Griewank's plus Rosenbrock's Function

68 Shifted and Rotated Expanded Scaffer's Function

69 Composition Function 1

70 Composition Function 2

71 Composition Function 3

72 Performance of SOO for the CEC 2014 competition
Relative error $f(x(n))/f^*$ for $n = 10^5$ function evaluations, for each benchmark function (Rosenbrock, Ackley, Weierstrass, Griewank, Rastrigin, Schwefel, HappyCat, Griewank + Rosenbrock, Scaffer, Composition 1-3) in dimensions 10, 30, 50, and 100. (Table values not transcribed.)

73 Post-processing with a local optimizer
5% of the budget is allocated to post-optimization (using NLopt).
Relative error with and without post-optimization (PO) for each benchmark function, in dimensions 10 and 30. (Table values not transcribed.)

74 Conclusions
The optimistic principle:
- Explore the most promising areas of the search space first.
- Multi-scale exploration: a natural transition from global to local search.
Dimension is not the only measure of complexity: rather, it is the local smoothness of the function around the maximum, and how much we know about this smoothness.

75 Reference
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning. Foundations and Trends in Machine Learning, Vol. 7, No. 1, pp. 1-129, 2014.
Thanks!!!
