Sequential optimization and application to design

1 Sequential optimization and application to the design of experiments
Nicolas Vayatis
Séminaire Aristote, Ecole Polytechnique, 23 October 2014

2 Joint work with Emile Contal (computer scientist, PhD student) and:
- David Buffoni, computer scientist, postdoc researcher
- Vianney Perchet, mathematician, associate professor (maître de conférences), Université Paris-Diderot
- Alexandre Robicquet, undergraduate student in applied mathematics
- Themistoklis Stefanakis, civil engineer, PhD in Fluid Mechanics at CMLA
and tsunami experts:
- Frédéric Dias, Professor, School of Mathematical Sciences, University College Dublin
- Costas Synolakis, Professor, Department of Civil and Environmental Engineering, University of California San Diego

3 Tsunami Amplification Phenomena
[Figure: numerical simulations of tsunami amplification generated by a conical island]

4 Outline:
- Setup: sequential and batch-sequential optimization
- Gaussian Process setup
- Two novel algorithms for sequential optimization
- Regret bounds
- Numerical experiments

5 Problem statement
Optimization of an unknown function
- Parameter space $\mathcal{X} \subset \mathbb{R}^d$, compact and convex
- Unknown objective function $f(x) \in \mathbb{R}$ for all $x \in \mathcal{X}$
- Noisy measurement $y = f(x) + \epsilon$, where $\epsilon \overset{\text{iid}}{\sim} \mathcal{N}(0, \eta^2)$
- Find the parameter vector $x^\star$ maximizing $f(x)$
Sequential setup and performance metric
- Queries $x_1, x_2, \dots$ and feedback $y_1, y_2, \dots$
- Goal: minimize the cumulative regret after $T$ iterations:
  $R_T = \sum_{t=1}^{T} \big( f(x^\star) - f(x_t) \big)$

6 Problem statement
Optimization of an unknown function
- Parameter space $\mathcal{X} \subset \mathbb{R}^d$, compact and convex
- Unknown objective function $f(x) \in \mathbb{R}$ for all $x \in \mathcal{X}$
- Noisy measurement $y = f(x) + \epsilon$, where $\epsilon \overset{\text{iid}}{\sim} \mathcal{N}(0, \eta^2)$
- Find the parameter vector $x^\star$ maximizing $f(x)$
Batch-sequential setup and performance metric
- Batch queries $\{x_t^1, \dots, x_t^K\}$ and batch feedback $\{y_t^1, \dots, y_t^K\}$ at each $t$
- Goal: minimize the cumulative regret after $T$ iterations:
  $R_T^K = \sum_{t=1}^{T} \big( f(x^\star) - \max_{1 \le k \le K} f(x_t^k) \big)$
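As a small illustration of the regret defined on this slide, the sketch below computes the batch cumulative regret from a table of true objective values. The function name `cumulative_regret` and the toy numbers are hypothetical and only serve the example; with batches of size K = 1 the same function gives the sequential regret of the previous slide.

```python
def cumulative_regret(f_star, batch_values):
    """Batch cumulative regret: sum over t of f(x*) - max_k f(x_t^k).

    batch_values[t] holds the true values f(x_t^1), ..., f(x_t^K) of the
    batch queried at iteration t. With K = 1 this is the sequential regret.
    """
    return sum(f_star - max(batch) for batch in batch_values)


# Hypothetical toy run: optimum f(x*) = 1.0, T = 3 iterations, K = 2 queries each.
values = [[0.2, 0.5], [0.7, 0.6], [0.9, 1.0]]
regret = cumulative_regret(1.0, values)  # (1 - 0.5) + (1 - 0.7) + (1 - 1.0)
```

Only the best point of each batch counts, so a batch is penalized at an iteration only if none of its K queries is close to the optimum.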

7 Constraints
Challenges
- Large number of parameters
- High level of noise
- Expensive evaluations
- Cope with non-concave functions: exploration vs. exploitation
Example: Tsunamis
- 5 parameters
- Each simulation takes 2 hours of computation
- A regular grid with 10 values per parameter needs $10^5$ points
- A naive approach would take 23 years of computation
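The 23-year figure follows directly from the numbers on this slide. A quick check, assuming the simulations run one after another on a single machine and counting 365-day years (both assumptions, not stated on the slide):

```python
values_per_parameter = 10
n_parameters = 5
hours_per_simulation = 2

# Full regular grid: 10 values on each of the 5 axes.
grid_points = values_per_parameter ** n_parameters   # 100000 simulations
total_hours = grid_points * hours_per_simulation      # 200000 hours
total_years = total_hours / (24 * 365)                # roughly 22.8 years
```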

8 Sequential Optimization
[Figure: objective vs. parameter, showing the four observations $(x_1, y_1), \dots, (x_4, y_4)$ and the next query $x_5$ still to be chosen]

10 Batch-Sequential Optimization
[Figure: objective vs. parameter, showing the four observations $(x_1, y_1), \dots, (x_4, y_4)$ and a batch of three queries $x_5^1, x_5^2, x_5^3$ still to be chosen]

11 Main Approaches to Query Selection
- Experimental design [Fedorov, 1972]...
- Bayesian optimization (BO) [Moore and Schneider, 1995] [Srinivas et al., 2010]...
- Active learning [Carpentier et al., 2011] [Chen and Krause, 2013]
- Multi-armed bandits [Auer, 2002] [Audibert et al., 2011]...

12 Classical Strategies for Query Selection in BO
- Maximum Mean (MM) or PMAX [Moore and Schneider, 1995]
- Maximum Upper Interval (MUI) or IEMAX [Moore and Schneider, 1995]
- Maximum Probability of Improvement (MPI) [Mockus, 1989]
- Maximum Expected Improvement (MEI) [Jones et al., 1998] [Locatelli, 1997]
- Gaussian Process Upper Confidence Bound (GP-UCB) [Cox and John, 1997] [Auer, 2002] [Srinivas et al., 2010] [Desautels et al., 2012]

14 Gaussian Processes Framework
Definition
$f \sim GP(m, k)$, with mean function $m : \mathcal{X} \to \mathbb{R}$ and covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, when for all $x_1, \dots, x_n$,
  $\big( f(x_1), \dots, f(x_n) \big) \sim \mathcal{N}(\mu, C)$,
with $\mu[x_i] = m(x_i)$ and $C[x_i, x_j] = k(x_i, x_j)$.
Probabilistic smoothness assumption
- Nearby locations are highly correlated
- Large local variations have low probability

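15 Typical Kernels
- Polynomial with degree $\alpha \in \mathbb{N}$: for $c \in \mathbb{R}$,
  $\forall x_1, x_2, \quad k(x_1, x_2) = (x_1^\top x_2 + c)^\alpha$
- Radial Basis Function with length-scale parameter $b > 0$:
  $\forall x_1, x_2, \quad k(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|^2}{2 b^2} \right)$
- Matérn with length-scale $b > 0$ and order $\nu$:
  $\forall x_1, x_2, \quad k(x_1, x_2) = \frac{2^{1-\nu}}{\Gamma(\nu)} \, \Phi_\nu\left( \frac{\sqrt{2\nu} \, \|x_1 - x_2\|}{b} \right)$
  where $\Phi_\nu(z) = z^\nu K_\nu(z)$ and $K_\nu$ is the modified Bessel function of the second kind of order $\nu$.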
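The three kernels can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' code: the function names and default parameter values are assumptions, and the Matérn kernel is restricted to the orders $\nu \in \{1/2, 3/2, 5/2\}$, for which the Bessel expression above reduces to a closed form, so that no special-function library is needed.

```python
import math


def sq_dist(x1, x2):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(x1, x2))


def polynomial_kernel(x1, x2, c=1.0, alpha=2):
    """k(x1, x2) = (x1^T x2 + c)^alpha."""
    return (sum(a * b for a, b in zip(x1, x2)) + c) ** alpha


def rbf_kernel(x1, x2, b=1.0):
    """k(x1, x2) = exp(-||x1 - x2||^2 / (2 b^2))."""
    return math.exp(-sq_dist(x1, x2) / (2 * b ** 2))


def matern_kernel(x1, x2, b=1.0, nu=2.5):
    """Matern kernel for nu in {0.5, 1.5, 2.5} only (closed forms)."""
    z = math.sqrt(2 * nu) * math.sqrt(sq_dist(x1, x2)) / b
    if nu == 0.5:
        return math.exp(-z)
    if nu == 1.5:
        return (1 + z) * math.exp(-z)
    if nu == 2.5:
        return (1 + z + z * z / 3) * math.exp(-z)
    raise ValueError("only nu in {0.5, 1.5, 2.5} is handled in this sketch")
```

For example, `rbf_kernel([0.0, 0.0], [0.0, 0.0])` returns 1.0, and the value decays toward 0 as the two points move apart, which is the smoothness assumption of the previous slide in kernel form.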

16 Gaussian Processes Examples
[Figure: 1D Gaussian processes with different covariance functions]

17 Gaussian Process Interpolation
Bayesian inference [Rasmussen and Williams, 2006]
At iteration $t$, with observations $Y_t$ for the query points $X_t$, the posterior mean and variance are given at every point $x$ in the search space by:
  $\mu_t(x) = k_t(x)^\top C_t^{-1} Y_t$   (1)
  $\sigma_t^2(x) = k(x, x) - k_t(x)^\top C_t^{-1} k_t(x)$,   (2)
where $C_t = K_t + \eta^2 I$, $k_t(x) = [k(x_\tau, x)]_{1 \le \tau \le t}$, and $K_t = [k(x_\tau, x_{\tau'})]_{1 \le \tau, \tau' \le t}$.
Interpretation
- posterior mean $\mu_t$: prediction
- posterior variance $\sigma_t^2$: uncertainty
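Equations (1) and (2) can be evaluated directly with a linear solve. The sketch below does so for a one-dimensional search space using only the standard library; the RBF kernel, its length-scale of 0.2, the noise level $\eta = 0.1$ and the helper names are all assumptions made for the example.

```python
import math


def rbf(a, b, scale=0.2):
    """RBF kernel on scalars; the length-scale 0.2 is an arbitrary choice."""
    return math.exp(-(a - b) ** 2 / (2 * scale ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def posterior(X, Y, x, noise=0.1):
    """Posterior mean and variance at x, following equations (1) and (2)."""
    # C_t = K_t + eta^2 I
    C = [[rbf(a, b) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_x = [rbf(a, x) for a in X]
    mean = sum(k * w for k, w in zip(k_x, solve(C, Y)))
    var = rbf(x, x) - sum(k * w for k, w in zip(k_x, solve(C, k_x)))
    return mean, max(var, 0.0)  # clamp tiny negative round-off


# Two hypothetical observations.
X, Y = [0.0, 0.5], [1.0, -1.0]
mu_near, var_near = posterior(X, Y, 0.0)  # at an observed point
mu_far, var_far = posterior(X, Y, 5.0)    # far from all observations
```

At an observed point the mean stays close to the observation and the variance is small; far from the data the prediction falls back to the prior (mean 0, variance 1), which matches the interpretation given on the slide.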

18 Example: Bayesian inference with 4 observations

19 Mutual Information: An Important Ingredient
Information gain
The information gain on $f$ at $X_T$ is the mutual information between $f$ and $Y_T$. For a GP distribution with $K_T$ the kernel matrix of $X_T$:
  $I_T(X_T) = \frac{1}{2} \log\det(I + \eta^{-2} K_T)$.
We define $\gamma_T = \max_{|X| = T} I_T(X)$, the maximum information gain by a sequence of $T$ query points.
Empirical lower bound
For GPs with bounded variance, we have [Srinivas et al. 2012]:
  $\widehat{\gamma}_T = \sum_{t=1}^{T} \sigma_t^2(x_t) \le C \gamma_T$, where $C = \frac{2}{\log(1 + \eta^{-2})}$.
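The information gain of a given set of query points is a single log-determinant. The sketch below evaluates it for scalar inputs; the RBF kernel, its length-scale, the noise level and the helper names are assumptions made for the example, and the log-determinant is obtained from a hand-written Cholesky factorization to stay within the standard library.

```python
import math


def rbf(a, b, scale=0.2):
    """RBF kernel on scalars; the length-scale 0.2 is an arbitrary choice."""
    return math.exp(-(a - b) ** 2 / (2 * scale ** 2))


def log_det_spd(A):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return 2 * sum(math.log(L[i][i]) for i in range(n))


def information_gain(X, noise=0.1):
    """I(X) = 1/2 log det(I + eta^-2 K) for the query points X."""
    n = len(X)
    A = [[(1.0 if i == j else 0.0) + rbf(X[i], X[j]) / noise ** 2
          for j in range(n)] for i in range(n)]
    return 0.5 * log_det_spd(A)


gain_spread = information_gain([0.0, 1.0])   # two well-separated queries
gain_clumped = information_gain([0.5, 0.5])  # the same point queried twice
```

Well-separated queries carry more information than repeated ones, which is why $\gamma_T$ is defined as a maximum over query sets.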

20 Mutual Information Examples
The parameter $\gamma_T$ is the maximum mutual information about $f$ obtainable by a sequence of $T$ queries.
- Linear kernel: $\gamma_T = O(d \log T)$
- RBF kernel: $\gamma_T = O\big( (\log T)^{d+1} \big)$
- Matérn kernel: $\gamma_T = O\big( T^\alpha \log T \big)$, where $\alpha = \frac{d(d+1)}{2\nu + d(d+1)} \le 1$.

22 Upper and Lower Confidence Bounds
Definition
Fix $0 < \delta < 1$, and consider upper/lower confidence bounds on $f$ defined by:
  $f_t^+(x) = \mu_t(x) + \sqrt{\beta_t \sigma_t^2(x)}$
  $f_t^-(x) = \mu_t(x) - \sqrt{\beta_t \sigma_t^2(x)}$
Property (Srinivas, 2012)
Fix $\delta > 0$. With the choice $\beta_t(\delta) = O\big( \log(t/\delta) \big)$, we have:
  $\forall x \in \mathcal{X}, \ \forall t \ge 1, \quad f(x) \in \big[ f_t^-(x), f_t^+(x) \big]$,
with probability at least $1 - \delta$.

23 Relevant Region $R_t$
Definition
The Relevant Region $R_t$ is defined by
  $y_t^\bullet = \max_{x \in \mathcal{X}} f_t^-(x)$,
  $R_t = \big\{ x \in \mathcal{X} \mid f_t^+(x) \ge y_t^\bullet \big\}$.
Property
We have $x^\star \in R_t$ with probability at least $1 - \delta$.

24 Relevant Region $R_t$
[Figure: illustration of the relevant region $R_t$]

25 Upper Confidence Bound and Pure Exploration
UCB policy: $k = 1$
Achieves a tradeoff between exploitation and exploration ($\mu_t$ vs. $\sigma_t^2$):
  $x_{t+1}^1 \in \arg\max_{x \in R_t^+} f_t^+(x)$,
where $R_t^+ = \big\{ x \in \mathcal{X} \mid \mu_t(x) + 2 \sqrt{\beta_t \sigma_t^2(x)} \ge y_t^\bullet \big\}$.
PE policy: $k = 2, \dots, K$
Selects the most uncertain points inside the Relevant Region:
  $x_{t+1}^k \in \arg\max_{x \in R_t^+} \sigma_t^{(k)}(x)$, for $2 \le k \le K$,
where $\sigma_t^{(k)}(x)$ is the updated uncertainty using $x_{t+1}^1, \dots, x_{t+1}^{k-1}$.

26 Algorithm 1: GP-UCB-PE
  $\beta_t$ slowly increasing
  for $t = 1, 2, \dots$ do
    Compute $\mu_t$ and $\sigma_t^2$ with Bayesian inference on $y_1^1, \dots, y_{t-1}^K$
    Compute $R_t^+$
    $x_{t+1}^1 \leftarrow \arg\max_{x \in R_t^+} f_t^+(x)$
    for $k = 2, \dots, K$ do
      Update $\sigma_t^{(k)}$
      $x_{t+1}^k \leftarrow \arg\max_{x \in R_t^+} \sigma_t^{(k)}(x)$
    Query $x_{t+1}^1, \dots, x_{t+1}^K$
    Observe $y_{t+1}^1, \dots, y_{t+1}^K$
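One iteration of Algorithm 1 can be sketched on a finite grid as follows. This is a simplified illustration under stated assumptions, not the authors' Matlab implementation: the search space is a one-dimensional grid, the kernel is an RBF with length-scale 0.2, $\beta_t$ is held at the constant 4.0, and the names `select_batch`, `posterior`, `solve` and `rbf` are hypothetical. The pure-exploration step relies on the fact, visible in equation (2), that the posterior variance does not depend on the observed values, so pending queries can be added with placeholder observations.

```python
import math


def rbf(a, b, scale=0.2):
    """RBF kernel on scalars; the length-scale 0.2 is an arbitrary choice."""
    return math.exp(-(a - b) ** 2 / (2 * scale ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def posterior(X, Y, x, noise=0.1):
    """GP posterior mean and variance at x given observations (X, Y)."""
    C = [[rbf(a, b) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_x = [rbf(a, x) for a in X]
    mean = sum(k * w for k, w in zip(k_x, solve(C, Y)))
    var = rbf(x, x) - sum(k * w for k, w in zip(k_x, solve(C, k_x)))
    return mean, max(var, 0.0)


def select_batch(X, Y, grid, K=3, beta=4.0, noise=0.1):
    """One GP-UCB-PE iteration: choose K query points from the grid."""
    stats = [posterior(X, Y, x, noise) for x in grid]
    upper = {x: m + math.sqrt(beta * v) for x, (m, v) in zip(grid, stats)}
    y_dot = max(m - math.sqrt(beta * v) for m, v in stats)  # best lower bound
    # Enlarged relevant region R_t^+.
    region = [x for x, (m, v) in zip(grid, stats)
              if m + 2 * math.sqrt(beta * v) >= y_dot]
    batch = [max(region, key=upper.get)]  # k = 1: UCB policy
    for _ in range(K - 1):                # k = 2..K: pure exploration
        pending = X + batch
        dummy = [0.0] * len(pending)      # the variance ignores observed values
        batch.append(max(region,
                         key=lambda x: posterior(pending, dummy, x, noise)[1]))
    return batch


# Hypothetical data: two past observations on [0, 1].
X, Y = [0.1, 0.9], [0.2, 0.8]
grid = [i / 20 for i in range(21)]
batch = select_batch(X, Y, grid, K=3)
```

Because each pure-exploration pick conditions on the points already in the batch, the K queries spread out inside the relevant region instead of piling up at the UCB maximizer.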

27 The GP-UCB-PE algorithm [Contal et al., 2013]
[Figure: illustration of the GP-UCB-PE algorithm]

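29 GP-MI: A Novel Algorithm for Sequential Optimization
Algorithm 2: GP-MI
  $\widehat{\gamma}_0 \leftarrow 0$, $\alpha$ fixed
  for $t = 1, 2, \dots$ do
    Compute $\mu_t$ and $\sigma_t^2$ using Bayesian inference
    $\phi_t(x) \leftarrow \sqrt{\alpha} \left( \sqrt{\sigma_t^2(x) + \widehat{\gamma}_{t-1}} - \sqrt{\widehat{\gamma}_{t-1}} \right)$
    $x_t \leftarrow \arg\max_{x \in \mathcal{X}} \mu_t(x) + \phi_t(x)$
    $\widehat{\gamma}_t \leftarrow \widehat{\gamma}_{t-1} + \sigma_t^2(x_t)$
    Query at $x_t$ and observe $y_t$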
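Algorithm 2 can be sketched on a finite grid as follows. Again this is an illustration under stated assumptions rather than the authors' code: the search space is a one-dimensional grid, the kernel is an RBF with length-scale 0.2, the objective is a hypothetical toy function evaluated without noise to keep the run deterministic, and $\alpha$ is set to $\log(2/\delta)$ with $\delta = 0.05$, a choice suggested by the form of the regret bound but not stated in the algorithm itself.

```python
import math


def rbf(a, b, scale=0.2):
    """RBF kernel on scalars; the length-scale 0.2 is an arbitrary choice."""
    return math.exp(-(a - b) ** 2 / (2 * scale ** 2))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z


def posterior(X, Y, x, noise=0.1):
    """GP posterior mean and variance at x given observations (X, Y)."""
    C = [[rbf(a, b) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_x = [rbf(a, x) for a in X]
    mean = sum(k * w for k, w in zip(k_x, solve(C, Y)))
    var = rbf(x, x) - sum(k * w for k, w in zip(k_x, solve(C, k_x)))
    return mean, max(var, 0.0)


def gp_mi(f, grid, T=15, alpha=math.log(2 / 0.05), noise=0.1):
    """Run T iterations of the GP-MI rule; return the queries and their values."""
    X, Y, gamma_hat = [], [], 0.0
    for _ in range(T):
        best_x, best_score, best_var = None, -math.inf, 0.0
        for x in grid:
            mu, var = posterior(X, Y, x, noise)
            phi = math.sqrt(alpha) * (math.sqrt(var + gamma_hat)
                                      - math.sqrt(gamma_hat))
            if mu + phi > best_score:
                best_x, best_score, best_var = x, mu + phi, var
        gamma_hat += best_var  # accumulate the variance at the chosen query
        X.append(best_x)
        Y.append(f(best_x))    # noise-free evaluation, for reproducibility only
    return X, Y


# Hypothetical toy objective with its maximum at x = 0.7.
grid = [i / 20 for i in range(21)]
queries, values = gp_mi(lambda x: -(x - 0.7) ** 2, grid)
```

As $\widehat{\gamma}_t$ grows, the exploration bonus $\phi_t$ shrinks for a given variance, so the rule shifts from exploring uncertain points to exploiting the posterior mean.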

31 Regret bound on GP-UCB-PE
General result
Consider $f \sim GP(0, k)$ with $k(x, x) \le 1$ for all $x$. Then, with probability at least $1 - \delta$:
  $R_T^K = O\left( \sqrt{\frac{T}{K} \, \gamma_{TK} \log T} \right)$
Specialized results
- Linear kernel: $R_T^K = O\left( \log(TK) \sqrt{dT/K} \right)$
- RBF kernel: $R_T^K = O\left( \sqrt{(T/K) \big( \log(TK) \big)^{d+2}} \right)$
- Matérn kernel: $R_T^K = O\left( \log(TK) \sqrt{T^{\alpha+1} K^{\alpha-1}} \right)$

32 Two Competitors for Batch Strategies
GP-BUCB = GP Batch UCB [Desautels et al., 2012]
- Batch estimation based on updates $\mu_t^k(x)$ of $\mu_t(x)$
- Regret bound with RBF kernel, with an exponential factor due to initialization:
  $O\left( \exp\left( \left( \frac{2d}{e} \right)^d \right) \sqrt{\frac{T}{K}} \, \log(TK) \right)$
SM-UCB = Simulation Matching with UCB [Azimi et al., 2010]
- Selects a batch of points that matches the expected behavior
- Based on a greedy K-medoid algorithm to screen irrelevant data points
- No regret bound available

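33 Regret bound for GP-MI
General result
Consider $f \sim GP(0, k)$ with $k(x, x) \le 1$ for all $x$. Then, with probability at least $1 - \delta$:
  $R_T \le 5 \sqrt{C \gamma_T \log\left( \frac{2}{\delta} \right)} + 4 \sqrt{\log\left( \frac{2}{\delta} \right)}$
where $C = \frac{2}{\log(1 + \eta^{-2})}$.
Specialized results
- For linear kernel: $R_T = O\big( \sqrt{d \log T} \big)$
- For RBF kernel: $R_T = O\big( \sqrt{(\log T)^{d+1}} \big)$
- For Matérn kernel: $R_T = O\big( \sqrt{T^\alpha \log T} \big)$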

35 Experiments
- Competitors for batch-sequential: GP-BUCB and SM-UCB
- Competitors for sequential: GP-UCB and GP-EI
- Assessment: synthetic problems and real-data benchmarks
[Figure: (a) Himmelblau's function, (b) Gaussian mixture]

36 Numerical results for the batch-sequential strategy GP-UCB-PE
[Figure: regret $r_t^K$ versus iteration $t$ for GP-BUCB, SM-UCB and GP-UCB-PE on six problems: (a) Generated GP, (b) Himmelblau, (c) Gaussian mixture, (d) Mackey-Glass, (e) Tsunamis, (f) Abalone]

37 Numerical results for the sequential strategy GP-MI
[Figure: average regret $R_T / T$ for UCB, EI and MI on six problems: (g) Generated GP ($d = 4$), (h) Himmelblau, (i) Gaussian mixture, (j) Mackey-Glass, (k) Tsunamis, (l) Branin]

38 Conclusion 1/2
GP-UCB-PE and GP-MI
- Generic sequential optimization methods
- Good theoretical guarantees for cumulative regret (what about simple regret?)
- Efficient in practice
- Easy to implement
- Matlab source code online at:

39 Conclusion 2/2
Further developments
- In progress: nonparametric approach (active learning)
- In progress: application to other fields and to multi-objective optimization
  - Automotive industry
  - Wind power-based energy plants
- Challenge: how to set physical priors in the design space?
