The Geometry of Gaussian Processes and Bayesian Optimization. Emile Contal, CMLA, ENS Cachan
Outline

- Background: Global Optimization and Gaussian Processes
- The Geometry of Gaussian Processes and the Chaining Trick
- Algorithm and Theoretical Results
- Experiments on Real and Synthetic Data Sets
- Further Results: Quadratic Forms, Noise-Free Optimization and Lower Bounds
Sequential Black-Box Optimization

Problem statement. Let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ could be a subset of $\mathbb{R}^D$, non-parametric, etc. We consider the problem of finding the maximum of $f$, denoted by $f^\star = \sup_{x \in \mathcal{X}} f(x)$, via successive (expensive) queries $f(x_1), f(x_2), \dots$

Noisy observations. At iteration $T$ we choose $x_{T+1}$ using the previous noisy observations $Y_T = \{y_1, \dots, y_T\}$, where for all $t \le T$: $y_t = f(x_t) + \epsilon_t$ with $\epsilon_t \overset{iid}{\sim} \mathcal{N}(0, \eta^2)$.
Objective

Regret (unknown in practice). The efficiency of a policy is measured via the simple or the cumulative regret:
$S_T = \min_{t \le T} \{f^\star - f(x_t)\}$, $R_T = \sum_{t \le T} \big(f^\star - f(x_t)\big)$.

Goals:
- $S_T \to 0$ as $T \to \infty$, as fast as possible (e.g. numerical optimization),
- $R_T = o(T)$, as small as possible (e.g. clinical trials, ad campaigns).

Our aim is to obtain upper bounds on $S_T$ and $R_T$ with high probability.
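The two regrets above are simple to compute once $f^\star$ and the queried values are known. A minimal sketch (the value $f^\star = 1.0$ and the query values are hypothetical, only for illustration):

```python
import numpy as np

def simple_regret(f_star, f_values):
    """S_T = min_{t<=T} (f* - f(x_t)): gap of the best query made so far."""
    return f_star - np.max(f_values)

def cumulative_regret(f_star, f_values):
    """R_T = sum_{t<=T} (f* - f(x_t)): total gap accumulated over all queries."""
    return float(np.sum(f_star - np.asarray(f_values)))

# Hypothetical values: optimum f* = 1.0, three evaluations of f at the queries
f_star, queries = 1.0, [0.25, 0.5, 0.75]
s_T = simple_regret(f_star, queries)      # best query so far is 0.75
r_T = cumulative_regret(f_star, queries)  # 0.75 + 0.5 + 0.25
```

Note that both regrets are unknown to the algorithm in practice, since $f^\star$ is unknown; they are only used for analysis and benchmarking.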
Exploration/Exploitation Tradeoff

[Figure: a 1D objective with four noisy observations $(x_1, y_1), \dots, (x_4, y_4)$; where should the next query $x_5$ go?]
Gaussian Processes

Definition. $f \sim \mathcal{GP}(m, k)$ with mean function $m : \mathcal{X} \to \mathbb{R}$ and covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^+$, when for all $x_1, \dots, x_n \in \mathcal{X}$ we have:
$\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}\big([m(x_i)]_i, [k(x_i, x_j)]_{i,j}\big).$

Probabilistic smoothness assumption:
- nearby locations are highly correlated,
- large local variations have low probability.

Examples of covariance functions:
- squared exponential (RBF): $k(x, y) = \exp\big(-\|x - y\|_2^2 / (2l^2)\big)$,
- rational quadratic: $k(x, y) = \big(1 + \|x - y\|_2^2 / (2\alpha l^2)\big)^{-\alpha}$.
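The two covariance functions above are direct to implement. A minimal sketch (function names and the default lengthscale $l = 1$ are my own choices, not from the talk):

```python
import numpy as np

def squared_exponential(x, y, l=1.0):
    """k(x, y) = exp(-||x - y||_2^2 / (2 l^2))"""
    sq = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return float(np.exp(-sq / (2.0 * l ** 2)))

def rational_quadratic(x, y, l=1.0, alpha=1.0):
    """k(x, y) = (1 + ||x - y||_2^2 / (2 alpha l^2))^(-alpha)"""
    sq = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return float((1.0 + sq / (2.0 * alpha * l ** 2)) ** (-alpha))
```

Both kernels equal 1 at $x = y$ and decay with distance, which matches the smoothness assumption: nearby locations are highly correlated.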
Gaussian Processes: Examples

[Figure: samples of 1D Gaussian processes with different covariance functions.]
Posterior Distribution

Bayesian inference (Rasmussen and Williams, 2006). Given the observations $Y_t = [y_1, \dots, y_t]$ at the query locations $X_t = (x_1, \dots, x_t)$, we compute for all $x \in \mathcal{X}$:
$\mu_t(x) := \mathbb{E}[f(x) \mid X_t, Y_t] = k_t(x)^\top C_t^{-1} Y_t$
$\sigma_t^2(x) := \mathbb{V}[f(x) \mid X_t, Y_t] = k(x, x) - k_t(x)^\top C_t^{-1} k_t(x)$
where $C_t = K_t + \eta^2 I$, $K_t = [k(x_i, x_j)]_{x_i, x_j \in X_t}$ and $k_t(x) = [k(x, x_i)]_{x_i \in X_t}$.

Interpretation: the posterior mean $\mu_t$ is the prediction; the posterior deviation $\sigma_t$ is the uncertainty.
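The posterior formulas above can be sketched in a few lines. This is a minimal illustrative implementation (a squared exponential kernel, the noise level $\eta = 0.1$ and a direct matrix inverse are assumptions for readability; a Cholesky solve would be preferable numerically):

```python
import numpy as np

def se_kernel(x, y, l=1.0):
    """Squared exponential kernel, broadcasting over arrays."""
    return np.exp(-((x - y) ** 2) / (2.0 * l ** 2))

def gp_posterior(X_t, Y_t, X_new, eta=0.1):
    """Posterior mean mu_t and variance sigma_t^2 at X_new, given (X_t, Y_t)."""
    X_t, Y_t = np.asarray(X_t, float), np.asarray(Y_t, float)
    K_t = se_kernel(X_t[:, None], X_t[None, :])
    C_inv = np.linalg.inv(K_t + eta ** 2 * np.eye(len(X_t)))  # C_t = K_t + eta^2 I
    mu, var = [], []
    for x in X_new:
        k_x = se_kernel(x, X_t)                          # vector k_t(x)
        mu.append(k_x @ C_inv @ Y_t)                     # mu_t(x) = k_t(x)^T C_t^{-1} Y_t
        var.append(se_kernel(x, x) - k_x @ C_inv @ k_x)  # sigma_t^2(x)
    return np.array(mu), np.array(var)
```

Near an observation the mean tracks the data and the variance collapses; far away, the posterior reverts to the prior $\mathcal{N}(0, k(x, x))$.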
Gaussian Process Confidence Bounds

[Figure: posterior mean of a Gaussian process with its confidence envelope.]
Setup: Summary

Assumptions:
- $f \sim \mathcal{GP}(0, k)$ with known covariance $k$,
- $y_t = f(x_t) + \epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, \eta^2)$ with known $\eta$.

Regrets:
$S_T = \min_{t \le T} \{f^\star - f(x_t)\}$, $R_T = \sum_{t \le T} \big(f^\star - f(x_t)\big)$.
Related Work

Bayesian optimization:
- Bull (2011): Expected Improvement algorithm
- Hennig et al. (2012): Entropy Search algorithm

Upper confidence bounds:
- Freitas et al. (2012): deterministic GP
- Srinivas et al. (2012): GP-UCB
- Djolonga et al. (2013): high-dimensional GP

Chaining:
- Grunewalder et al. (2010): known-horizon bandits
- Gaillard and Gerchinovitz (2015): online regression
The Geometry of Gaussian Processes and the Chaining Trick
Upper Confidence Bounds (UCB)

Strategy. If we have, with high probability, $f^\star - f(x_t) \le \mathrm{UCB}_t(x_t)$, then we can control the regret:
$R_T \le \sum_{t \le T} \mathrm{UCB}_t(x_t)$ and $S_T \le \frac{1}{T} \sum_{t \le T} \mathrm{UCB}_t(x_t)$.
Canonical Pseudo-Distance

A first UCB. Fix $x^\star \in \mathcal{X}$ and let
$d_t^2(x^\star, x_t) = \mathbb{V}[f(x^\star) - f(x_t) \mid X_t, Y_t] = \sigma_t^2(x^\star) - 2k_t(x^\star, x_t) + \sigma_t^2(x_t).$
For all $\delta > 0$, set $\beta_\delta = \sqrt{2 \log \delta^{-1}}$; with probability at least $1 - \delta$:
$f(x^\star) - f(x_t) \le \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t).$

Union bound over $\mathcal{X}$. With $|\mathcal{X}| < \infty$, we have with probability at least $1 - |\mathcal{X}|\,\delta$:
$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le \sup_{x^\star \in \mathcal{X}} \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t).$

But what if $|\mathcal{X}| = \infty$?
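The canonical pseudo-distance is just the posterior variance of the increment $f(x) - f(y)$, so it can be computed from posterior covariances. A minimal sketch under the same assumptions as before (squared exponential kernel, $\eta = 0.1$, direct inverse; the helper names are mine):

```python
import numpy as np

def se_kernel(x, y, l=1.0):
    return np.exp(-((x - y) ** 2) / (2.0 * l ** 2))

def posterior_cov(x, y, X_t, eta=0.1):
    """Posterior covariance k_t(x, y) = k(x, y) - k_t(x)^T C_t^{-1} k_t(y)."""
    X_t = np.asarray(X_t, float)
    K = se_kernel(X_t[:, None], X_t[None, :])
    C_inv = np.linalg.inv(K + eta ** 2 * np.eye(len(X_t)))
    kx, ky = se_kernel(x, X_t), se_kernel(y, X_t)
    return se_kernel(x, y) - kx @ C_inv @ ky

def d_t(x, y, X_t, eta=0.1):
    """Canonical pseudo-distance: d_t^2(x, y) = Var[f(x) - f(y) | X_t, Y_t]."""
    d2 = (posterior_cov(x, x, X_t, eta)
          - 2.0 * posterior_cov(x, y, X_t, eta)
          + posterior_cov(y, y, X_t, eta))
    return float(np.sqrt(max(d2, 0.0)))  # clip tiny negative round-off
```

As expected of a pseudo-distance, $d_t(x, x) = 0$ and $d_t$ is symmetric, but distinct points can be at distance zero if they are perfectly correlated under the posterior.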
Covering Numbers

ε-net. $T \subseteq \mathcal{X}$ is an ε-net of $\mathcal{X}$ for $d_t$ iff for all $x \in \mathcal{X}$ there exists $x' \in T$ such that $d_t(x, x') \le \varepsilon$.

Covering number. The covering number $N(\mathcal{X}, d_t, \varepsilon)$ is the size of the smallest ε-net.
An ε-net for the Euclidean Distance

[Figure: a set $\mathcal{X}$ covered by Euclidean balls of radius ε centred on the net points.]
Hierarchical Covers

Assumption (w.l.o.g.): for all $x, y \in \mathcal{X}$, $k(x, y) \le 1$. Since $d_t(x, y)$ is then controlled by $k$, any single point of $\mathcal{X}$ is a 1-net of $\mathcal{X}$ for $d_t$.

Hierarchical covers. Let $\mathcal{T} = (T_i)_{i \ge 0}$ be such that for all $i \ge 0$:
- $T_i$ is an $\varepsilon_i$-net with $\varepsilon_i = 2^{-i}$,
- $T_i \subseteq T_{i+1}$.
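On a finite candidate set, nested $\varepsilon_i$-nets can be built by extending each net into the next one. A minimal sketch, assuming a generic distance function and a simple "add any uncovered point" greedy rule (the helper names `greedy_net` and `hierarchical_covers` are mine, not from the talk):

```python
def greedy_net(points, eps, dist, seed_net=()):
    """Greedily extend seed_net into an eps-net: add uncovered points until all covered."""
    net = list(seed_net)
    uncovered = [p for p in points if all(dist(p, c) > eps for c in net)]
    while uncovered:
        net.append(uncovered[0])
        uncovered = [p for p in uncovered if dist(p, net[-1]) > eps]
    return net

def hierarchical_covers(points, depth, dist):
    """Nested nets T_0 <= T_1 <= ... <= T_depth with radii eps_i = 2^(-i)."""
    covers, net = [], []
    for i in range(depth + 1):
        net = greedy_net(points, 2.0 ** (-i), dist, seed_net=net)
        covers.append(list(net))
    return covers
```

Seeding each level with the previous net enforces the nesting $T_i \subseteq T_{i+1}$ by construction, at the cost of slightly larger nets than rebuilding each level from scratch.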
Hierarchical Covers: Illustration

[Figures: hierarchical covers of $\mathcal{X}$ at radii $\varepsilon_0 = 1$, $\varepsilon_1 = 1/2$, $\varepsilon_2 = 1/4$, $\varepsilon_3 = 1/8$.]
Localized Chaining

Projection onto $\mathcal{T}$, starting from $x_t$ toward $x^\star$. Define $\pi_i(x^\star) = \operatorname{argmin}_{x_i \in T_i \cup \{x_t\}} d_t(x^\star, x_i)$; then:
- $\pi_i(x^\star) \to_i x^\star$,
- $\pi_i(x^\star) = x_t$ if $d_t(x^\star, x_t) < \varepsilon_i$.

The chaining trick. Write the increment $f(x^\star) - f(x_t)$ as a telescoping sum along the projections:
$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) = \sup_{x^\star \in \mathcal{X}} \sum_{i : \varepsilon_i < d_t(x^\star, x_t)} \big( f(\pi_i(x^\star)) - f(\pi_{i-1}(x^\star)) \big).$
The Chaining Trick

[Figure: the chain of projections linking $x_t$ to $x^\star$ through the hierarchical nets.]
Upper Confidence Bound

Converging distances: $d_t\big(\pi_i(x^\star), \pi_{i-1}(x^\star)\big) \le \varepsilon_{i-1}$.

UCB at depth $i$ (union bound on $T_i$). For any $i \ge 1$, with probability at least $1 - |T_i|\,\delta$:
$\sup_{x^\star \in \mathcal{X}} f(\pi_i(x^\star)) - f(\pi_{i-1}(x^\star)) \le \sup_{x^\star \in \mathcal{X}} \mu_t(\pi_i(x^\star)) - \mu_t(\pi_{i-1}(x^\star)) + \beta_\delta\, \varepsilon_{i-1}.$

Final UCB. Set $\beta_{\delta,i} = \sqrt{2 \log(i^2 |T_i| \delta^{-1})}$; with probability at least $1 - \frac{\pi^2}{6}\delta$:
$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le \sup_{x^\star \in \mathcal{X}} \mu_t(x^\star) - \mu_t(x_t) + \sum_{i : \varepsilon_i < d_t(x^\star, x_t)} \varepsilon_i\, \beta_{\delta,i}.$
Algorithm and Theoretical Results
The Chaining-UCB Algorithm (Contal et al., 2015)

UCB policy:
$x_{t+1} \in \operatorname{argmax}_{x \in \mathcal{X}} \; \mu_t(x) + \sum_{i : \varepsilon_i < \sigma_t(x)} \varepsilon_i\, \beta_{\delta,i}.$

Practical remark. The algorithm only needs to compute the $\beta_{\delta,i}$ for the indices $i$ such that $\varepsilon_i > \min_{x \in \mathcal{X}} \sigma_t(x)$.
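The acquisition rule above can be sketched directly on a finite candidate grid. This is an illustrative sketch, not the authors' Matlab code: it assumes the posterior mean/deviation and the net sizes $|T_i|$ are precomputed, uses $\varepsilon_i = 2^{-i}$ and $\beta_{\delta,i} = \sqrt{2 \log(i^2 |T_i| / \delta)}$ as on the previous slide, and the function names are mine:

```python
import numpy as np

def chaining_ucb_index(mu, sigma, net_sizes, delta=0.05):
    """mu_t(x) + sum over {i : eps_i < sigma_t(x)} of eps_i * beta_{delta,i}."""
    bonus = 0.0
    for i, size_i in enumerate(net_sizes, start=1):
        eps_i = 2.0 ** (-i)
        if eps_i < sigma:  # only the levels finer than the local uncertainty contribute
            bonus += eps_i * np.sqrt(2.0 * np.log(i ** 2 * size_i / delta))
    return mu + bonus

def chaining_ucb_select(mus, sigmas, net_sizes, delta=0.05):
    """Next query: argmax of the chained UCB over the candidate grid."""
    scores = [chaining_ucb_index(m, s, net_sizes, delta) for m, s in zip(mus, sigmas)]
    return int(np.argmax(scores))
```

The exploration bonus adapts to the local geometry: points with small $\sigma_t(x)$ accumulate almost no bonus, so the tradeoff is calibrated automatically rather than through a hand-tuned $\beta_t$.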
Upper Confidence Bound: Illustration

[Figure: the chained upper confidence bound on top of the GP posterior.]
Greedy Cover

NP-hardness. Computing the $\beta_{\delta,i}$ requires the hierarchical $\varepsilon_i$-nets, and finding the smallest $\varepsilon$-net is NP-hard.

Greedy approximation (achieving the optimal approximation ratio):

    T ← ∅; X' ← X
    while X' ≠ ∅ do
        x ← argmax_{x ∈ X} |{x' ∈ X' : d(x, x') ≤ ε}|
        T ← T ∪ {x}
        X' ← X' \ {x' ∈ X' : d(x, x') ≤ ε}
    end
    return T
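The greedy pseudocode above translates almost line for line into Python. A minimal sketch for a finite point set with a generic distance function (quadratic in the number of points; the function name is mine):

```python
def greedy_cover(points, eps, dist):
    """Greedy set-cover approximation of the smallest eps-net:
    repeatedly pick the point covering the most still-uncovered points."""
    uncovered = set(range(len(points)))
    net = []
    while uncovered:
        best = max(range(len(points)),
                   key=lambda i: sum(1 for j in uncovered
                                     if dist(points[i], points[j]) <= eps))
        net.append(points[best])
        uncovered = {j for j in uncovered if dist(points[best], points[j]) > eps}
    return net
```

The loop always terminates: every uncovered point covers at least itself, so each iteration removes at least one point from `uncovered`.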
Theorem: Generic Bounds for the Chaining-UCB Algorithm

For $\delta > 0$, denoting $\sigma_t = \sigma_t(x_t)$, there exists $c_\delta \in \mathbb{R}$ such that for all $t \ge 1$:
$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le c_\delta\, \sigma_t \sqrt{6 \log \sigma_t^{-1}} + 9 \sum_{i : \varepsilon_i < \sigma_t} \varepsilon_i \sqrt{\log N(\mathcal{X}, d_t, \varepsilon_i)},$
with probability at least $1 - \delta$.
Corollary when Controlling the Metric Entropy

Assumption: there exists $D \in \mathbb{R}$ such that $N(\mathcal{X}, d_n, \varepsilon) = O(\varepsilon^{-D})$. It suffices that $d_n(x, x') = O(\|x - x'\|_2)$ and $\mathcal{X} \subseteq [0, R]^D$, e.g. for the squared exponential covariance, the Matérn covariance, etc.

Corollary:
$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le O\big(\sqrt{D}\, \sigma_t \sqrt{\log \sigma_t^{-1}}\big),$
thus $R_T \le O\big(\sqrt{D} \sum_{t=1}^T \sigma_t \sqrt{\log \sigma_t^{-1}}\big)$.
Information Gain

Lemma (Srinivas et al., 2012):
$\sum_{t=1}^T \sigma_t^2 \le O(\gamma_T),$
where $\gamma_T = \max_{X \subseteq \mathcal{X} : |X| = T} I(X)$ is the maximum information gain on $f$ from a set of $T$ observations. For a GP, $I(X) = \frac{1}{2} \log\det\big(I + \eta^{-2} K_X\big)$.

Upper bounds:
- linear covariance $k(x, y) = x^\top y$: $\gamma_T \le O(D \log T)$,
- squared exponential covariance $k(x, y) = e^{-\frac{1}{2}\|x - y\|_2^2}$: $\gamma_T \le O\big((\log T)^{D+1}\big)$,
- Matérn covariance with parameter $\nu > 1$: $\gamma_T \le O\big((\log T)\, T^a\big)$, with $a = \frac{D(D+1)}{2\nu + D(D+1)} < 1$.
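The information gain $I(X)$ is a single log-determinant. A minimal sketch (using `slogdet` for numerical stability rather than `det` directly):

```python
import numpy as np

def information_gain(K_X, eta):
    """I(X) = 1/2 * log det(I + eta^{-2} K_X) for observations with kernel matrix K_X."""
    n = K_X.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_X / eta ** 2)
    return 0.5 * logdet
```

For instance, two independent unit-variance observations ($K_X = I$, $\eta = 1$) give $I(X) = \frac{1}{2}\log\det(2I) = \log 2$, while a zero kernel matrix yields no information gain.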
Corollary for the Regret

- Linear covariance: $R_T \le O\big(\sqrt{DT} \log T\big)$,
- squared exponential covariance: $R_T \le O\big(\sqrt{T (\log T)^{D+2}}\big)$,
- Matérn covariance: $R_T \le O\big(\sqrt{\log T}\; T^a\big)$, with $a = \frac{\nu + D + D^2}{2\nu + D + D^2}$.
Experiments on Real and Synthetic Data Sets
Experiments

[Figures: simple regret $S_n$ versus iteration $n$ for Chaining-UCB, GP-UCB and Random, on four tasks: (1) Himmelblau's function, (2) a GP sample with SE kernel, (3) a wave energy converter data set, (4) a kernel on graphs, with example values of $f$ on individual graphs shown in the figure.]
Further Results: Quadratic Forms, Noise-Free Optimization and Lower Bounds
Optimization of Other Stochastic Processes

Minimal assumption: $f : \mathcal{X} \to \mathbb{R}$ is a stochastic process with a pseudo-distance $d(\cdot, \cdot)$ and a function $\psi_u(\cdot)$ such that:
$\Pr\big[f(x) - f(x') > \psi_u\big(d(x, x')\big)\big] < e^{-u}.$

Example: quadratic forms of GPs, $f(x) = \sum_{i=1}^N g_i^2(x)$ where $g_i \sim \mathcal{GP}(0, k_i)$.

Applications:
- optimization of a costly mean-square error,
- optimization of a costly Gaussian likelihood (Bayesian model calibration).
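Samples from the quadratic-form example are easy to draw on a finite grid, which is useful for visualising such a process. A minimal sketch (the jitter term and the seeded generator are implementation conveniences, not part of the model):

```python
import numpy as np

def sample_quadratic_form(X, kernels, n_draws=1, seed=0):
    """Draw f(x) = sum_i g_i(x)^2 on a finite grid X, with independent g_i ~ GP(0, k_i)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    f = np.zeros((n_draws, len(X)))
    for k in kernels:
        K = np.array([[k(a, b) for b in X] for a in X])
        # Small jitter keeps the covariance numerically positive semi-definite
        g = rng.multivariate_normal(np.zeros(len(X)), K + 1e-10 * np.eye(len(X)),
                                    size=n_draws)
        f += g ** 2
    return f
```

Note that $f$ is no longer Gaussian (it is a sum of squared Gaussians), which is exactly why the chaining analysis is stated under the generic tail assumption above rather than for GPs only.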
Noise-Free Optimization

Problem setting: $f \sim \mathcal{GP}(0, k)$ and $y_t = f(x_t)$ (no observation noise).

Algorithm:
- pre-compute the hierarchical $\varepsilon_i$-nets for $d_0$ and build the corresponding tree,
- for $x$ in the tree, let $\Delta_\delta(x) = \sum_{i > \mathrm{Depth}(x)} \varepsilon_i\, \beta_{\delta,i}$,
- evaluate $f$ at the root,
- loop: $x_{t+1} = \operatorname{argmax}_{x \in \mathrm{Childs}(x_1, \dots, x_t)} f(x) + \Delta_\delta(x)$.
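The loop above is a best-first tree search: always expand the evaluated node with the highest optimistic value $f(x) + \Delta_\delta(x)$. Below is a schematic sketch only: the dyadic interval tree over $[0, 1]$ and the geometric bound $\Delta(x) = 2^{-\mathrm{depth}(x)}$ stand in for the talk's tree of $\varepsilon_i$-nets and the chained bound $\Delta_\delta(x)$, and all names are mine:

```python
import heapq

class IntervalNode:
    """Dyadic partition of [lo, hi]; f is evaluated at the interval's centre."""
    def __init__(self, lo, hi, depth=0):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.point = 0.5 * (lo + hi)

    def children(self):
        mid = 0.5 * (self.lo + self.hi)
        return [IntervalNode(self.lo, mid, self.depth + 1),
                IntervalNode(mid, self.hi, self.depth + 1)]

def noise_free_optimize(f, delta_bound, budget=100):
    """Best-first search: always expand the node maximising f(x) + Delta(x)."""
    root = IntervalNode(0.0, 1.0)
    counter = 0  # tie-breaker so the heap never compares nodes directly
    heap = [(-(f(root.point) + delta_bound(root)), counter, root)]
    best = f(root.point)
    for _ in range(budget):
        _, _, node = heapq.heappop(heap)
        for child in node.children():
            val = f(child.point)
            best = max(best, val)
            counter += 1
            heapq.heappush(heap, (-(val + delta_bound(child)), counter, child))
    return best
```

When `delta_bound` upper-bounds the local variation of $f$ within a node's cell (as $\Delta_\delta$ does with high probability), the search never discards a cell that could still contain the optimum, which is what drives the $R_T = O(1)$ behaviour on the next slide.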
Noise-Free Optimization: Results

Property. With probability at least $1 - \delta$, $f^\star - f(x_t) \le \Delta_\delta(x_t)$.

Lemma.
$\Big|\big\{x_t : \Delta_\delta(x_t) \ge \epsilon\, (1 + \mathrm{Depth}(x_t))^{-1/2}\big\}\Big| \le O(1).$

Theorem (ongoing work). If $N(\mathcal{X}, d_0, \varepsilon) = O(\varepsilon^{-D})$, then for the previous algorithm, $R_T = O(1)$ and $S_T = O(e^{-T})$.
Lower Bounds on the Supremum of the GP

Reminder: UCB rewritten. With probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree:
$\sup_{x' \in B(x, \epsilon_h)} f(x') - f(x) \le c \sum_{i > h} \epsilon_i\, \beta_{\delta,i}.$

Theorem: LCB (ongoing work). With probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree:
$\sup_{x' \in B(x, \epsilon_h)} f(x') - f(x) \ge c' \sum_{i > h} \epsilon_i\, \beta_{\delta,i}.$
Conclusion

The Chaining-UCB algorithm:
- automatic calibration of the exploration/exploitation tradeoff,
- adapts to various settings,
- computationally tractable,
- Matlab code online.
References

Contal, E., Malherbe, C., and Vayatis, N. (2015). Optimization for Gaussian processes via chaining. NIPS Workshop on Bayesian Optimization.

Munos, R. (2011). Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5).
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters
More informationActive and Semi-supervised Kernel Classification
Active and Semi-supervised Kernel Classification Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London Work done in collaboration with Xiaojin Zhu (CMU), John Lafferty (CMU),
More informationBandit View on Continuous Stochastic Optimization
Bandit View on Continuous Stochastic Optimization Sébastien Bubeck 1 joint work with Rémi Munos 1 & Gilles Stoltz 2 & Csaba Szepesvari 3 1 INRIA Lille, SequeL team 2 CNRS/ENS/HEC 3 University of Alberta
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationOn the Complexity of Best Arm Identification in Multi-Armed Bandit Models
On the Complexity of Best Arm Identification in Multi-Armed Bandit Models Aurélien Garivier Institut de Mathématiques de Toulouse Information Theory, Learning and Big Data Simons Institute, Berkeley, March
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationKernels for Automatic Pattern Discovery and Extrapolation
Kernels for Automatic Pattern Discovery and Extrapolation Andrew Gordon Wilson agw38@cam.ac.uk mlg.eng.cam.ac.uk/andrew University of Cambridge Joint work with Ryan Adams (Harvard) 1 / 21 Pattern Recognition
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationThe Bayesian approach to inverse problems
The Bayesian approach to inverse problems Youssef Marzouk Department of Aeronautics and Astronautics Center for Computational Engineering Massachusetts Institute of Technology ymarz@mit.edu, http://uqgroup.mit.edu
More informationOnline Forest Density Estimation
Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density
More informationMidterm Review CS 7301: Advanced Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 7301: Advanced Machine Learning Vibhav Gogate The University of Texas at Dallas Supervised Learning Issues in supervised learning What makes learning hard Point Estimation: MLE vs Bayesian
More informationReliability Monitoring Using Log Gaussian Process Regression
COPYRIGHT 013, M. Modarres Reliability Monitoring Using Log Gaussian Process Regression Martin Wayne Mohammad Modarres PSA 013 Center for Risk and Reliability University of Maryland Department of Mechanical
More informationLecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu
Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationKernel Bayes Rule: Nonparametric Bayesian inference with kernels
Kernel Bayes Rule: Nonparametric Bayesian inference with kernels Kenji Fukumizu The Institute of Statistical Mathematics NIPS 2012 Workshop Confluence between Kernel Methods and Graphical Models December
More informationTwo generic principles in modern bandits: the optimistic principle and Thompson sampling
Two generic principles in modern bandits: the optimistic principle and Thompson sampling Rémi Munos INRIA Lille, France CSML Lunch Seminars, September 12, 2014 Outline Two principles: The optimistic principle
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationLecture 5: GPs and Streaming regression
Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X
More informationComputer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression
Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More information