The Geometry of Gaussian Processes and Bayesian Optimization
E. Contal — CMLA, ENS Cachan
Outline
1. Background: Global Optimization and Gaussian Processes
2. The Geometry of Gaussian Processes and the Chaining Trick
3. Algorithm and Theoretical Results
4. Experiments on real and synthetic data sets
5. Further Results: Quadratic Forms, Noise-free Optimization and Lower Bounds
Sequential Black-Box Optimization

Problem Statement
Let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ could be a subset of $\mathbb{R}^D$, a non-parametric space, ... We consider the problem of finding the maximum of $f$, denoted by
$$f^\star = \sup_{x \in \mathcal{X}} f(x),$$
via successive (expensive) queries $f(x_1), f(x_2), \dots$

Noisy Observations
At iteration $T$ we choose $x_{T+1}$ using the previous noisy observations $Y_T = \{y_1, \dots, y_T\}$, where for all $t \le T$:
$$y_t = f(x_t) + \epsilon_t \quad \text{and} \quad \epsilon_t \overset{iid}{\sim} \mathcal{N}(0, \eta^2).$$
Objective

Regret (unknown in practice)
The efficiency of a policy is measured via the simple or cumulative regret:
$$S_T = \min_{t \le T} \big\{ f^\star - f(x_t) \big\}, \qquad R_T = \sum_{t \le T} \big( f^\star - f(x_t) \big).$$

Goal
- $S_T \to 0$ as fast as possible (e.g. numerical optimization)
- $R_T = o(T)$ and as small as possible (e.g. clinical trials, ad campaigns)

Our aim is to obtain upper bounds on $S_T$ and $R_T$ that hold with high probability.
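To make the two notions concrete, here is a minimal sketch (not from the talk) computing both regrets from the sequence of query values, assuming the true maximum $f^\star$ is known for evaluation purposes:

```python
def regrets(f_star, f_queries):
    """Simple regret S_T and cumulative regret R_T after T queries.

    f_star: the (unknown in practice) maximum of f, used here for evaluation only.
    f_queries: the values f(x_1), ..., f(x_T) at the queried points.
    """
    S_T = min(f_star - ft for ft in f_queries)
    R_T = sum(f_star - ft for ft in f_queries)
    return S_T, R_T
```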
Exploration/Exploitation Tradeoff

Figure: a 1D objective with four noisy observations $(x_1, y_1), \dots, (x_4, y_4)$; where should the next query $x_5$ go?
Gaussian Processes

Definition
$f \sim GP(m, k)$ with mean function $m : \mathcal{X} \to \mathbb{R}$ and covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, when for all $x_1, \dots, x_n \in \mathcal{X}$ we have:
$$\big( f(x_1), \dots, f(x_n) \big) \sim \mathcal{N}\Big( \big[ m(x_i) \big]_i,\ \big[ k(x_i, x_j) \big]_{i,j} \Big).$$

Probabilistic Smoothness Assumption
- Nearby locations are highly correlated
- Large local variations have low probability

Examples of Covariance Functions
- Squared exponential (RBF): $k(x, y) = \exp\Big( -\frac{\|x - y\|_2^2}{2 \ell^2} \Big)$
- Rational quadratic: $k(x, y) = \Big( 1 + \frac{\|x - y\|_2^2}{2 \alpha \ell^2} \Big)^{-\alpha}$
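As an illustration (not part of the slides), a minimal NumPy sketch of these two kernels, together with one prior sample drawn on a 1D grid; the length scale `ell` and shape `alpha` are arbitrary choices:

```python
import numpy as np

def k_se(x, y, ell=1.0):
    """Squared exponential kernel: exp(-||x - y||^2 / (2 ell^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * ell ** 2))

def k_rq(x, y, ell=1.0, alpha=1.0):
    """Rational quadratic kernel: (1 + ||x - y||^2 / (2 alpha ell^2))^(-alpha)."""
    r2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return (1 + r2 / (2 * alpha * ell ** 2)) ** (-alpha)

# One sample of a centered GP prior on a grid: f(X) ~ N(0, K), K[i, j] = k(x_i, x_j).
X = np.linspace(-1, 1, 200)
K = np.array([[k_se(xi, xj) for xj in X] for xi in X])
f_sample = np.random.multivariate_normal(np.zeros(len(X)), K + 1e-10 * np.eye(len(X)))
```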
Gaussian Processes: Examples

Figure: samples of 1D Gaussian processes with different covariance functions.
Posterior Distribution

Bayesian Inference [Rasmussen and Williams, 2006]
Given the observations $Y_t = [y_1, \dots, y_t]$ at the query locations $X_t = (x_1, \dots, x_t)$, we compute for all $x \in \mathcal{X}$:
$$\mu_t(x) := \mathbb{E}[f(x) \mid X_t, Y_t] = k_t(x)^\top C_t^{-1} Y_t$$
$$\sigma_t^2(x) := \mathbb{V}[f(x) \mid X_t, Y_t] = k(x, x) - k_t(x)^\top C_t^{-1} k_t(x)$$
where $k_t(x) = [k(x_i, x)]_{x_i \in X_t}$, $C_t = K_t + \eta^2 I$ and $K_t = [k(x_i, x_j)]_{x_i, x_j \in X_t}$.

Interpretation
- posterior mean $\mu_t$: prediction
- posterior deviation $\sigma_t$: uncertainty
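A direct transcription of these two formulas (a sketch for illustration; a real implementation would use a Cholesky factorization instead of an explicit inverse):

```python
import numpy as np

def gp_posterior(k, X_t, Y_t, x, eta):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) at a single point x."""
    K_t = np.array([[k(a, b) for b in X_t] for a in X_t])       # prior Gram matrix K_t
    C_inv = np.linalg.inv(K_t + eta ** 2 * np.eye(len(X_t)))    # C_t^{-1}
    k_t = np.array([k(a, x) for a in X_t])                      # vector k_t(x)
    mu = k_t @ C_inv @ np.asarray(Y_t)
    var = k(x, x) - k_t @ C_inv @ k_t
    return mu, max(var, 0.0)  # clip tiny negative values due to round-off
```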
Gaussian Processes Confidence Bounds (figure)
Setup: Summary

Assumptions
- $f \sim GP(0, k)$ with known covariance $k$,
- $y_t = f(x_t) + \epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, \eta^2)$ with known $\eta$.

Regrets
$$S_T = \min_{t \le T} \big\{ f^\star - f(x_t) \big\}, \qquad R_T = \sum_{t \le T} \big( f^\star - f(x_t) \big).$$
Related Work

Bayesian Optimization
- Bull (2011): Expected Improvement algorithm
- Hennig et al. (2012): Entropy Search algorithm

Upper Confidence Bounds
- Freitas et al. (2012): deterministic GP
- Srinivas et al. (2012): GP-UCB
- Djolonga et al. (2013): high-dimensional GP

Chaining
- Grunewalder et al. (2010): known-horizon bandits
- Gaillard and Gerchinovitz (2015): online regression
Part 2: The Geometry of Gaussian Processes and the Chaining Trick
Upper Confidence Bounds (UCB)

Strategy
If we have, with high probability, $f^\star - f(x_t) \le \mathrm{UCB}_t(x_t)$, then we can control the regrets:
$$R_T \le \sum_{t \le T} \mathrm{UCB}_t(x_t) \qquad \text{and} \qquad S_T \le \frac{1}{T} \sum_{t \le T} \mathrm{UCB}_t(x_t).$$
Canonical Pseudo-Distance

A First UCB
Fix $x^\star \in \mathcal{X}$ and let
$$d_t^2(x^\star, x_t) = \mathbb{V}\big[ f(x^\star) - f(x_t) \mid X_t, Y_t \big] = \sigma_t^2(x^\star) - 2 k_t(x^\star, x_t) + \sigma_t^2(x_t),$$
where $k_t(\cdot, \cdot)$ denotes the posterior covariance. For all $\delta > 0$, set $\beta_\delta = \sqrt{2 \log \delta^{-1}}$; with probability at least $1 - \delta$:
$$f(x^\star) - f(x_t) \le \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t).$$

Union Bound over $\mathcal{X}$
With $|\mathcal{X}| < \infty$, we have with probability at least $1 - |\mathcal{X}|\, \delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le \sup_{x^\star \in \mathcal{X}} \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t).$$

But what if $|\mathcal{X}| = \infty$?
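A sketch of this pseudo-distance computed from the posterior covariance (illustration only, reusing the same naive matrix inverse as in the posterior sketch above):

```python
import numpy as np

def posterior_cov(k, X_t, x, y, eta):
    """Posterior covariance k_t(x, y) = k(x, y) - k_t(x)^T C_t^{-1} k_t(y)."""
    K_t = np.array([[k(a, b) for b in X_t] for a in X_t])
    C_inv = np.linalg.inv(K_t + eta ** 2 * np.eye(len(X_t)))
    kx = np.array([k(a, x) for a in X_t])
    ky = np.array([k(a, y) for a in X_t])
    return k(x, y) - kx @ C_inv @ ky

def d_t(k, X_t, x, y, eta):
    """Canonical pseudo-distance: d_t(x, y)^2 = sigma_t^2(x) - 2 k_t(x, y) + sigma_t^2(y)."""
    var_x = posterior_cov(k, X_t, x, x, eta)
    var_y = posterior_cov(k, X_t, y, y, eta)
    cov_xy = posterior_cov(k, X_t, x, y, eta)
    return np.sqrt(max(var_x - 2 * cov_xy + var_y, 0.0))
```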
Covering Numbers

$\varepsilon$-net
$T \subseteq \mathcal{X}$ is an $\varepsilon$-net of $\mathcal{X}$ for $d_t$ iff:
$$\forall x \in \mathcal{X},\ \exists x' \in T \ \text{s.t.}\ d_t(x, x') \le \varepsilon.$$

Covering Number
The covering number $N(\mathcal{X}, d_t, \varepsilon)$ is the size of the smallest $\varepsilon$-net.
An $\varepsilon$-net for the Euclidean Distance (figure)
Hierarchical Covers

Assumption (w.l.o.g.)
$\forall x, y \in \mathcal{X},\ k(x, y) \le 1$, so that $d_t(x, y)$ is bounded and any single point of $\mathcal{X}$ is a 1-net of $\mathcal{X}$ for $d_t$.

Hierarchical Covers
Let $\mathcal{T} = (T_i)_{i \ge 0}$ be such that for all $i \ge 0$:
- $T_i$ is an $\varepsilon_i$-net with $\varepsilon_i = 2^{-i}$,
- $T_i \subseteq T_{i+1}$.
Hierarchical Covers (figures)

Figure: hierarchical $\varepsilon_i$-nets at successive scales $\varepsilon_0 = 1$, $\varepsilon_1 = 1/2$, $\varepsilon_2 = 1/4$, $\varepsilon_3 = 1/8$.
Localized Chaining

Projection to $\mathcal{T}$, starting from $x_t$ toward $x^\star$
Define $\pi_i(x^\star) = \operatorname{argmin}_{x_i \in T_i \cup \{x_t\}} d_t(x^\star, x_i)$; then
- $\pi_i(x^\star) \xrightarrow{i \to \infty} x^\star$,
- $\pi_i(x^\star) = x_t$ if $d_t(x^\star, x_t) < \varepsilon_i$.

The Chaining Trick
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) = \sup_{x^\star \in \mathcal{X}} \sum_{i :\, \varepsilon_i < d_t(x^\star, x_t)} f(\pi_i(x^\star)) - f(\pi_{i-1}(x^\star)).$$
The Chaining Trick (figure)

Figure: the chain of projections $\pi_i(x^\star)$ linking $x_t$ to $x^\star$.
Upper Confidence Bound

Converging Distances
$$d_t\big( \pi_i(x^\star), \pi_{i-1}(x^\star) \big) \le \varepsilon_{i-1}.$$

UCB at Depth $i$: Union Bound on $T_i$
For any $i \ge 1$, with probability at least $1 - |T_i|\, \delta$:
$$\sup_{x^\star \in \mathcal{X}} f(\pi_i(x^\star)) - f(\pi_{i-1}(x^\star)) \le \sup_{x^\star \in \mathcal{X}} \mu_t(\pi_i(x^\star)) - \mu_t(\pi_{i-1}(x^\star)) + \beta_\delta\, \varepsilon_{i-1}.$$

Final UCB
Set $\beta_{\delta,i} = \sqrt{2 \log\big( i^2 |T_i|\, \delta^{-1} \big)}$; with probability at least $1 - \frac{\pi^2}{6} \delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le \sup_{x^\star \in \mathcal{X}} \mu_t(x^\star) - \mu_t(x_t) + \sum_{i :\, \varepsilon_i < d_t(x^\star, x_t)} \varepsilon_i\, \beta_{\delta,i}.$$
Part 3: Algorithm and Theoretical Results
The Chaining-UCB Algorithm [Contal et al., 2015]

UCB Policy
$$x_{t+1} \in \operatorname{argmax}_{x \in \mathcal{X}}\ \mu_t(x) + \sum_{i :\, \varepsilon_i < \sigma_t(x)} \varepsilon_i\, \beta_{\delta,i}.$$

Practical Remark
The algorithm only needs to compute the $\beta_{\delta,i}$ for the $i$ such that $\varepsilon_i > \min_{x \in \mathcal{X}} \sigma_t(x)$.
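A minimal sketch of this acquisition rule over a finite candidate set (my own illustration, not the authors' code: `mu_t`, `sigma_t` come from the posterior sketch above, and `net_sizes[i]` stands for $|T_i|$ of a precomputed hierarchy truncated at some finite depth, which also implements the practical remark since the common tail of the sum does not change the argmax):

```python
import numpy as np

def chaining_bonus(sigma, net_sizes, delta=0.05):
    """Exploration bonus: sum over {i : eps_i < sigma} of eps_i * beta_{delta,i},
    with eps_i = 2^{-i}, truncated at the deepest precomputed net."""
    bonus = 0.0
    for i, n_i in enumerate(net_sizes):
        eps_i = 2.0 ** (-i)
        if i >= 1 and eps_i < sigma:
            beta_i = np.sqrt(2 * np.log(i ** 2 * n_i / delta))
            bonus += eps_i * beta_i
    return bonus

def chaining_ucb_step(candidates, mu_t, sigma_t, net_sizes, delta=0.05):
    """One step of the UCB policy: argmax of mu_t(x) + chaining bonus."""
    scores = [mu_t(x) + chaining_bonus(sigma_t(x), net_sizes, delta)
              for x in candidates]
    return candidates[int(np.argmax(scores))]
```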
Upper Confidence Bound (figure)
Greedy Cover

NP-hardness
Computing the $\beta_{\delta,i}$ requires computing the hierarchical $\varepsilon_i$-nets, and finding the smallest $\varepsilon$-net is NP-hard.

Greedy Approximation (see the Python sketch below)
$T \leftarrow \emptyset$; $X' \leftarrow \mathcal{X}$
while $X' \ne \emptyset$ do
  $x \leftarrow \operatorname{argmax}_{x \in X'} \big| \{ x' \in X' : d(x, x') \le \varepsilon \} \big|$
  $T \leftarrow T \cup \{x\}$
  $X' \leftarrow X' \setminus \{ x' \in X' : d(x, x') \le \varepsilon \}$
end
return $T$
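A direct, unoptimized implementation of this greedy cover for a finite set of points (illustration; the quadratic scan over the remaining points is the naive approach):

```python
def greedy_eps_net(points, d, eps):
    """Greedy epsilon-net: repeatedly add the point whose eps-ball covers the
    most remaining points, then discard everything it covers."""
    remaining = list(range(len(points)))
    net = []
    while remaining:
        n_covered = lambda i: sum(d(points[i], points[j]) <= eps for j in remaining)
        best = max(remaining, key=n_covered)
        net.append(points[best])
        remaining = [j for j in remaining if d(points[best], points[j]) > eps]
    return net

# Example: a 0.25-net of a 1D grid for the Euclidean distance.
net = greedy_eps_net([i / 100 for i in range(101)], lambda a, b: abs(a - b), 0.25)
```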
Theorem: Generic Bound for the Chaining-UCB Algorithm
For $\delta > 0$, denoting $\sigma_t = \sigma_t(x_t)$, there exists $c_\delta \in \mathbb{R}$ such that for all $t \ge 1$, with probability at least $1 - \delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le \sigma_t \Big( c_\delta \sqrt{6 \log \sigma_t^{-1}} + 9 \sum_{i :\, \varepsilon_i < \sigma_t} \varepsilon_i \sqrt{\log N(\mathcal{X}, d_t, \varepsilon_i)} \Big).$$
Corollary when Controlling the Metric Entropy

Assumption
$\exists D \in \mathbb{R}$ such that $N(\mathcal{X}, d_n, \varepsilon) = O(\varepsilon^{-D})$. It suffices that $d_n(x, x') = O(\|x - x'\|_2)$ and $\mathcal{X} \subseteq [0, R]^D$, e.g. squared exponential covariance, Matérn covariance, ...

Corollary
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \le O\Big( \sqrt{D}\, \sigma_t \sqrt{\log \sigma_t^{-1}} \Big),$$
thus
$$R_T \le O\Big( \sqrt{D} \sum_{t=1}^T \sigma_t \sqrt{\log \sigma_t^{-1}} \Big).$$
Information Gain

Lemma [Srinivas et al., 2012]
$$\sum_{t=1}^T \sigma_t^2 \le O(\gamma_T),$$
where $\gamma_T = \max_{X \subseteq \mathcal{X} :\, |X| = T} I(X)$ is the maximum information gain on $f$ from a set of $T$ observations. For GPs, $I(X) = \frac{1}{2} \log\det\big( I + \eta^{-2} K_X \big)$.

Upper Bounds
- Linear covariance $k(x, y) = x^\top y$: $\gamma_T \le O(D \log T)$
- Squared exponential covariance $k(x, y) = e^{-\frac{1}{2}\|x - y\|_2^2}$: $\gamma_T \le O\big( (\log T)^{D+1} \big)$
- Matérn covariance with parameter $\nu > 1$: $\gamma_T \le O\big( (\log T)\, T^a \big)$, with $a = \frac{D(D+1)}{2\nu + D(D+1)} < 1$
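For reference, the information gain of a fixed design $X$ is straightforward to evaluate from its kernel matrix (a sketch; `slogdet` is used for numerical stability):

```python
import numpy as np

def information_gain(K_X, eta):
    """I(X) = 0.5 * log det(I + eta^{-2} K_X) for a kernel matrix K_X."""
    n = K_X.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_X / eta ** 2)
    return 0.5 * logdet
```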
Corollary for the Regret
- Linear covariance: $R_T \le O\big( \sqrt{DT}\, \log T \big)$
- Squared exponential covariance: $R_T \le O\big( \sqrt{T (\log T)^{D+2}} \big)$
- Matérn covariance: $R_T \le O\big( \log T \cdot T^{a} \big)$, with $a = \frac{\nu + D + D^2}{2\nu + D + D^2}$
Part 4: Experiments on real and synthetic data sets
Experiments 1/4: Himmelblau's Function

Figure: simple regret $S_n$ (log scale) versus iteration $n$, over 100 iterations, for Chaining-UCB, GP-UCB and Random.
Experiments 2/4: SE Kernel

Figure: simple regret $S_n$ (log scale) versus iteration $n$, over 100 iterations, for Chaining-UCB, GP-UCB and Random.
Experiments 3/4: Wave Energy Converter

Figure: simple regret $S_n$ (log scale) versus iteration $n$, over 200 iterations, for Chaining-UCB, GP-UCB and Random.
Experiments 4/4: Graph Kernel

Figure: simple regret $S_n$ (log scale) versus iteration $n$, over 80 iterations, for Chaining-UCB, GP-UCB and Random; example graphs with values $f(\cdot) = 1.3$, $f(\cdot) = 0.2$, $f(\cdot) = 2.7$.
Part 5: Further Results: Quadratic Forms, Noise-free Optimization and Lower Bounds
Optimization of Other Stochastic Processes

Minimal Assumption
$f : \mathcal{X} \to \mathbb{R}$ is a stochastic process with a pseudo-distance $d(\cdot, \cdot)$ and a function $\psi_u(\cdot)$ such that:
$$\Pr\big[ f(x) - f(x') > \psi_u\big( d(x, x') \big) \big] < e^{-u}.$$

Example: Quadratic Forms of GPs
$f(x) = \sum_{i=1}^N g_i^2(x)$ where $g_i \sim GP(0, k_i)$.

Applications
- Optimization of a costly mean-square error
- Optimization of a costly Gaussian likelihood (Bayesian model calibration)
Noise-free Optimization

Problem Setting
$f \sim GP(0, k)$ and $y_t = f(x_t)$ (no noise).

Algorithm
- Pre-compute the hierarchical $\varepsilon_i$-nets for $d_0$ and build the corresponding tree.
- For $x$ in the tree, let $\Delta_\delta(x) = \sum_{i > \mathrm{depth}(x)} \varepsilon_i\, \beta_{\delta,i}$.
- Evaluate $f$ at the root.
- Loop: $x_{t+1} = \operatorname{argmax}_{x \in \mathrm{Children}(x_1, \dots, x_t)} f(x) + \Delta_\delta(x)$.
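One way to read this loop, in the spirit of optimistic optimization [Munos, 2011], is the following sketch (my interpretation, not the authors' code): repeatedly expand the evaluated node with the largest upper bound $f(x) + \Delta_\delta(x)$ by evaluating its children; `children` and `omega` ($= \Delta_\delta$) come from the precomputed tree.

```python
import heapq
import itertools

def noise_free_optimize(f, root, children, omega, budget):
    """Optimistic tree search: pop the node with largest f(x) + omega(x),
    evaluate its children, repeat until the query budget is spent."""
    tie = itertools.count()  # tiebreaker so heapq never compares nodes directly
    best_val, best_x = f(root), root
    heap = [(-(best_val + omega(root)), next(tie), root)]  # max-heap via negation
    n_evals = 1
    while heap and n_evals < budget:
        _, _, x = heapq.heappop(heap)
        for c in children(x):
            fc = f(c)
            n_evals += 1
            if fc > best_val:
                best_val, best_x = fc, c
            heapq.heappush(heap, (-(fc + omega(c)), next(tie), c))
    return best_x, best_val
```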
Noise-free Optimization: Results

Property
With probability at least $1 - \delta$: $f^\star - f(x_t) \le \Delta_\delta(x_t)$.

Lemma
$$\Big| \big\{ x_t : \Delta_\delta(x_t) \ge \epsilon \big( 1 + \mathrm{Depth}(x_t) \big)^{\frac{1}{2}} \big\} \Big| \le O(1).$$

Theorem (Ongoing Work)
If $N(\mathcal{X}, d_0, \varepsilon) = O(\varepsilon^{-D})$, then for the previous algorithm, $R_T = O(1)$ and $S_T = O(e^{-T})$.
Lower Bounds on the Supremum of the GP

Reminder: UCB Rewritten
With probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree:
$$\sup_{x^\star \in B(x, \epsilon_h)} f(x^\star) - f(x) \le \mathrm{const} \sum_{i > h} \epsilon_i\, \beta_{\delta,i}.$$

Theorem: LCB (Ongoing Work)
With probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree:
$$\sup_{x^\star \in B(x, \epsilon_h)} f(x^\star) - f(x) \ge \mathrm{const} \sum_{i > h} \epsilon_i\, \beta_{\delta,i}.$$
Conclusion

The Chaining-UCB Algorithm
- automatic calibration of the exploration/exploitation tradeoff
- adapts to various settings
- computationally tractable

Matlab code online: http://econtal.perso.math.cnrs.fr/software/
References

Contal, E., Malherbe, C., and Vayatis, N. (2015). Optimization for Gaussian processes via chaining. NIPS Workshop on Bayesian Optimization.

Munos, R. (2011). Optimistic optimization of deterministic functions without the knowledge of its smoothness. In Advances in Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. (2006). Gaussian Processes for Machine Learning. MIT Press.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265.