The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan

Outline
- Background: Global Optimization and Gaussian Processes
- The Geometry of Gaussian Processes and the Chaining Trick
- Algorithm and Theoretical Results
- Experiments on real and synthetic data sets
- Further Results: Quadratic Forms, Noise-free Optimization and Lower Bounds

Sequential Black-Box Optimization

Problem statement: let $f : \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ may be a subset of $\mathbb{R}^D$, non-parametric, ... We consider the problem of finding the maximum of $f$, denoted by $f^\star = \sup_{x \in \mathcal{X}} f(x)$, via successive (expensive) queries $f(x_1), f(x_2), \dots$

Noisy observations: at iteration $T$ we choose $x_{T+1}$ using the previous noisy observations $Y_T = \{y_1, \dots, y_T\}$, where for all $t \leq T$: $y_t = f(x_t) + \epsilon_t$ with $\epsilon_t \overset{iid}{\sim} \mathcal{N}(0, \eta^2)$.

Objective

Regret (unknown in practice): the efficiency of a policy is measured via the simple or the cumulative regret:
$$S_T = \min_{t \leq T} \big(f^\star - f(x_t)\big), \qquad R_T = \sum_{t \leq T} \big(f^\star - f(x_t)\big).$$

Goal: $S_T \to 0$ as fast as possible (e.g. numerical optimization); $R_T = o(T)$ and as small as possible (e.g. clinical trials, ad campaigns). Our aim is to obtain upper bounds on $S_T$ and $R_T$ holding with high probability.

Exploration/Exploitation Tradeoff (figure: four noisy observations $(x_1, y_1), \dots, (x_4, y_4)$ of the objective over the parameter space; where should the next query $x_5$ be placed?)

Gaussian Processes

Definition: $f \sim \mathcal{GP}(m, k)$, with mean function $m : \mathcal{X} \to \mathbb{R}$ and covariance function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, when for all $x_1, \dots, x_n \in \mathcal{X}$ we have:
$$\big(f(x_1), \dots, f(x_n)\big) \sim \mathcal{N}\Big(\big[m(x_i)\big]_i,\ \big[k(x_i, x_j)\big]_{i,j}\Big).$$

Probabilistic smoothness assumption: nearby locations are highly correlated; large local variations have low probability.

Examples of covariance functions:
- Squared Exponential (RBF): $k(x, y) = \exp\big(-\frac{\|x - y\|_2^2}{2\ell^2}\big)$
- Rational Quadratic: $k(x, y) = \big(1 + \frac{\|x - y\|_2^2}{2\alpha\ell^2}\big)^{-\alpha}$
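As a concrete illustration, here is a minimal NumPy sketch of the two covariance functions above; the lengthscale $\ell$ and the parameter $\alpha$ are illustrative arguments, not values used in the talk.

```python
import numpy as np

def squared_exponential(x, y, lengthscale=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * l^2))
    r2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.exp(-r2 / (2.0 * lengthscale ** 2))

def rational_quadratic(x, y, lengthscale=1.0, alpha=1.0):
    # k(x, y) = (1 + ||x - y||^2 / (2 * alpha * l^2))^(-alpha)
    r2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return (1.0 + r2 / (2.0 * alpha * lengthscale ** 2)) ** (-alpha)
```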

Gaussian Processes: Examples (figure: samples of 1D Gaussian processes with different covariance functions)

Posterior Distribution

Bayesian inference (Rasmussen and Williams, 2006): given the observations $Y_t = [y_1, \dots, y_t]$ at the query locations $X_t = (x_1, \dots, x_t)$, we compute for all $x \in \mathcal{X}$:
$$\mu_t(x) := \mathbb{E}[f(x) \mid X_t, Y_t] = k_t(x)^\top C_t^{-1} Y_t$$
$$\sigma_t^2(x) := \mathbb{V}[f(x) \mid X_t, Y_t] = k(x, x) - k_t(x)^\top C_t^{-1} k_t(x)$$
where $C_t = K_t + \eta^2 I$, $K_t = [k(x_i, x_j)]_{x_i, x_j \in X_t}$ and $k_t(x) = [k(x, x_i)]_{x_i \in X_t}$.

Interpretation: the posterior mean $\mu_t$ is the prediction; the posterior deviation $\sigma_t$ is the uncertainty.
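The two formulas above translate directly into a few lines of linear algebra. Below is a minimal sketch assuming NumPy; `kernel` is any covariance function (e.g. the squared-exponential sketch earlier), and `X_t`, `Y_t`, `eta` stand for the query locations, observations and noise level.

```python
import numpy as np

def gp_posterior(x, X_t, Y_t, kernel, eta):
    """Posterior mean mu_t(x) and variance sigma_t^2(x) at a single point x."""
    K_t = np.array([[kernel(xi, xj) for xj in X_t] for xi in X_t])  # K_t
    C_t = K_t + eta ** 2 * np.eye(len(X_t))                         # C_t = K_t + eta^2 I
    k_t = np.array([kernel(x, xi) for xi in X_t])                   # k_t(x)
    mu = k_t @ np.linalg.solve(C_t, np.asarray(Y_t, float))         # k_t(x)^T C_t^{-1} Y_t
    var = kernel(x, x) - k_t @ np.linalg.solve(C_t, k_t)            # k(x,x) - k_t(x)^T C_t^{-1} k_t(x)
    return mu, var
```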

Gaussian Processes Confidence Bounds (figure: posterior mean with confidence bounds on a 1D example)

Setup: Summary

Assumptions: $f \sim \mathcal{GP}(0, k)$ with known covariance $k$; $y_t = f(x_t) + \epsilon_t$ where $\epsilon_t \sim \mathcal{N}(0, \eta^2)$ with known $\eta$.

Regrets:
$$S_T = \min_{t \leq T} \big(f^\star - f(x_t)\big), \qquad R_T = \sum_{t \leq T} \big(f^\star - f(x_t)\big).$$

Related Work

Bayesian optimization:
- Bull (2011): Expected Improvement algorithm
- Hennig et al. (2012): Entropy Search algorithm

Upper confidence bounds:
- Freitas et al. (2012): deterministic GP
- Srinivas et al. (2012): GP-UCB
- Djolonga et al. (2013): high-dimensional GP

Chaining:
- Grunewalder et al. (2010): known-horizon bandits
- Gaillard and Gerchinovitz (2015): online regression

The Geometry of Gaussian Processes and the Chaining Trick

Upper Confidence Bounds (UCB)

Strategy: if, with high probability, $f^\star - f(x_t) \leq \mathrm{UCB}_t(x_t)$, then we can control the regret:
$$R_T \leq \sum_{t \leq T} \mathrm{UCB}_t(x_t) \qquad \text{and} \qquad S_T \leq \frac{1}{T} \sum_{t \leq T} \mathrm{UCB}_t(x_t).$$

Canonical Pseudo-Distance and a First UCB

Fix $x^\star \in \mathcal{X}$ and let
$$d_t^2(x^\star, x_t) = \mathbb{V}[f(x^\star) - f(x_t) \mid X_t, Y_t] = \sigma_t^2(x^\star) - 2\,\mathrm{cov}_t(x^\star, x_t) + \sigma_t^2(x_t).$$
For all $\delta > 0$, set $\beta_\delta = \sqrt{2 \log \delta^{-1}}$; with probability at least $1 - \delta$:
$$f(x^\star) - f(x_t) \leq \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t).$$

Union bound over $\mathcal{X}$: with $|\mathcal{X}| < \infty$, we have with probability at least $1 - |\mathcal{X}|\delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \leq \sup_{x^\star \in \mathcal{X}} \big\{ \mu_t(x^\star) - \mu_t(x_t) + \beta_\delta\, d_t(x^\star, x_t) \big\}.$$

But what if $|\mathcal{X}| = \infty$?
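The canonical pseudo-distance can be evaluated from the posterior covariance. A minimal sketch, assuming NumPy, the `kernel` function and the matrix `C_t = K_t + eta^2 I` from the posterior sketch above; `posterior_cov` is a hypothetical helper implementing $\mathrm{cov}_t(x, x') = k(x, x') - k_t(x)^\top C_t^{-1} k_t(x')$.

```python
import numpy as np

def posterior_cov(x, xp, X_t, kernel, C_t):
    """Posterior covariance cov_t(x, x') = k(x, x') - k_t(x)^T C_t^{-1} k_t(x')."""
    k_x = np.array([kernel(x, xi) for xi in X_t])
    k_xp = np.array([kernel(xp, xi) for xi in X_t])
    return kernel(x, xp) - k_x @ np.linalg.solve(C_t, k_xp)

def canonical_distance(x, xp, X_t, kernel, C_t):
    """d_t(x, x') = sqrt(Var[f(x) - f(x') | X_t, Y_t])."""
    var_x = posterior_cov(x, x, X_t, kernel, C_t)
    var_xp = posterior_cov(xp, xp, X_t, kernel, C_t)
    cov = posterior_cov(x, xp, X_t, kernel, C_t)
    return np.sqrt(max(var_x - 2.0 * cov + var_xp, 0.0))
```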

Covering Numbers

$\epsilon$-net: $T \subseteq \mathcal{X}$ is an $\epsilon$-net of $\mathcal{X}$ for $d_t$ iff for all $x \in \mathcal{X}$ there exists $x' \in T$ such that $d_t(x, x') \leq \epsilon$.

Covering number: $N(\mathcal{X}, d_t, \epsilon)$ is the size of the smallest $\epsilon$-net.

An $\epsilon$-net for the Euclidean Distance (figure: balls of radius $\epsilon$ covering $\mathcal{X}$)

Hierarchical Covers

Assumption (w.l.o.g.): for all $x, y \in \mathcal{X}$, $k(x, y) \leq 1$. Since $d_t(x, y)$ is bounded in terms of $k(x, y) \leq 1$, any single point of $\mathcal{X}$ is a 1-net of $\mathcal{X}$ for $d_t$.

Hierarchical covers: let $\mathcal{T} = (T_i)_{i \geq 0}$ be such that for all $i \geq 0$: $T_i$ is an $\epsilon_i$-net with $\epsilon_i = 2^{-i}$, and $T_i \subseteq T_{i+1}$.

Hierarchical Covers (figures: successive nets of the same set at resolutions $\epsilon_0 = 1$, $\epsilon_1 = 1/2$, $\epsilon_2 = 1/4$, $\epsilon_3 = 1/8$)

Localized Chaining

Projection onto $\mathcal{T}$, starting from $x_t$ toward $x^\star$: define $\pi_i(x^\star) = \mathrm{argmin}_{x_i \in T_i \cup \{x_t\}}\, d_t(x^\star, x_i)$; then $\pi_i(x^\star) \to x^\star$ as $i \to \infty$, and $\pi_i(x^\star) = x_t$ whenever $d_t(x^\star, x_t) < \epsilon_i$.

The chaining trick:
$$f(x^\star) - f(x_t) = \sum_{i:\ \epsilon_i < d_t(x^\star, x_t)} \Big( f\big(\pi_i(x^\star)\big) - f\big(\pi_{i-1}(x^\star)\big) \Big),$$
and the same decomposition holds after taking the supremum over $x^\star \in \mathcal{X}$ on both sides.

The Chaining Trick (figure: the chain of projections linking $x_t$ to $x^\star$ through the hierarchical nets)

Upper Confidence Bound

Converging distances: $d_t\big(\pi_i(x^\star), \pi_{i-1}(x^\star)\big) \leq \epsilon_{i-1}$.

UCB at depth $i$ (union bound on $T_i$): for any $i \geq 1$, with probability at least $1 - |T_i|\delta$:
$$\sup_{x^\star \in \mathcal{X}} f\big(\pi_i(x^\star)\big) - f\big(\pi_{i-1}(x^\star)\big) \leq \sup_{x^\star \in \mathcal{X}} \mu_t\big(\pi_i(x^\star)\big) - \mu_t\big(\pi_{i-1}(x^\star)\big) + \beta_\delta\, \epsilon_{i-1}.$$

Final UCB: set $\beta_{\delta,i} = \sqrt{2 \log\big(i^2 |T_i| \delta^{-1}\big)}$; with probability at least $1 - \frac{\pi^2}{6}\delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \leq \sup_{x^\star \in \mathcal{X}} \Big\{ \mu_t(x^\star) - \mu_t(x_t) + \sum_{i:\ \epsilon_i < d_t(x^\star, x_t)} \epsilon_i \beta_{\delta,i} \Big\}.$$
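A minimal sketch of the level-dependent coefficient $\beta_{\delta,i}$, written to match the reconstructed formula above and assuming NumPy; `net_size` stands for $|T_i|$.

```python
import numpy as np

def beta_delta_i(i, net_size, delta):
    """beta_{delta,i} = sqrt(2 * log(i^2 * |T_i| / delta)) for level i >= 1."""
    return np.sqrt(2.0 * np.log(i ** 2 * net_size / delta))
```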

Algorithm and Theoretical Results

The Chaining-UCB Algorithm (Contal et al., 2015)

UCB policy:
$$x_{t+1} \in \mathrm{argmax}_{x \in \mathcal{X}}\ \mu_t(x) + \sum_{i:\ \epsilon_i < \sigma_t(x)} \epsilon_i \beta_{\delta,i}.$$

Practical remark: the algorithm only needs to compute the $\beta_{\delta,i}$ for the levels $i$ such that $\epsilon_i > \min_{x \in \mathcal{X}} \sigma_t(x)$.
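A minimal sketch of this acquisition rule over a finite candidate set, assuming NumPy; `mu_t` and `sigma_t` are the posterior mean and standard deviation evaluated at each candidate (e.g. via the posterior sketch above), and `beta` collects the $\beta_{\delta,i}$ for $\epsilon_i = 2^{-i}$.

```python
import numpy as np

def chaining_ucb_select(candidates, mu_t, sigma_t, beta):
    """Return the candidate maximizing mu_t(x) + sum_{i: eps_i < sigma_t(x)} eps_i * beta[i]."""
    scores = []
    for j in range(len(candidates)):
        bonus = 0.0
        for i, b in enumerate(beta):
            eps_i = 2.0 ** (-i)
            if eps_i < sigma_t[j]:      # only levels with eps_i < sigma_t(x) contribute
                bonus += eps_i * b
        scores.append(mu_t[j] + bonus)
    return candidates[int(np.argmax(scores))]
```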

Upper Confidence Bound (figure: illustration of the resulting upper confidence bound)

Greedy Cover

NP-hardness: computing the $\beta_{\delta,i}$ requires computing the hierarchical $\epsilon_i$-nets, and finding the smallest $\epsilon$-net is NP-hard.

Greedy approximation:
  T ← ∅; X' ← X
  while X' ≠ ∅ do
      x ← argmax_{x ∈ X'} |{x' ∈ X' : d(x, x') ≤ ε}|
      T ← T ∪ {x}
      X' ← X' \ {x' ∈ X' : d(x, x') ≤ ε}
  end
  return T
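Here is a minimal Python sketch of that greedy construction, assuming a finite list of `points` and a metric `dist` (e.g. the canonical pseudo-distance sketch above); it is an illustration, not a tuned implementation.

```python
def greedy_epsilon_net(points, dist, eps):
    """Greedily build an eps-net: repeatedly pick the point covering the most
    uncovered points within distance eps, then remove everything it covers."""
    remaining = list(points)
    net = []
    while remaining:
        best = max(remaining,
                   key=lambda x: sum(dist(x, xp) <= eps for xp in remaining))
        net.append(best)
        remaining = [xp for xp in remaining if dist(best, xp) > eps]
    return net
```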

Theorem: Generic Bounds for the Chaining-UCB Algorithm

For $\delta > 0$, denoting $\sigma_t = \sigma_t(x_t)$, there exists $c_\delta \in \mathbb{R}$ such that for all $t \geq 1$, with probability at least $1 - \delta$:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \leq \sigma_t \Big( c_\delta + \sqrt{6 \log \sigma_t^{-1}} \Big) + 9 \sum_{i:\ \epsilon_i < \sigma_t} \epsilon_i \sqrt{\log N(\mathcal{X}, d_t, \epsilon_i)}.$$

Corollary when Controlling the Metric Entropy

Assumption: there exists $D \in \mathbb{R}$ such that $N(\mathcal{X}, d_n, \epsilon) = O(\epsilon^{-D})$. It suffices that $d_n(x, x') = O(\|x - x'\|_2)$ and $\mathcal{X} \subseteq [0, R]^D$, e.g. squared exponential covariance, Matérn covariance, ...

Corollary:
$$\sup_{x^\star \in \mathcal{X}} f(x^\star) - f(x_t) \leq O\Big( \sigma_t \sqrt{D \log \sigma_t^{-1}} \Big), \quad \text{thus} \quad R_T \leq O\Big( \sqrt{D} \sum_{t=1}^{T} \sigma_t \sqrt{\log \sigma_t^{-1}} \Big).$$

Information Gain

Lemma (Srinivas et al., 2012):
$$\sum_{t=1}^{T} \sigma_t^2 \leq O(\gamma_T),$$
where $\gamma_T = \max_{X \subseteq \mathcal{X}:\ |X| = T} I(X)$ is the maximum information gain on $f$ from a set of $T$ observations. For a GP, $I(X) = \frac{1}{2} \log\det\big(I + \eta^{-2} K_X\big)$.

Upper bounds on $\gamma_T$:
- Linear covariance $k(x, y) = x^\top y$: $\gamma_T \leq O(D \log T)$
- Squared exponential covariance $k(x, y) = e^{-\frac{1}{2}\|x - y\|_2^2}$: $\gamma_T \leq O\big((\log T)^{D+1}\big)$
- Matérn covariance with parameter $\nu > 1$: $\gamma_T \leq O\big((\log T)\, T^a\big)$ with $a = \frac{D(D+1)}{2\nu + D(D+1)} < 1$
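The information gain of a set of observations is a single determinant; a minimal sketch assuming NumPy, where `K_X` is the kernel matrix of the chosen locations and `eta` the noise standard deviation.

```python
import numpy as np

def information_gain(K_X, eta):
    """I(X) = 0.5 * log det(I + eta^{-2} K_X)."""
    n = K_X.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K_X / eta ** 2)
    return 0.5 * logdet
```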

Corollary for the Regret

- Linear covariance: $R_T \leq O\big(\sqrt{DT}\, \log T\big)$
- Squared exponential covariance: $R_T \leq O\big(\sqrt{T (\log T)^{D+2}}\big)$
- Matérn covariance: $R_T \leq O\big(\sqrt{\log T}\; T^a\big)$ with $a = \frac{\nu + D + D^2}{2\nu + D + D^2}$

Experiments on real and synthetic data sets

Experiments 1/4: Himmelblau's function (figure: simple regret $S_n$ vs. iteration $n$, log scale, for Chaining-UCB, GP-UCB and Random)

Experiments 2/4: SE kernel (figure: simple regret $S_n$ vs. iteration $n$, log scale, for Chaining-UCB, GP-UCB and Random)

Experiments 3/4: Wave Energy Converter (figure: simple regret $S_n$ vs. iteration $n$, log scale, for Chaining-UCB, GP-UCB and Random)

Experiments 4/4: Graphs kernel (figure: simple regret $S_n$ vs. iteration $n$, log scale, for Chaining-UCB, GP-UCB and Random, with example graphs and their objective values)

Further Results: Quadratic Forms, Noise-free Optimization and Lower Bounds

Optimization of Other Stochastic Processes

Minimal assumption: $f : \mathcal{X} \to \mathbb{R}$ is a stochastic process with a pseudo-distance $d(\cdot, \cdot)$ and a function $\psi_u(\cdot)$ such that
$$\Pr\big[ f(x) - f(x') > \psi_u\big(d(x, x')\big) \big] < e^{-u}.$$

Example: quadratic forms of GPs, $f(x) = \sum_{i=1}^{N} g_i^2(x)$ where $g_i \sim \mathcal{GP}(0, k_i)$.

Applications: optimization of a costly mean-squared error; optimization of a costly Gaussian likelihood (Bayesian model calibration).

Noise-free Optimization

Problem setting: $f \sim \mathcal{GP}(0, k)$ and $y_t = f(x_t)$ (no noise).

Algorithm: pre-compute the hierarchical $\epsilon_i$-nets for $d_0$ and build the corresponding tree. For $x$ in the tree, let $\Delta_\delta(x) = \sum_{i > \mathrm{depth}(x)} \epsilon_i \beta_{\delta,i}$. Evaluate $f$ at the root, then loop:
$$x_{t+1} = \mathrm{argmax}_{x \in \mathrm{Children}(x_1, \dots, x_t)}\ f(x) + \Delta_\delta(x).$$

Noise-free Optimization: Results

Property: with probability at least $1 - \delta$, $f^\star - f(x_t) \leq \Delta_\delta(x_t)$.

Lemma: $\big|\{ x_t : \Delta_\delta(x_t) \geq \epsilon\, (1 + \mathrm{Depth}(x_t))^{-1/2} \}\big| \leq O(1)$.

Theorem (ongoing work): if $N(\mathcal{X}, d_0, \epsilon) = O(\epsilon^{-D})$, then for the previous algorithm $R_T = O(1)$ and $S_T = O(e^{-T})$.

Lower Bounds on the Supremum of the GP

Reminder (UCB rewritten): with probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree,
$$\sup_{x' \in B(x, \epsilon_h)} f(x') - f(x) \leq \mathrm{const} \cdot \sum_{i > h} \epsilon_i \beta_{\delta,i}.$$

Theorem, LCB (ongoing work): with probability at least $1 - \delta$, for all $x$ at depth $h$ in the tree,
$$\sup_{x' \in B(x, \epsilon_h)} f(x') - f(x) \geq \mathrm{const} \cdot \sum_{i > h} \epsilon_i \beta_{\delta,i}.$$

Conclusion

The Chaining-UCB algorithm:
- automatic calibration of the exploration/exploitation tradeoff
- adapts to various settings
- computationally tractable

Matlab code online: http://econtal.perso.math.cnrs.fr/software/

References

Contal, E., Malherbe, C., and Vayatis, N. (2015). Optimization for Gaussian processes via chaining. NIPS Workshop on Bayesian Optimization.

Munos, R. (2011). Optimistic optimization of a deterministic function without the knowledge of its smoothness. In Advances in Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.

Srinivas, N., Krause, A., Kakade, S., and Seeger, M. (2012). Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265.