Parallel Bayesian Global Optimization, with Application to Metrics Optimization at Yelp


1 Parallel Bayesian Global Optimization, with Application to Metrics Optimization at Yelp
Jialei Wang (1), Peter Frazier (1), Scott Clark (2), Eric Liu (2)
(1) School of Operations Research & Information Engineering, Cornell University; (2) Yelp Inc.
Friday, August 15, 2014. MOPTA 2014, Lehigh, Bethlehem, PA

Metrics Optimization Engine (MOE): a global optimization toolbox for real-world metric optimization. It was developed by engineers at Yelp and was recently open-sourced; it is available at http://yelp.github.io/moe.

2 Metrics Optimization Engine (MOE)
For a discrete domain, MOE models the problem as a multi-armed bandit problem and offers a number of algorithms (not discussed in this talk). If the function domain is continuous, MOE treats the problem as a derivative-free black-box global optimization problem. Our contribution is to develop a parallel Bayesian global optimization algorithm.

3 Derivative-Free Black-box Global Optimization
Objective function $f : \mathbb{R}^d \to \mathbb{R}$, continuous but not concave. Our goal is to find a global optimum, $\max_{x \in A} f(x)$.
Assumptions: $f$ is a black box, and we can only evaluate it at points of interest. Each evaluation is time-consuming (hours or days), and derivative information is unavailable. The feasible set is $A \subseteq \mathbb{R}^d$, with constraints that are cheap to evaluate.
[Figure: plot of f(x) with labels y(n) and x*]


4 Use cases of MOE
Optimizing tunable parameters of a machine-learning prediction model. Example: deep learning methods. Example: hyperparameters of features used in a machine-learning model.
Optimizing the design of an engineering system or the parameters of physical experiments. Example: optimizing the concentrations of chemicals, temperature, and pressure for material design.

5 MOE uses Bayesian Global Optimization for solving Derivative-Free Black-Box Global Optimization problems
Bayesian Global Optimization (BGO) is a class of methods for solving derivative-free black-box global optimization problems. In BGO, we place a Bayesian prior distribution on the objective function f (MOE uses a Gaussian process prior). Ideally, we would find an algorithm with optimal average-case performance under this prior; we settle for an algorithm with good average-case performance. In MOE, we use the Expected Improvement algorithm to decide where to sample next.

6 Work flow of MOE

7 Background: Gaussian Process

8 Background: Gaussian Process

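9 Background: Expected Improvement
Expected Improvement (EI) measures how much improvement over the best value observed so far we expect to obtain by sampling at $x$. It is defined as
$EI_n(x) = E_n\left[(f(x) - f_n^*)^+\right]$,
where $f_n^*$ is the best value among the first $n$ evaluations, $E_n$ is the expectation under the posterior after those evaluations, and $(a)^+ = \max(a, 0)$.
[Figure: value and EI plotted against x]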
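
For a single point, the expectation in the definition of EI has a well-known closed form when the posterior of f(x) is normal. The following is a minimal sketch of that formula, not MOE's implementation; the function name and the arguments `mu`, `sigma`, and `best` (posterior mean, posterior standard deviation, and best value observed so far) are names chosen for this illustration.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form E[(Y - best)^+] for Y ~ Normal(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the improvement is deterministic.
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf
```

The first term rewards points whose posterior mean already exceeds the best value; the second rewards posterior uncertainty, which is what drives exploration.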


10 Case of multiple simultaneous function evaluations
What if we can perform multiple function evaluations simultaneously? This is the case with parallel computing, and in many experimental settings (particularly in biology). David Ginsbourger suggested extending sequential EI to q-EI, written as
$EI_n(x_1, \dots, x_q) = E_n\left[\left(\max_{i=1,\dots,q} f(x_i) - f_n^*\right)^+\right]$.
[Images: Cornell Tardis Cluster; BIAcore machine]
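
Because q-EI is an expectation over the joint posterior of the q values, it can be estimated by plain Monte Carlo. The sketch below assumes the joint posterior is multivariate normal with a given mean vector and covariance matrix; the function names, the sample count, and the hand-written Cholesky routine are illustrative choices, not MOE's code.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a small symmetric positive-definite matrix."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(c[i][k] * c[j][k] for k in range(j))
            c[i][j] = math.sqrt(s) if i == j else s / c[j][j]
    return c


def q_ei_monte_carlo(mean, cov, best, num_samples=20000, seed=0):
    """Estimate E[(max_i f(x_i) - best)^+] with (f(x_1), ..., f(x_q)) ~ Normal(mean, cov)."""
    rng = random.Random(seed)
    c = cholesky(cov)
    q = len(mean)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(q)]
        # One joint posterior draw: mean + C z, with C the Cholesky factor.
        values = [mean[i] + sum(c[i][k] * z[k] for k in range(i + 1)) for i in range(q)]
        total += max(max(values) - best, 0.0)
    return total / num_samples
```

With q = 1 this reduces to an estimate of ordinary EI. The estimate is noisy, which is one reason optimizing it directly is expensive.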

11 q-EI is hard to optimize
To find the set of points to evaluate next, we would like to solve $\max_{x_1,\dots,x_q} EI(x_1,\dots,x_q)$. However, when $q > 2$, q-EI has no general closed-form expression, so its derivatives are not available in closed form either. Directly optimizing q-EI becomes extremely expensive as $q$ and $d$ (the dimension of the inputs) grow.

12 Our Contribution
Our contribution is an efficient method for solving $\arg\max_{x_1,\dots,x_q} EI(x_1,\dots,x_q)$. This transforms the Bayes-optimal function evaluation plan, previously considered a purely conceptual algorithm, into something implementable.

13 Our approach to solving $\arg\max_{x_1,\dots,x_q} EI(x_1,\dots,x_q)$
1. Construct an unbiased estimator of $\nabla EI(x_1,\dots,x_q)$ using infinitesimal perturbation analysis (IPA).
2. Use multistart stochastic gradient ascent to find an approximate solution to $\max_{x_1,\dots,x_q} EI(x_1,\dots,x_q)$.

14 We construct an estimator of the gradient
Using sufficient conditions described on the next slide, we switch $\nabla$ and expectation to obtain our unbiased estimator of the gradient,
$\nabla EI(x_1,\dots,x_q) = E_n\left[g(x_1,\dots,x_q,Z)\right]$,
where
$g(x_1,\dots,x_q,Z) = \begin{cases} \nabla\left[f(x_1,\dots,x_q,Z)\right] & \text{if } \nabla\left[f(x_1,\dots,x_q,Z)\right] \text{ exists}, \\ 0 & \text{if it does not exist}. \end{cases}$
$g(x_1,\dots,x_q,Z)$ can be computed using results on differentiation of the Cholesky decomposition.

15 Our gradient estimator is unbiased, given sufficient conditions.
Theorem. Let $m(x_1,\dots,x_q)$ and $C(x_1,\dots,x_q)$ be the mean vector and the Cholesky factor of the covariance matrix of $(f(x_1),\dots,f(x_q))$. If the following conditions hold: (i) $m(x_1,\dots,x_q)$ and $C(x_1,\dots,x_q)$ are three times continuously differentiable in a neighborhood of $x_1,\dots,x_q$; (ii) $C(x_1,\dots,x_q)$ has no duplicated rows; then
$\nabla EI(x_1,\dots,x_q) = E_n\left[g(x_1,\dots,x_q,Z)\right]$.
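
As a worked illustration of what the estimator looks like (a sketch under stated assumptions, not the authors' derivation): write $X = (x_1,\dots,x_q)$, let $Z$ be a standard normal vector, and let $f_n^*$ be the best value observed so far. The symbol $h$ is introduced here only for this sketch; it denotes the improvement for a fixed draw of $Z$, using the fact that $m(X) + C(X)Z$ has the same distribution as $(f(x_1),\dots,f(x_q))$. Where the maximizing index $i^*$ is unique, the gradient of $h$ is the gradient of a single component, or zero when no improvement occurs.

```latex
\begin{aligned}
h(X, Z) &= \Bigl(\max_{i=1,\dots,q} \bigl[m(X) + C(X) Z\bigr]_i - f_n^*\Bigr)^+,
\qquad i^* = \arg\max_{i} \bigl[m(X) + C(X) Z\bigr]_i, \\
\nabla h(X, Z) &=
\begin{cases}
\nabla \bigl[m(X) + C(X) Z\bigr]_{i^*} & \text{if } \bigl[m(X) + C(X) Z\bigr]_{i^*} > f_n^*, \\
0 & \text{if } \bigl[m(X) + C(X) Z\bigr]_{i^*} < f_n^*.
\end{cases}
\end{aligned}
```

The first case is where differentiation of the Cholesky factor $C(X)$ enters; ties and the boundary case occur with probability zero under the theorem's conditions, which is why the estimator can be set to zero there.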

16 Example of Estimated Gradient

17 Multistart Stochastic Gradient Ascent
1. Select several starting points, uniformly at random.
2. From each starting point, iterate using the stochastic gradient method until convergence: $(x_1,\dots,x_q) \leftarrow (x_1,\dots,x_q) + \alpha_n\, g(x_1,\dots,x_q,\omega)$, where $(\alpha_n)$ is a stepsize sequence.
3. For each starting point, average the iterates to get an estimated stationary point (Polyak-Ruppert averaging).
4. Select the estimated stationary point with the best estimated value as the solution.
[Figure: axes x_1 and x_2]
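
The four steps can be sketched on a toy problem. The code below is an illustration only: it maximizes a hypothetical one-dimensional function with two local maxima instead of q-EI, uses the exact derivative plus zero-mean noise as the unbiased gradient estimate, and picks the stepsize sequence $\alpha_n = 0.05 / n^{0.7}$ arbitrarily. None of these choices come from MOE.

```python
import random


def objective(x):
    # Toy stand-in for the quantity being maximized (not q-EI itself).
    return -(x * x - 1.0) ** 2 + 0.3 * x


def noisy_gradient(x, rng):
    # Exact derivative of the toy objective plus zero-mean noise,
    # so the estimate is unbiased.
    return -4.0 * x * (x * x - 1.0) + 0.3 + rng.gauss(0.0, 1.0)


def multistart_sga(num_starts=16, num_steps=500, seed=0):
    rng = random.Random(seed)
    best_x, best_value = None, float("-inf")
    for _ in range(num_starts):
        x = rng.uniform(-2.0, 2.0)               # step 1: random starting point
        running_sum = 0.0
        for n in range(1, num_steps + 1):
            alpha = 0.05 / n ** 0.7              # decreasing stepsize sequence
            x += alpha * noisy_gradient(x, rng)  # step 2: stochastic gradient ascent
            running_sum += x
        x_avg = running_sum / num_steps          # step 3: Polyak-Ruppert average
        # Step 4: compare values. The toy objective is evaluated exactly here;
        # for q-EI this value would itself be a Monte Carlo estimate.
        value = objective(x_avg)
        if value > best_value:
            best_x, best_value = x_avg, value
    return best_x, best_value
```

Starts that land in the basin of the lower local maximum converge there; the final comparison across starts is what lets the method return the better of the two.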

18 We can handle asynchronous function evaluations
As previously described, if there are no function evaluations currently in progress, we solve $\max_{x_1,\dots,x_q} EI(x_1,\dots,x_q)$ to get the set of points to run next. If there are function evaluations already in progress, say at $x_1,\dots,x_p$, we take these as given and optimize the rest, $x_{p+1},\dots,x_{p+q}$:
$\max_{x_{p+1},\dots,x_{p+q}} EI(x_1,\dots,x_{p+q})$.
This is implemented as q,p-EI in MOE.
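
In a gradient-based optimizer, taking the pending points as given amounts to leaving them out of the update. A minimal sketch, assuming points and gradients are stored as lists of coordinate lists with the pending evaluations first; the function name `ascent_step` and its arguments are hypothetical and do not reflect MOE's interface.

```python
def ascent_step(points, gradient, step_size, num_pending):
    """Return the points after one ascent step that moves only the free points.

    points and gradient are lists of d-dimensional lists; the first
    num_pending entries are evaluations already in progress and stay fixed.
    """
    updated = []
    for index, (point, grad) in enumerate(zip(points, gradient)):
        if index < num_pending:
            updated.append(list(point))  # pending point: taken as given
        else:
            updated.append([x + step_size * g for x, g in zip(point, grad)])
    return updated
```

The pending points still enter the objective, since the improvement is computed over all p + q points; only the search variables are restricted.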

19 GPU parallel programming speed up

20 GPU parallel programming speed up

21 Conclusion
MOE is an open-source software package for derivative-free black-box global optimization. We considered a previously proposed conceptual method for parallel Bayesian global optimization, proposed an efficient algorithm for it, and implemented it in MOE.
