Distributionally robust optimization techniques in batch Bayesian optimisation

Nikitas Rontsis

June 13, 2016

1 Introduction

This report is concerned with performing batch Bayesian optimization of an unknown function $f$. Using a Gaussian process (GP) framework to estimate the function, we search for the best batch of $k$ points at which the function will be evaluated. According to [3], evaluating the expected loss of a specific choice of $k$ points involves expensive integration over multidimensional regions. We reformulate the problem using worst-case expectation techniques with second-order moment information [6]. This reformulation is a conservative approximation of the original problem: it considers all distributions with a given mean and covariance, including the Gaussian distribution assumed in the original problem, and it is free from the expensive integrations of the original formulation. We show, however, that this formulation is overly conservative and returns trivial solutions. A way to avoid this problem is introduced by bounding the support of the distributions considered in the worst-case expectation, but this leads to a semi-infinite optimization problem. Sum-of-squares techniques are suggested as a possible relaxation.

2 Initial Definitions

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function to be minimized. Assume that $l$ points $y_i = f(x_i)$ have been gathered so far, forming a dataset $D_0 = \{(x_i, y_i)\} = (X_0, y_0)$ with $X_0 \in \mathbb{R}^{l \times n}$ and $y_0 \in \mathbb{R}^l$. In order to find the next $k$ points $X \in \mathbb{R}^{k \times n}$ where
the function will be evaluated, a GP is used to build a statistical picture of the function's form. For an overview of Gaussian processes see [5]. The properties of a GP are determined by a prior mean function $m(x) = \mathbb{E}(f(x))$, which, without loss of generality, can be assumed to be zero, and a prior positive semi-definite covariance function $k(x, x') = \mathbb{E}\big((f(x) - m(x))(f(x') - m(x'))\big)$. Given these, the GP dictates the following probability distribution for the function values $y$ at the $k$ selected points $X$:
$$y \mid D_0 \sim \mathcal{N}(\mu(X), \Sigma(X)), \tag{1}$$
with mean value and variance
$$\mu(X) = K(X_0, X)^\top \big(K(X_0, X_0) + \sigma_n^2 I\big)^{-1} y_0, \tag{2}$$
$$\Sigma(X) = K(X, X) - K(X_0, X)^\top \big(K(X_0, X_0) + \sigma_n^2 I\big)^{-1} K(X_0, X), \tag{3}$$
where $K(A, B)_{(i,j)} = k(A_i, B_j)$, i.e. each element of the matrix $K(A, B)$ is the covariance between the $i$-th point of $A$ and the $j$-th point of $B$. The mean value $\mu$ and the variance $\Sigma$ also depend on the prior dataset $D_0$, but we do not denote this dependence explicitly in order to keep the notation uncluttered.

3 Expected Loss Function

The expected loss of the next evaluation is
$$\Lambda(X \mid D_0) = \mathbb{E}\big(\min\{y, \eta\}\big), \tag{4}$$
with $\eta = \min y_0$. This can be reformulated as
$$\Lambda(X \mid D_0) = \eta \int_{C_0} \mathcal{N}(y; \mu, \Sigma)\,dy + \sum_{i=1}^{k} \int_{C_i} y_i\, \mathcal{N}(y; \mu, \Sigma)\,dy, \tag{5}$$
where the integrals are taken over $C_0 = \{y \in \mathbb{R}^k \mid y_j \geq \eta,\ j = 1, \dots, k\}$ and $C_i = \{y \in \mathbb{R}^k \mid y_i \leq \eta,\ y_i \leq y_j,\ j = 1, \dots, k\}$. In order to calculate the above $k+1$ $k$-dimensional integrals, [3] suggests using Expectation Propagation. However, Expectation Propagation is an approximate and expensive operation that needs to be performed at every step of the optimization algorithm used to minimize (4), considerably increasing the complexity of the resulting global optimization procedure.
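For reference, the expected loss (4) can also be estimated by plain Monte Carlo under the Gaussian posterior (1)–(3). The sketch below only illustrates the quantities involved and is not the Expectation Propagation approach of [3]; the squared-exponential kernel, the hyperparameters, the toy data and all function names are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_exp_kernel(A, B, ell=1.0, sf=1.0):
    """Squared-exponential covariance between the rows of A and B (an illustrative choice)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sf**2 * np.exp(-0.5 * d2 / ell**2)

def gp_posterior(X0, y0, X, sigma_n=0.1):
    """Posterior mean (2) and covariance (3) of the GP at the batch X."""
    K00 = sq_exp_kernel(X0, X0) + sigma_n**2 * np.eye(len(X0))
    K0X, KXX = sq_exp_kernel(X0, X), sq_exp_kernel(X, X)
    mu = K0X.T @ np.linalg.solve(K00, y0)
    Sigma = KXX - K0X.T @ np.linalg.solve(K00, K0X)
    return mu, Sigma

def expected_loss_mc(X0, y0, X, n_samples=100_000):
    """Monte Carlo estimate of (4): E[min{y_1, ..., y_k, eta}] with eta = min(y_0)."""
    mu, Sigma = gp_posterior(X0, y0, X)
    eta = y0.min()
    jitter = 1e-9 * np.eye(len(X))                  # numerical safeguard for sampling
    y = rng.multivariate_normal(mu, Sigma + jitter, size=n_samples)
    return np.minimum(y.min(axis=1), eta).mean()

# Toy data: l = 5 observations in n = 2 dimensions and a candidate batch of k = 3 points.
X0 = rng.uniform(-1, 1, (5, 2))
y0 = np.sin(X0).sum(axis=1)
X = rng.uniform(-1, 1, (3, 2))
print(expected_loss_mc(X0, y0, X))
```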
4 Worst Case Expectation Formulation

In this section we derive conservative approximations for the minimization of (4) using worst-case expectation techniques, in order to avoid the multidimensional integrations that are present in (5).

4.1 Generic Formulation

Define the set $\mathcal{P}(\mu, \Sigma)$ of all probability distributions on $\mathbb{R}^k$ with given mean vector $\mu$ and covariance matrix $\Sigma \succ 0$. Then, let $\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}(g(\xi))$ denote the worst-case expectation of a measurable function $g : \mathbb{R}^k \to \mathbb{R}$. The worst-case expectation can be described by the following optimization problem [6]
$$
\begin{aligned}
\theta_{\mathrm{wc}} = \sup_{\nu \in \mathcal{M}_+} \quad & \int_{\mathbb{R}^k} g(\xi)\, \nu(d\xi) \\
\text{subject to} \quad & \int_{\mathbb{R}^k} \nu(d\xi) = 1 \\
& \int_{\mathbb{R}^k} \xi\, \nu(d\xi) = \mu \\
& \int_{\mathbb{R}^k} \xi \xi^\top \nu(d\xi) = \Sigma + \mu\mu^\top,
\end{aligned} \tag{6}
$$
where $\mathcal{M}_+$ represents the cone of nonnegative Borel measures on $\mathbb{R}^k$. This is a linear program with an infinite-dimensional variable and finitely many constraints. Its dual has a finite number of variables and infinitely many constraints, while exhibiting zero duality gap [6]. Hence we can equivalently focus on the dual of (6),
$$
\begin{aligned}
\inf_{M} \quad & \langle \Omega, M \rangle \\
\text{subject to} \quad & \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} \geq g(\xi) \quad \forall \xi \in \mathbb{R}^k,
\end{aligned} \tag{7}
$$
with $M \in \mathbb{S}^{k+1}$ as the variable and $\Omega \in \mathbb{S}^{k+1}$ denoting the second-order moment matrix of $\xi$,
$$\Omega = \begin{pmatrix} \Sigma + \mu\mu^\top & \mu \\ \mu^\top & 1 \end{pmatrix}.$$
We use $\langle \Omega, M \rangle$ for the trace inner product. The matrix $M$ consists of the Lagrange multipliers $Y \in \mathbb{S}^{k}$, $y \in \mathbb{R}^k$ and $y_0 \in \mathbb{R}$ that correspond to the equality constraints for the covariance matrix, the mean vector and the normalization of $\nu$, i.e.
$$M = \begin{pmatrix} Y & y \\ y^\top & y_0 \end{pmatrix}.$$
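The objective of (7) has a simple interpretation: since $\Omega = \mathbb{E}\big([\xi^\top\ 1]^\top [\xi^\top\ 1]\big)$ for any distribution with moments $(\mu, \Sigma)$, we have $\langle \Omega, M \rangle = \mathbb{E}\big([\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top\big)$, so any $M$ feasible for (7) upper-bounds $\mathbb{E}_{\mathbb{P}}(g(\xi))$ for every $\mathbb{P} \in \mathcal{P}(\mu, \Sigma)$ (weak duality). The following small numerical check illustrates this identity; the moments and the symmetric matrix used below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3
mu = rng.standard_normal(k)
A = rng.standard_normal((k, k))
Sigma = A @ A.T + np.eye(k)                        # an arbitrary positive definite covariance
M = rng.standard_normal((k + 1, k + 1))
M = (M + M.T) / 2                                  # an arbitrary symmetric matrix

# Second-order moment matrix Omega = E([xi; 1] [xi; 1]^T)
Omega = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                  [mu[None, :], np.ones((1, 1))]])

# <Omega, M> should equal E([xi; 1]^T M [xi; 1]); check by Monte Carlo with a Gaussian.
xi = rng.multivariate_normal(mu, Sigma, size=200_000)
lifted = np.hstack([xi, np.ones((len(xi), 1))])
quad = np.einsum('ni,ij,nj->n', lifted, M, lifted)
print(np.trace(Omega @ M), quad.mean())            # the two numbers should nearly agree
```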
4.2 Concave piecewise affine function

When $g(\xi)$ is a piecewise affine function, the infinite collection of constraints in (7) parameterized by $\xi$ can be eliminated with techniques used in [6, Theorem 2.3]. First, note that a concave piecewise affine function can be reformulated as the minimum of a linear function over the probability simplex [2, Exercise 4.8]
$$\min_{i=1,\dots,l} (a_i + b_i^\top \xi) = \min_{\lambda \geq 0,\ \mathbf{1}^\top \lambda = 1}\ \sum_{i=1}^{l} \lambda_i (a_i + b_i^\top \xi). \tag{8}$$
Combining this result with the Minmax Lemma [1, Lemma D.4.1], which allows us to swap the order of minimization and maximization, we can convert the infinite-dimensional constraint to a linear matrix inequality. The Minmax Lemma requires $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top$ to be a convex function of $\xi$, i.e. $Y \succeq 0$. This condition is automatically satisfied by any feasible $M$ when $g(\xi)$ is concave piecewise affine, as a negative eigenvalue of $Y$ would lead to a violation of the inequality along the direction of the corresponding eigenvector. For example, in the particular case of $g(\xi) = \min\{\xi, \eta\}$ we can reformulate the constraint as follows:
$$
\begin{aligned}
& \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} \geq \min\{\xi, \eta\} \quad \forall \xi \in \mathbb{R}^k \\
\iff\ & \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} - \min_{\lambda \in \Delta} \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \geq 0 \quad \forall \xi \in \mathbb{R}^k \\
\iff\ & \min_{\xi \in \mathbb{R}^k}\, \max_{\lambda \in \Delta} \bigg\{ \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \bigg\} \geq 0 \\
\iff\ & \max_{\lambda \in \Delta}\, \min_{\xi \in \mathbb{R}^k} \bigg\{ \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \bigg\} \geq 0 \\
\iff\ & \min_{\xi \in \mathbb{R}^k} \bigg\{ \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \bigg\} \geq 0, \quad \text{for some } \lambda \in \Delta \\
\iff\ & M - \begin{bmatrix} 0 & \lambda_{1,\dots,k}/2 \\ \lambda_{1,\dots,k}^\top/2 & \lambda_{k+1}\eta \end{bmatrix} \succeq 0, \quad \text{for some } \lambda \in \Delta,
\end{aligned}
$$
where the last step uses the fact that a quadratic function of $\xi$ is globally nonnegative if and only if the associated matrix in homogeneous form is positive semidefinite, and where $\Delta = \big\{ \lambda \in \mathbb{R}^{k+1} : \sum_{i=1}^{k+1} \lambda_i = 1,\ \lambda \geq 0 \big\}$ denotes the probability simplex in $\mathbb{R}^{k+1}$.
4.3 Results for the case of batch Bayesian optimization

Using the previous results we can derive the following tractable optimization problem, which is a conservative approximation of the minimization of (5),
$$
\begin{aligned}
\inf \quad & \langle \Omega(X), M \rangle \\
\text{subject to} \quad & M - \begin{bmatrix} 0 & \lambda_{1,\dots,k}/2 \\ \lambda_{1,\dots,k}^\top/2 & \lambda_{k+1}\eta \end{bmatrix} \succeq 0, \quad \lambda \in \Delta,
\end{aligned} \tag{9}
$$
with variables $M \in \mathbb{S}^{k+1}$, $X \in \mathbb{R}^{k \times n}$, and $\lambda \in \mathbb{R}^{k+1}$. The dependence of $\Omega$ on $X$ is, in general, complicated; only in very simple cases is it convex. For example, in Appendix A we show that when $k = 1$ and the kernel used in the GP is linear, the objective of the optimization problem is convex separately in $X$ and in $(M, \lambda)$.

Let $h(X)$ denote the optimal value of (9) when minimizing only over $M$ and $\lambda$. This inner minimization is a semidefinite optimization problem, so it can be solved globally with standard software tools. At an upper level, we pass the function $h(X)$ to a nonlinear solver, thus optimizing the whole problem. As a result, the task of the nonlinear solver is reduced to optimizing over $X$.
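For a fixed batch $X$, the inner problem defining $h(X)$ can be set up directly in an off-the-shelf conic modelling tool. The following is a minimal sketch using cvxpy with the SCS solver; the posterior moments and $\eta$ are illustrative numbers rather than outputs of an actual GP, and the parameterization $M = N + A(\lambda)$ with $N \succeq 0$ is simply a convenient way to encode the linear matrix inequality of (9).

```python
import cvxpy as cp
import numpy as np

def h_of_X(mu, Sigma, eta):
    """Inner minimization of (9) over (M, lambda) for fixed posterior moments
    mu = mu(X), Sigma = Sigma(X) and eta = min(y_0), i.e. the value h(X)."""
    k = len(mu)
    Omega = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                      [mu[None, :], np.ones((1, 1))]])
    lam = cp.Variable(k + 1, nonneg=True)
    # Build A(lambda) = [[0, lambda_{1..k}/2], [lambda_{1..k}^T/2, lambda_{k+1}*eta]]
    # as a linear combination of constant matrices.
    basis = []
    for i in range(k):
        C = np.zeros((k + 1, k + 1))
        C[i, k] = C[k, i] = 0.5
        basis.append(C)
    C = np.zeros((k + 1, k + 1))
    C[k, k] = eta
    basis.append(C)
    A = sum(lam[i] * basis[i] for i in range(k + 1))
    # Encode the LMI M - A(lambda) >= 0 by writing M = N + A(lambda) with N PSD.
    N = cp.Variable((k + 1, k + 1), PSD=True)
    objective = cp.sum(cp.multiply(Omega, N + A))      # <Omega(X), M> with M = N + A
    prob = cp.Problem(cp.Minimize(objective), [cp.sum(lam) == 1])
    prob.solve(solver=cp.SCS)
    return prob.value

# Illustrative posterior moments for a batch of k = 2 points.
mu, Sigma, eta = np.array([0.3, -0.1]), np.eye(2), 0.0
print(h_of_X(mu, Sigma, eta))   # approximately -0.1, foreshadowing the issue discussed next
```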
Unfortunately, the worst-case expectation achieved in (9) is trivially equal to $\min\{\mu, \eta\}$, i.e. the minimum over the components of $\mu$ and $\eta$, as we can deduce from the following proposition.

Proposition 4.1. For a concave piecewise affine function $g(\xi) = \min_{i=1,\dots,l}(a_i + b_i^\top \xi)$ the optimal value of (6) is
$$\sup_{\mathbb{P} \in \mathcal{P}} \mathbb{E}_{\mathbb{P}}(g(\xi)) = \min_{i=1,\dots,l}(a_i + b_i^\top \mu).$$

Proof. First note that the expectation of $g(\xi)$ is bounded above by
$$\mathbb{E}\Big( \min_{i=1,\dots,l}(a_i + b_i^\top \xi) \Big) \leq \min_{i=1,\dots,l} \mathbb{E}(a_i + b_i^\top \xi) = \min_{i=1,\dots,l}(a_i + b_i^\top \mu). \tag{10}$$
We will construct a distribution that achieves this upper bound. Consider the one-dimensional, uncorrelated random variables $z, w$ with
$$z \sim \mathcal{U}\left(-\tfrac{1}{\epsilon}, \tfrac{1}{\epsilon}\right), \qquad w \sim \mathcal{N}(0, \epsilon), \tag{11}$$
i.e. $z$ is uniformly distributed in $(-\epsilon^{-1}, \epsilon^{-1})$, and $w$ is a zero-mean Gaussian with variance $\epsilon$, where $\epsilon \in \mathbb{R}_{++}$. Now, assuming $0 < \epsilon \leq \tfrac{1}{\sqrt{3}}$, consider the random variable $x$ with the mixture distribution
$$x = \begin{cases} z & \text{with probability } 3\epsilon^2, \\ w & \text{with probability } 1 - 3\epsilon^2. \end{cases} \tag{12}$$
Since both of the mixing distributions are zero mean, the resulting distribution is zero mean with variance
$$\mathbb{E}(x^2) = 3\epsilon^2\, \mathbb{E}(z^2) + (1 - 3\epsilon^2)\, \mathbb{E}(w^2) = 1 + \epsilon(1 - 3\epsilon^2). \tag{13}$$
In the limit $\epsilon \to 0$ the random variable $x$ has zero mean and unit variance, but its probability distribution function is infinitesimal everywhere outside the origin. Assuming $\mathbf{x}$ is a vector of independent variables distributed identically to $x$ for $\epsilon \to 0$, the random vector $\xi = \Sigma^{1/2}\mathbf{x} + \mu$ has covariance matrix $\Sigma$ and mean value $\mu$, with its probability distribution being infinitesimal everywhere except at $\mu$. For this random vector the inequality (10) holds with equality, which completes the proof.

It is worth noting that when $g(\xi) = \max_{i=1,\dots,l}(a_i + b_i^\top \xi)$, i.e. a convex piecewise affine function, we have the analogous lower bound
$$\mathbb{E}\Big( \max_{i=1,\dots,l}(a_i + b_i^\top \xi) \Big) \geq \max_{i=1,\dots,l} \mathbb{E}(a_i + b_i^\top \xi) = \max_{i=1,\dots,l}(a_i + b_i^\top \mu). \tag{14}$$
This bound is also tight: it is achieved by the same limiting random vector $\xi$. However, this is of no practical importance here, since the worst-case expectation performs a maximization over distributions.
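The construction in the proof can be checked numerically. The sketch below simulates the mixture (12) in one dimension for a sequence of decreasing $\epsilon$; the particular values of $\mu$ and $\eta$ are arbitrary. The sample variance stays close to $1 + \epsilon(1 - 3\epsilon^2)$, while the sample mean of $\min\{\xi, \eta\}$ approaches $\min\{\mu, \eta\}$, illustrating how the bound (10) is attained in the limit.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eta = 0.3, 0.0                 # illustrative values; Sigma = 1 in one dimension

def sample_mixture(eps, n):
    """n samples of (12): z ~ U(-1/eps, 1/eps) with prob. 3*eps^2, else w ~ N(0, eps)."""
    pick_z = rng.random(n) < 3 * eps**2
    z = rng.uniform(-1 / eps, 1 / eps, n)
    w = rng.normal(0.0, np.sqrt(eps), n)          # N(0, eps) has variance eps
    return np.where(pick_z, z, w)

for eps in [0.3, 0.1, 0.03, 0.01]:
    x = sample_mixture(eps, 1_000_000)
    xi = x + mu                                   # mean mu, variance 1 + eps*(1 - 3*eps^2)
    loss = np.minimum(xi, eta).mean()             # approaches min(mu, eta) = 0 as eps -> 0
    print(f"eps={eps}: var={x.var():.3f}, E[min(xi, eta)]={loss:.3f}")
```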
4.4 Bounded support set

One way to exclude the problematic distributions that lead to these trivial solutions is to bound the support of the distribution. One possible choice would be to enforce each distribution $\nu \in \mathcal{P}$ to be supported only on the set
$$S = \big\{\xi \in \mathbb{R}^k \mid (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) < \alpha^2\big\}, \tag{15}$$
which for $\alpha = 3$ and a Gaussian probability measure $\mathcal{N}(\mu, \Sigma)$ contains nearly all of the mass. Under this constraint, the dual problem (7) becomes
$$
\begin{aligned}
\inf \quad & \langle \Omega, M \rangle \\
\text{subject to} \quad & \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} \geq \min_{i=1,\dots,l}(a_i + b_i^\top \xi) \quad \forall \xi \in S.
\end{aligned} \tag{16}
$$
Unfortunately, in this problem $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top$ is not necessarily convex in $\xi$ ($Y$ is, in general, indefinite). Hence we cannot apply the Minmax Lemma to reduce the infinite collection of constraints to a linear matrix inequality. The infinite number of constraints can instead be eliminated conservatively by sum-of-squares techniques, using the following result.

Proposition 4.2. By the Positivstellensatz [4], the set
$$
\Big\{\xi \in \mathbb{R}^k \;\Big|\; h_i(\xi) = \begin{bmatrix} \xi \\ 1 \end{bmatrix}^\top M \begin{bmatrix} \xi \\ 1 \end{bmatrix} - (a_i + b_i^\top \xi) \leq 0,\ i = 1, \dots, l, \quad h_{l+1}(\xi) = (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) - \alpha^2 \leq 0 \Big\} \tag{17}
$$
is empty if and only if there exist globally positive polynomials $s_i$, $p_{i,j}$ such that
$$-1 = s_0(\xi) - \sum_{i=1}^{l+1} s_i(\xi)\, h_i(\xi) + \sum_{i \neq j} p_{i,j}(\xi)\, h_i(\xi)\, h_j(\xi) \quad \forall \xi \in \mathbb{R}^k. \tag{18}$$
The proof is a straightforward specialization of the results in [4].

Under the typical sum-of-squares relaxation [4], we choose a vector $z$ of monomials and represent each $s_i$ as $z^\top L_i z$ with $L_i \succeq 0$ (and similarly for each $p_{i,j}$). We then equate coefficients so that the polynomial identity dictated by Proposition 4.2 holds. However, in our case the coefficients of the sum-of-squares polynomials are multiplied by the elements of $M$, resulting in a bilinearity that makes the problem difficult to solve.
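To make the $z^\top L_i z$ parameterization concrete, the sketch below certifies that a simple univariate polynomial is a sum of squares by coefficient matching against a monomial basis, using cvxpy; the polynomial is an arbitrary illustrative choice unrelated to the specific $h_i$ above. In (18) the analogous feasibility problem would additionally contain the unknown entries of $M$ inside the coefficients, which is exactly the source of the bilinearity.

```python
import cvxpy as cp

# Certify that p(x) = x^4 - x^2 + 1 is a sum of squares by writing p(x) = z(x)^T L z(x)
# with the monomial vector z(x) = [1, x, x^2] and L >= 0.
p = {0: 1.0, 1: 0.0, 2: -1.0, 3: 0.0, 4: 1.0}     # coefficients of p, indexed by degree

L = cp.Variable((3, 3), symmetric=True)
constraints = [L >> 0]
for d in range(5):
    # The coefficient of x^d in z(x)^T L z(x) is the sum of L[i, j] over i + j == d.
    terms = [L[i, d - i] for i in range(3) if 0 <= d - i <= 2]
    constraints.append(sum(terms) == p[d])

prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve(solver=cp.SCS)
print(prob.status)   # 'optimal' means a valid L, and hence an SOS certificate, was found
```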
5 Conclusions

The main goal of this mini-project was to explore the connections between distributionally robust optimization techniques and Gaussian processes. To this end, a better understanding of both fields was obtained, encouraging further exploration in this area. The main obstacle to obtaining successful results was probably the issue described in Proposition 4.1. An easier way to avoid this problem might be to consider the best-case expectation instead of the worst-case expectation, since the former does not exhibit the problems described in Proposition 4.1.

References

[1] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001.

[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[3] Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: Relieving the Myopia of Bayesian Optimisation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[4] Pablo A. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.

[5] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[6] Steve Zymler, Daniel Kuhn, and Berç Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, 137(1):167–198, 2011.
A Linear kernel

The linear kernel is defined as $k(x, x') = \sigma_b^2 + \sigma_\nu^2 (x - c)^\top (x' - c)$. This kernel is non-stationary. The hyperparameter $c$ determines the point through which all lines in the posterior pass, while the hyperparameter $\sigma_b^2$ specifies the function value at zero by placing a prior on it. When only one future point is considered ($k = 1$), the optimization problem (9) can be simplified as follows. We denote by $\mathbf{1}$ a vector with all components equal to one, and define
$$A = X_0 - \mathbf{1}c^\top, \qquad B = \left( AA^\top + \left(\tfrac{\sigma_n}{\sigma_\nu}\right)^2 I + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1}\mathbf{1}^\top \right)^{-1}, \qquad w = x - c.$$
The resulting one-dimensional variance of (1) is given by
$$
\begin{aligned}
\sigma^2(x) &= \sigma_b^2 + \sigma_\nu^2 w^\top w - \big(\sigma_\nu^2 A w + \sigma_b^2 \mathbf{1}\big)^\top \frac{B}{\sigma_\nu^2} \big(\sigma_\nu^2 A w + \sigma_b^2 \mathbf{1}\big) \\
&= \sigma_b^2 + \sigma_\nu^2 w^\top (I - A^\top B A) w - 2\sigma_b^2\, \mathbf{1}^\top B A w - \frac{\sigma_b^4}{\sigma_\nu^2}\, \mathbf{1}^\top B \mathbf{1},
\end{aligned} \tag{19}
$$
which, as we will prove below, is a positive definite quadratic.¹ First, note that for $E \in \mathbb{S}^n_{++}$, $F \in \mathbb{S}^n_+$ the following equivalences hold
$$
\begin{aligned}
E \succeq F &\iff z^\top z \geq z^\top E^{-1/2} F E^{-1/2} z \quad \forall z \in \mathbb{R}^n \\
&\iff z^\top E^{1/2} F^+ E^{1/2} z \geq z^\top z \quad \forall z \in \mathcal{R}(F) \\
&\iff z^\top F^+ z \geq z^\top E^{-1} z \quad \forall z \in \mathcal{R}(F),
\end{aligned}
$$
where $E^{1/2}$ denotes the principal square root of $E$, $F^+$ the pseudoinverse of $F$, and $\mathcal{R}(F)$ the row space of $F$. Using the above result, we have
$$
\begin{aligned}
& z^\top \left( AA^\top + \left(\tfrac{\sigma_n}{\sigma_\nu}\right)^2 I + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1}\mathbf{1}^\top \right) z \geq z^\top AA^\top z \quad \forall z \in \mathbb{R}^l \\
\implies\ & z^\top B z \leq z^\top (AA^\top)^+ z \quad \forall z \in \mathcal{R}(AA^\top) \\
\implies\ & z^\top A^\top B A z \leq z^\top A^\top (AA^\top)^+ A z \quad \forall z \in \mathbb{R}^n,
\end{aligned}
$$

¹This can also be proved very easily by noting that $\sigma^2(x)$ is a quadratic form in $x$ and, as a valid variance function, it is always positive. However, we present the longer proof as it provides better insight.
since $Az \in \mathcal{R}(AA^\top)$ for all $z \in \mathbb{R}^n$. Finally, note that $A^\top (AA^\top)^+ A \preceq I$ since
$$z^\top A^\top (AA^\top)^+ A z \leq z^\top I z \quad \forall z \in \mathcal{R}(A^\top), \qquad Az = 0 \quad \forall z \in \mathcal{N}(A). \tag{20}$$
Hence, we conclude that $A^\top B A \preceq I$ and, as a result, (19) is a positive definite quadratic.

The one-dimensional mean function $\mu(x)$ of (1) is a linear function of $x$ and is given by
$$\mu(x) = \left[ A w + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1} \right]^\top B\, y_0. \tag{21}$$
As a result, the objective function of (9) is quadratic separately in $x$ and in $(M, \lambda)$.
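As a sanity check, the closed form (19) can be compared numerically against the general posterior variance (3) specialized to the linear kernel, and its positivity can be probed on random inputs; the hyperparameters and data below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 2, 6
sigma_b, sigma_v, sigma_n = 1.0, 0.8, 0.1        # illustrative hyperparameters (sigma_v = sigma_nu)
c = np.zeros(n)

def lin_kernel(P, Q):
    """k(x, x') = sigma_b^2 + sigma_v^2 (x - c)^T (x' - c), applied row-wise."""
    return sigma_b**2 + sigma_v**2 * (P - c) @ (Q - c).T

X0 = rng.standard_normal((l, n))
K00 = lin_kernel(X0, X0) + sigma_n**2 * np.eye(l)

A = X0 - c                                        # A = X_0 - 1 c^T
B = np.linalg.inv(A @ A.T + (sigma_n / sigma_v)**2 * np.eye(l)
                  + (sigma_b / sigma_v)**2 * np.ones((l, l)))

def var_direct(x):
    """sigma^2(x) from the general formula (3), specialized to k = 1."""
    k0x = lin_kernel(X0, x[None, :])
    return (lin_kernel(x[None, :], x[None, :]) - k0x.T @ np.linalg.solve(K00, k0x)).item()

def var_eq19(x):
    """sigma^2(x) from the closed form (19)."""
    w = x - c
    return (sigma_b**2 + sigma_v**2 * w @ (np.eye(n) - A.T @ B @ A) @ w
            - 2 * sigma_b**2 * np.ones(l) @ B @ A @ w
            - sigma_b**4 / sigma_v**2 * np.ones(l) @ B @ np.ones(l))

xs = rng.standard_normal((200, n)) * 5
print(max(abs(var_direct(x) - var_eq19(x)) for x in xs))   # should be at round-off level
print(min(var_direct(x) for x in xs) > 0)                  # positivity, as claimed above
```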