Distributionally robust optimization techniques in batch Bayesian optimisation

Nikitas Rontsis

June 13, 2016

1 Introduction

This report is concerned with batch Bayesian optimization of an unknown function f. Using a Gaussian process (GP) framework to model the function, we search for the best batch of k points at which the function will next be evaluated. According to [3], evaluating the expected loss of a specific choice of k points requires expensive integration over multidimensional regions. We reformulate the problem using worst-case expectation techniques with second-order moment information [6]. This reformulation is a conservative approximation of the original problem: it considers all distributions with a given mean and covariance, including the Gaussian distribution assumed in the original problem, and it is free from the expensive integrations. We show, however, that the formulation is overly conservative and returns trivial solutions. A remedy is introduced by bounding the support of the distributions considered in the worst-case expectation, but this leads to a semi-infinite optimization problem. Sum-of-squares techniques are suggested as a possible relaxation.

2 Initial Definitions

Let $f : \mathbb{R}^n \to \mathbb{R}$ be a smooth function to be minimized. Assume that $l$ points $y_i = f(x_i)$ have been gathered so far, forming a dataset $D_0 = \{(x_i, y_i)\} = (X_0, y_0)$ with $X_0 \in \mathbb{R}^{l \times n}$ and $y_0 \in \mathbb{R}^l$. In order to find the next $k$ points $X \in \mathbb{R}^{k \times n}$ at which

the function will be evaluated, a GP is used to build a statistical picture of the function's form. For an overview of Gaussian processes see [5]. The properties of a GP are determined by a prior mean function $m(x) = \mathbb{E}(f(x))$, which without loss of generality can be assumed to be zero, and a prior positive semi-definite covariance function $k(x, x') = \mathbb{E}((f(x) - m(x))(f(x') - m(x')))$. Given these, the GP dictates the following probability distribution for the function values $y$ at the $k$ selected points $X$,

$y \mid D_0 \sim \mathcal{N}(\mu(X), \Sigma(X))$,    (1)

with mean value and variance

$\mu(X) = K(X_0, X)^\top (K(X_0, X_0) + \sigma_n^2 I)^{-1} y_0$,    (2)
$\Sigma(X) = K(X, X) - K(X_0, X)^\top (K(X_0, X_0) + \sigma_n^2 I)^{-1} K(X_0, X)$,    (3)

where $K(A, B)_{(i,j)} = k(A_i, B_j)$, i.e. each element of the matrix $K(A, B)$ is the covariance between the $i$-th point of $A$ and the $j$-th point of $B$. The mean $\mu$ and the variance $\Sigma$ also depend on the dataset $D_0$, but we do not denote this dependence explicitly in order to keep the notation uncluttered.

3 Expected Loss Function

The expected loss of the next evaluation is

$\Lambda(X \mid D_0) = \mathbb{E}(\min\{y, \eta\})$,    (4)

with $\eta = \min y_0$. This can be reformulated as

$\Lambda(X \mid D_0) = \eta \int_{C_0} \mathcal{N}(y; \mu, \Sigma)\,dy + \sum_{i=1}^{k} \int_{C_i} y_i\, \mathcal{N}(y; \mu, \Sigma)\,dy$,    (5)

where the integrals are taken over $C_0 = \{ y \in \mathbb{R}^k \mid y_j \geq \eta,\ j = 1, \ldots, k \}$ and $C_i = \{ y \in \mathbb{R}^k \mid y_i \leq \eta,\ y_i \leq y_j,\ j = 1, \ldots, k \}$. In order to calculate the above $k+1$ $k$-dimensional integrals, [3] suggests using Expectation Propagation. However, Expectation Propagation is an approximate and expensive operation that has to be performed at every step of the optimization algorithm used to minimize (4), considerably increasing the complexity of the resulting global optimization procedure.
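
For concreteness, the following Python sketch computes the posterior (2)-(3) for a small example and estimates the expected loss (4) by Monte Carlo. It is not code from [3]; the squared-exponential kernel, the test function and all numbers are assumptions made purely for illustration. It is this kind of multidimensional integral that [3] approximates with Expectation Propagation, and that the reformulation below seeks to avoid.

    # Minimal sketch of equations (2)-(4): GP posterior for a batch X and a
    # Monte Carlo estimate of the expected loss.  The squared-exponential
    # kernel and all numerical values are illustrative assumptions.
    import numpy as np

    def kern(A, B, ell=1.0, sf2=1.0):
        # k(x, x') = sf2 * exp(-||x - x'||^2 / (2 ell^2))
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sf2 * np.exp(-0.5 * d2 / ell**2)

    rng = np.random.default_rng(0)
    X0 = rng.uniform(-2, 2, size=(8, 1))          # l = 8 observed points, n = 1
    y0 = np.sin(3 * X0[:, 0]) + 0.05 * rng.standard_normal(8)
    sigma_n2 = 0.05**2
    eta = y0.min()

    X = rng.uniform(-2, 2, size=(3, 1))           # candidate batch, k = 3
    Kinv = np.linalg.inv(kern(X0, X0) + sigma_n2 * np.eye(len(X0)))
    K0X = kern(X0, X)
    mu = K0X.T @ Kinv @ y0                                  # equation (2)
    Sigma = kern(X, X) - K0X.T @ Kinv @ K0X                 # equation (3)

    # Monte Carlo estimate of Lambda(X | D0) = E[min{y_1,...,y_k, eta}], eq. (4).
    samples = rng.multivariate_normal(mu, Sigma + 1e-9 * np.eye(3), size=200_000)
    loss = np.minimum(samples.min(axis=1), eta).mean()
    print(mu, Sigma.diagonal(), loss)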

4 Worst Case Expectation Formulation

In this section we derive conservative approximations for the minimization of (4) using worst-case expectation techniques, in order to avoid the multidimensional integrations present in (5).

4.1 Generic Formulation

Define the set $\mathcal{P}(\mu, \Sigma)$ of all probability distributions on $\mathbb{R}^k$ with given mean vector $\mu$ and covariance matrix $\Sigma \succeq 0$. Then let $\sup_{P \in \mathcal{P}} \mathbb{E}_P(g(\xi))$ denote the worst-case expectation of a measurable function $g : \mathbb{R}^k \to \mathbb{R}$. The worst-case expectation can be described by the following optimization problem [6]:

$\theta_{wc} = \sup_{\nu \in \mathcal{M}_+} \int_{\mathbb{R}^k} g(\xi)\,\nu(d\xi)$
subject to $\int_{\mathbb{R}^k} \nu(d\xi) = 1$,    (6)
$\int_{\mathbb{R}^k} \xi\,\nu(d\xi) = \mu$,
$\int_{\mathbb{R}^k} \xi \xi^\top \nu(d\xi) = \Sigma + \mu\mu^\top$,

where $\mathcal{M}_+$ represents the cone of nonnegative Borel measures on $\mathbb{R}^k$. This is a linear program with an infinite-dimensional variable and finitely many constraints. Its dual has a finite number of variables and infinitely many constraints, and exhibits zero duality gap [6]. Hence we can equivalently focus on the dual of (6),

$\inf_{M} \ \langle \Omega, M \rangle$
subject to $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top \geq g(\xi) \quad \forall \xi \in \mathbb{R}^k$,    (7)

with $M \in \mathbb{S}^{k+1}$ as the variable and $\Omega \in \mathbb{S}^{k+1}$ denoting the second-order moment matrix of $\xi$,

$\Omega = \begin{pmatrix} \Sigma + \mu\mu^\top & \mu \\ \mu^\top & 1 \end{pmatrix}$.

We use $\langle \Omega, M \rangle$ for the trace inner product. The matrix $M$ consists of the Lagrange multipliers $Y \in \mathbb{S}^k$, $y \in \mathbb{R}^k$ and $y_0 \in \mathbb{R}$ that correspond to the equality constraints for the covariance matrix, the mean vector and $\nu$ integrating to one, i.e.

$M = \begin{pmatrix} Y & y \\ y^\top & y_0 \end{pmatrix}$.
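
The reason a feasible $M$ in (7) upper-bounds (6) is that, for any distribution $P$ with mean $\mu$ and covariance $\Sigma$, $\mathbb{E}_P\big([\xi^\top\ 1] M [\xi^\top\ 1]^\top\big) = \langle \Omega, M \rangle$; taking expectations of the pointwise constraint then gives $\langle \Omega, M \rangle \geq \mathbb{E}_P(g(\xi))$. The short sketch below (with assumed numbers, not values from the report) checks this identity by Monte Carlo; a Gaussian sample is used only for convenience, since the identity depends on the first two moments alone.

    # Sketch of why (7) upper-bounds (6): for ANY distribution with mean mu and
    # covariance Sigma, E[[xi^T 1] M [xi^T 1]^T] = <Omega, M>.  Assumed data.
    import numpy as np

    rng = np.random.default_rng(1)
    k = 3
    mu = rng.standard_normal(k)
    A = rng.standard_normal((k, k))
    Sigma = A @ A.T + 0.1 * np.eye(k)                 # a valid covariance matrix

    Omega = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                      [mu[None, :], np.ones((1, 1))]])

    B = rng.standard_normal((k + 1, k + 1))
    M = (B + B.T) / 2                                 # arbitrary symmetric multiplier matrix

    xi = rng.multivariate_normal(mu, Sigma, size=500_000)
    z = np.hstack([xi, np.ones((len(xi), 1))])        # rows are [xi^T 1]
    quad = np.einsum('ij,jk,ik->i', z, M, z)          # [xi^T 1] M [xi^T 1]^T per sample

    print(np.trace(Omega @ M), quad.mean())           # agree up to Monte Carlo error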

4.2 Concave piecewise affine function

When $g(\xi)$ is a concave piecewise affine function, the infinite collection of constraints in (7), parameterized by $\xi$, can be eliminated with techniques used in [6, Theorem 2.3]. First, note that a concave piecewise affine function can be written as the minimum of a linear function over the probability simplex [2, Exercise 4.8]:

$\min_{i=1,\ldots,l} (a_i + b_i^\top \xi) = \min_{\lambda \in \Delta} \sum_{i=1}^{l} \lambda_i (a_i + b_i^\top \xi)$,    (8)

with $\Delta$ the probability simplex of matching dimension (defined explicitly below). Combining this result with the Minmax Lemma [1, Lemma D.4.1], which allows us to swap the order of minimization and maximization, we can convert the infinite family of constraints in (7) into a linear matrix inequality. The Minmax Lemma requires $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top$ to be a convex function of $\xi$, i.e. $Y \succeq 0$. This is necessary anyway when $g(\xi)$ is concave piecewise affine, as a negative eigenvalue of $Y$ would lead to a violation of the inequality along the direction of the corresponding eigenvector. For example, in the particular case $g(\xi) = \min\{\xi, \eta\}$ we can reformulate the constraint as follows (using (8) with $a_i = 0$, $b_i = e_i$ for $i \leq k$ and $a_{k+1} = \eta$, $b_{k+1} = 0$):

$[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top \geq \min\{\xi, \eta\} \quad \forall \xi \in \mathbb{R}^k$
$\iff [\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top - \min_{\lambda \in \Delta} \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \geq 0 \quad \forall \xi \in \mathbb{R}^k$
$\iff \min_{\xi \in \mathbb{R}^k} \max_{\lambda \in \Delta} \Big\{ [\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \Big\} \geq 0$
$\iff \max_{\lambda \in \Delta} \min_{\xi \in \mathbb{R}^k} \Big\{ [\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \Big\} \geq 0$
$\iff \min_{\xi \in \mathbb{R}^k} \Big\{ [\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top - \Big( \sum_{i=1}^{k} \lambda_i \xi_i + \lambda_{k+1} \eta \Big) \Big\} \geq 0 \quad \text{for some } \lambda \in \Delta$
$\iff M - \begin{bmatrix} 0 & \lambda_{1,\ldots,k}/2 \\ \lambda_{1,\ldots,k}^\top/2 & \lambda_{k+1}\eta \end{bmatrix} \succeq 0 \quad \text{for some } \lambda \in \Delta$,

where $\Delta = \big\{ \lambda \in \mathbb{R}^{k+1} \mid \sum_{i=1}^{k+1} \lambda_i = 1,\ \lambda \geq 0 \big\}$ denotes the probability simplex in $\mathbb{R}^{k+1}$.
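
To make the reduction concrete, here is a minimal CVXPY sketch (not code from the report) of the resulting semidefinite program for fixed moments: minimize $\langle \Omega, M \rangle$ over $M$ and $\lambda \in \Delta$ subject to the LMI above. For convenience it optimizes over the shifted variable $\tilde{M} = M - \big[\begin{smallmatrix} 0 & \lambda_{1:k}/2 \\ \lambda_{1:k}^\top/2 & \lambda_{k+1}\eta \end{smallmatrix}\big]$, which the LMI requires to be positive semidefinite, and uses the identity $\langle \Omega, M \rangle = \langle \Omega, \tilde{M} \rangle + \mu^\top \lambda_{1:k} + \lambda_{k+1}\eta$. The moment data and $\eta$ are assumed toy values.

    # CVXPY sketch of the SDP obtained from the LMI reduction of Section 4.2
    # for g(xi) = min{xi_1, ..., xi_k, eta}.  Assumed toy data; the variable
    # Mtil stands for M minus the lambda-dependent right-hand side, which the
    # LMI forces to be PSD, and <Omega, rhs> = mu^T lam_{1:k} + lam_{k+1}*eta.
    import numpy as np
    import cvxpy as cp

    k = 3
    mu = np.array([0.3, -0.1, 0.5])          # assumed posterior mean mu(X)
    Sigma = 0.2 * np.eye(k)                  # assumed posterior covariance Sigma(X)
    eta = 0.0                                # best function value observed so far

    Omega = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
                      [mu[None, :], np.ones((1, 1))]])

    Mtil = cp.Variable((k + 1, k + 1), PSD=True)   # Mtil = M - rhs(lambda)
    lam = cp.Variable(k + 1, nonneg=True)          # lambda on the probability simplex

    objective = cp.Minimize(cp.trace(Omega @ Mtil) + mu @ lam[:k] + eta * lam[k])
    prob = cp.Problem(objective, [cp.sum(lam) == 1])
    prob.solve()

    print(prob.value)                # worst-case expectation bound
    print(min(mu.min(), eta))        # the two coincide (see Proposition 4.1)

Solving this for the toy data returns $\min\{\min_i \mu_i, \eta\}$, anticipating the degeneracy formalized in Proposition 4.1 below.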

4.3 Results for the case of batch Bayesian optimization

Using the previous results we can derive the following tractable optimization problem, which is a conservative approximation of the minimization of (5):

$\inf_{M, X, \lambda} \ \langle \Omega(X), M \rangle$
subject to $M - \begin{bmatrix} 0 & \lambda_{1,\ldots,k}/2 \\ \lambda_{1,\ldots,k}^\top/2 & \lambda_{k+1}\eta \end{bmatrix} \succeq 0, \quad \lambda \in \Delta$,    (9)

with variables $M \in \mathbb{S}^{k+1}$, $X \in \mathbb{R}^{k \times n}$ and $\lambda \in \mathbb{R}^{k+1}$. The dependence of $\Omega$ on $X$ is, in general, complicated; only in very simple cases is it convex. For example, in Appendix A we show that when $k = 1$ and the kernel used in the GP is linear, the objective of the optimization problem is convex separately in $X$ and in $(M, \lambda)$.

Let $h(X)$ denote the optimal value of (9) when minimizing only over $M$ and $\lambda$. This inner minimization is a semidefinite program, so it can be solved globally with standard software tools. At an upper level, we pass the function $h(X)$ to a nonlinear solver, which optimizes the whole problem; the task of the nonlinear solver is thus reduced to the outer optimization over $X$ alone.

Unfortunately, the worst-case expectation achieved by (9) is trivially equal to $\min\{\mu, \eta\}$, as we can deduce from the following proposition.

Proposition 4.1. For a concave piecewise affine function $g(\xi) = \min_{i=1,\ldots,l}(a_i + b_i^\top \xi)$, the optimal value of (6) is
$\sup_{P \in \mathcal{P}} \mathbb{E}_P(g(\xi)) = \min_{i=1,\ldots,l}(a_i + b_i^\top \mu)$.

Proof. First note that $\mathbb{E}_P(g(\xi))$ is bounded above:

$\mathbb{E}\Big( \min_{i=1,\ldots,l}(a_i + b_i^\top \xi) \Big) \leq \min_{i=1,\ldots,l} \mathbb{E}(a_i + b_i^\top \xi) = \min_{i=1,\ldots,l}(a_i + b_i^\top \mu)$.    (10)

We will construct a distribution that achieves this upper bound. Consider the one-dimensional, uncorrelated random variables $z, w$ with

$z \sim \mathcal{U}\left(-\tfrac{1}{\epsilon}, \tfrac{1}{\epsilon}\right), \qquad w \sim \mathcal{N}(0, \epsilon)$,    (11)

i.e. $z$ is uniformly distributed in $(-\epsilon^{-1}, \epsilon^{-1})$ and $w$ is a zero-mean Gaussian with variance $\epsilon$, where $\epsilon \in \mathbb{R}_{++}$.

Now, assuming $0 < \epsilon \leq \tfrac{1}{3}$, consider the random variable $x$ with the mixture distribution

$x = \begin{cases} z & \text{with probability } 3\epsilon^2, \\ w & \text{with probability } 1 - 3\epsilon^2. \end{cases}$    (12)

Since both of the mixing distributions are zero-mean, the resulting distribution is zero-mean with variance

$\mathbb{E}(x^2) = 3\epsilon^2\, \mathbb{E}(z^2) + (1 - 3\epsilon^2)\, \mathbb{E}(w^2) = 1 + \epsilon(1 - 3\epsilon^2)$.    (13)

In the limit $\epsilon \to 0$ the random variable $x$ has zero mean and unit variance, but its probability distribution function is infinitesimal everywhere outside the origin. Letting $\mathbf{x}$ be a vector of independent variables distributed identically to $x$ as $\epsilon \to 0$, the random vector $\xi = \Sigma^{1/2}\mathbf{x} + \mu$ has covariance matrix $\Sigma$ and mean value $\mu$, with its probability distribution being infinitesimal everywhere except at $\mu$. For this random vector the inequality (10) holds with equality.

It is worth noting that when $g(\xi) = \max_{i=1,\ldots,l}(a_i + b_i^\top \xi)$, i.e. a convex piecewise affine function, we have a similar lower bound,

$\mathbb{E}\Big( \max_{i=1,\ldots,l}(a_i + b_i^\top \xi) \Big) \geq \max_{i=1,\ldots,l} \mathbb{E}(a_i + b_i^\top \xi) = \max_{i=1,\ldots,l}(a_i + b_i^\top \mu)$.    (14)

This bound is also tight: it is attained by the random vector $\xi$ constructed above. However, this is of no practical importance, since the worst-case expectation is a maximization over $\mathcal{P}$.

4.4 Bounded support set

One way to exclude the problematic distributions that lead to trivial solutions is to bound the support of the distribution. One possible choice is to require every distribution $\nu \in \mathcal{P}$ to be supported only on the set

$S = \{ \xi \in \mathbb{R}^k \mid (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) < \alpha^2 \}$,    (15)

which for $\alpha = 3$ and a Gaussian probability measure $\mathcal{N}(\mu, \Sigma)$ contains nearly all of the mass. Under this constraint, the dual problem (7) becomes

$\inf_{M} \ \langle \Omega, M \rangle$
subject to $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top \geq \min_{i=1,\ldots,l}(a_i + b_i^\top \xi) \quad \forall \xi \in S$.    (16)
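
Before examining the tractability of (16), the following sketch (with assumed numbers, not values from the report) illustrates numerically both the construction in the proof of Proposition 4.1 and why restricting the support to $S$ rules it out: the mixture roughly matches the prescribed mean and covariance, drives $\mathbb{E}(\min\{\xi, \eta\})$ towards $\min\{\mu, \eta\}$, and does so only by placing a tiny amount of mass very far from $\mu$; it is precisely that far-away mass that carries almost all of the variance.

    # Numerical illustration (assumed values) of the mixture in the proof of
    # Proposition 4.1 and of why the support bound (15) excludes it.
    import numpy as np

    rng = np.random.default_rng(2)
    k, eps, N = 2, 5e-3, 2_000_000
    mu = np.array([0.4, -0.2])
    Sigma = np.array([[0.3, 0.1], [0.1, 0.2]])
    eta = 0.1

    # Scalar mixture x: z ~ U(-1/eps, 1/eps) w.p. 3*eps^2, w ~ N(0, eps) otherwise.
    pick_z = rng.random((N, k)) < 3 * eps**2
    x = np.where(pick_z,
                 rng.uniform(-1 / eps, 1 / eps, size=(N, k)),
                 rng.normal(0.0, np.sqrt(eps), size=(N, k)))

    L = np.linalg.cholesky(Sigma)            # a valid square root of Sigma
    xi = x @ L.T + mu                        # xi = Sigma^{1/2} x + mu

    # Roughly mu and Sigma (noisy: almost all the variance comes from the
    # few hundred far-away samples).
    print(xi.mean(axis=0), np.cov(xi.T))
    # Approaches min{mu, eta} (exactly in the limit eps -> 0).
    print(np.minimum(xi.min(axis=1), eta).mean(), min(mu.min(), eta))

    maha2 = np.einsum('ij,jk,ik->i', xi - mu, np.linalg.inv(Sigma), xi - mu)
    outside = maha2 >= 9                     # samples outside S for alpha = 3
    print(outside.mean(), maha2[outside].sum() / maha2.sum())

The last line shows that only a vanishing fraction of samples falls outside $S$, yet those samples account for essentially all of the second moment. Once the support is restricted to $S$, the worst-case distribution can therefore no longer be essentially a point mass at $\mu$.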

Unfortunately, in this problem $[\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top$ is not necessarily convex in $\xi$ ($Y$ is, in general, indefinite). Hence we cannot apply the Minmax Lemma to reduce the infinite family of constraints to linear matrix inequalities. The infinitely many constraints can, however, be eliminated conservatively with sum-of-squares techniques, using the following result.

Proposition 4.2. By the Positivstellensatz [4], the set

$\{ \xi \in \mathbb{R}^k \mid h_i(\xi) \leq 0,\ i = 1, \ldots, l+1 \}$, where
$h_i(\xi) = [\xi^\top\ 1]\, M\, [\xi^\top\ 1]^\top - (a_i + b_i^\top \xi)$, $i = 1, \ldots, l$,
$h_{l+1}(\xi) = (\xi - \mu)^\top \Sigma^{-1} (\xi - \mu) - \alpha^2$,    (17)

is empty if and only if there exist globally positive polynomials $s_i$, $p_{i,j}$ such that

$-1 = s_0(\xi) - \sum_{i=1}^{l+1} s_i(\xi)\, h_i(\xi) + \sum_{i < j} p_{i,j}(\xi)\, h_i(\xi)\, h_j(\xi) \quad \forall \xi \in \mathbb{R}^k$.    (18)

The proof is a straightforward specialization of the results in [4]. Under the typical sum-of-squares relaxation [4], we choose a vector $z$ of monomials and represent $s_i$ as $z^\top L_i z$ with $L_i \succeq 0$ (and similarly for $p_{i,j}$). We then impose linear equality constraints on the elements of $L_i$ so that the polynomial identity (18) holds. However, in our case the coefficients of the sum-of-squares polynomials are multiplied by the elements of $M$, resulting in bilinear terms that make the problem difficult to solve.

5 Conclusions

The main goal of this mini-project was to explore the connections between distributionally robust optimization techniques and Gaussian processes. To this end, a better understanding of both fields was obtained, encouraging further exploration in this area. The main obstacle to obtaining successful results was the issue described in Proposition 4.1. A simpler way to avoid this issue might be to consider the best-case expectation instead of the worst-case expectation, since the former does not exhibit the problems described in Proposition 4.1.

References

[1] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001.

[2] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[3] Javier González, Michael A. Osborne, and Neil D. Lawrence. GLASSES: Relieving the Myopia of Bayesian Optimisation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

[4] Pablo A. Parrilo. Structured Semidefinite Programs and Semialgebraic Geometry Methods in Robustness and Optimization. PhD thesis, California Institute of Technology, 2000.

[5] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[6] Steve Zymler, Daniel Kuhn, and Berç Rustem. Distributionally robust joint chance constraints with second-order moment information. Mathematical Programming, 137(1):167–198, 2011.

A Linear kernel

The linear kernel is defined as $k(x, x') = \sigma_b^2 + \sigma_\nu^2 (x - c)^\top (x' - c)$. This kernel is non-stationary. The hyperparameter $c$ determines the point through which all lines in the posterior pass, and the hyperparameter $\sigma_b^2$ specifies the prior variance of the function value at that point. When only one future point is considered ($k = 1$), the optimization problem (9) can be simplified as follows. Denote by $\mathbf{1}$ the vector with all components equal to one, and define

$A = X_0 - \mathbf{1}c^\top, \qquad B = \left( A A^\top + \left(\tfrac{\sigma_n}{\sigma_\nu}\right)^2 I + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1}\mathbf{1}^\top \right)^{-1}, \qquad w = x - c$.

The resulting one-dimensional variance of (1) is given by

$\sigma^2(x) = \sigma_b^2 + \sigma_\nu^2 w^\top w - \left[ \sigma_\nu^2 A w + \sigma_b^2 \mathbf{1} \right]^\top \tfrac{B}{\sigma_\nu^2} \left[ \sigma_\nu^2 A w + \sigma_b^2 \mathbf{1} \right]$
$\phantom{\sigma^2(x)} = \sigma_b^2 + \sigma_\nu^2 w^\top (I - A^\top B A) w - 2\sigma_b^2 \mathbf{1}^\top B A w - \left(\tfrac{\sigma_b^2}{\sigma_\nu}\right)^2 \mathbf{1}^\top B \mathbf{1}$,    (19)

which, as we will prove below, is a positive definite quadratic.¹

¹ This can also be proved very easily by noting that $\sigma^2(x)$ is a quadratic form and, as a valid variance function, it is always positive. However, we keep the longer proof as it provides better insight.

First, note that for $E \in \mathbb{S}^n_{++}$ and $F \in \mathbb{S}^n_{+}$ the following equivalences hold:

$E \succeq F \iff z^\top z \geq z^\top E^{-1/2} F E^{-1/2} z \ \ \forall z \in \mathbb{R}^n$
$\phantom{E \succeq F} \iff z^\top E^{1/2} F^{+} E^{1/2} z \geq z^\top z \ \ \forall z \in \mathcal{R}(F)$
$\phantom{E \succeq F} \iff z^\top F^{+} z \geq z^\top E^{-1} z \ \ \forall z \in \mathcal{R}(F)$,

where $E^{1/2}$ denotes the principal square root of $E$, $F^{+}$ the pseudoinverse of $F$, and $\mathcal{R}(F)$ the row space of $F$. Using the above result, we have

$z^\top \left( A A^\top + \left(\tfrac{\sigma_n}{\sigma_\nu}\right)^2 I + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1}\mathbf{1}^\top \right) z \geq z^\top A A^\top z \ \ \forall z \in \mathbb{R}^l$
$\implies z^\top \left( A A^\top + \left(\tfrac{\sigma_n}{\sigma_\nu}\right)^2 I + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1}\mathbf{1}^\top \right)^{-1} z \leq z^\top (A A^\top)^{+} z \ \ \forall z \in \mathcal{R}(A A^\top)$
$\implies z^\top A^\top B A z \leq z^\top A^\top (A A^\top)^{+} A z \ \ \forall z \in \mathbb{R}^n$,

since $Az \in \mathcal{R}(A A^\top)$ for all $z \in \mathbb{R}^n$. Finally, note that $A^\top (A A^\top)^{+} A \preceq I$, since

$z^\top A^\top (A A^\top)^{+} A z \leq z^\top z \ \ \forall z \in \mathcal{R}(A^\top), \qquad A z = 0 \ \ \forall z \in \mathcal{N}(A)$.    (20)

Hence we conclude that $A^\top B A \preceq I$ and, as a result, (19) is a positive definite quadratic.

The one-dimensional mean function $\mu(x)$ of (1) is a linear function of $x$, given by

$\mu(x) = \left[ A w + \left(\tfrac{\sigma_b}{\sigma_\nu}\right)^2 \mathbf{1} \right]^\top B\, y_0$.    (21)

As a result, the objective function of (9) is quadratic separately in $x$ and in $(M, \lambda)$.
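
As a sanity check on the algebra above, the following sketch (with randomly generated, assumed data) compares the closed forms (19) and (21) against the generic posterior formulas (2)-(3) for the linear kernel; for the data generated here the two agree to numerical precision.

    # Numerical check (assumed data) of the closed-form posterior moments
    # (19) and (21) for the linear kernel against the generic formulas (2)-(3).
    import numpy as np

    rng = np.random.default_rng(3)
    l, n = 6, 2
    sigma_b, sigma_nu, sigma_n = 0.7, 1.3, 0.1
    c = rng.standard_normal(n)
    X0 = rng.standard_normal((l, n))
    y0 = rng.standard_normal(l)
    x = rng.standard_normal(n)                 # single test point (k = 1)

    def kern(P, Q):
        # linear kernel k(x, x') = sigma_b^2 + sigma_nu^2 (x - c)^T (x' - c)
        return sigma_b**2 + sigma_nu**2 * (P - c) @ (Q - c).T

    # Generic formulas (2)-(3).
    Kinv = np.linalg.inv(kern(X0, X0) + sigma_n**2 * np.eye(l))
    k0x = kern(X0, x[None, :])[:, 0]
    mu_gp = k0x @ Kinv @ y0
    var_gp = kern(x[None, :], x[None, :])[0, 0] - k0x @ Kinv @ k0x

    # Closed forms (19) and (21).
    one = np.ones(l)
    A = X0 - np.outer(one, c)
    w = x - c
    B = np.linalg.inv(A @ A.T + (sigma_n / sigma_nu)**2 * np.eye(l)
                      + (sigma_b / sigma_nu)**2 * np.outer(one, one))
    var_cf = (sigma_b**2 + sigma_nu**2 * w @ (np.eye(n) - A.T @ B @ A) @ w
              - 2 * sigma_b**2 * one @ B @ A @ w
              - (sigma_b**2 / sigma_nu)**2 * one @ B @ one)
    mu_cf = (A @ w + (sigma_b / sigma_nu)**2 * one) @ B @ y0

    print(np.allclose(mu_gp, mu_cf), np.allclose(var_gp, var_cf))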