Practical Bayesian Optimization of Machine Learning Algorithms CS 294 University of California, Berkeley Tuesday, April 20, 2016
Motivation
Machine learning algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size
Can we automate the optimization of these high-level parameters?
With some assumptions and Bayesian magic, yes!
Gaussian Process
Usually we observe inputs $x_i$ and outputs $y_i$. For now, we assume $y_i = f(x_i)$ (no noise) for some unknown function $f$.
Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, $p(f \mid X, y)$, and then making predictions as
$$p(y_* \mid x_*, X, y) = \int p(y_* \mid f, x_*)\, p(f \mid X, y)\, df$$
A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.
Gaussian Process
A Gaussian Process is defined so that for any $n \in \mathbb{N}$,
$$(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}(\mu, K)$$
where $\mu \in \mathbb{R}^n$ and $K_{i,j} = \kappa(x_i, x_j)$ for a positive definite kernel function $\kappa$.
Key idea: if $x_i$ and $x_j$ are similar under the kernel, then the outputs of the function at those points should be similar.
Gaussian Process
Let the prior on the regression function be a GP,
$$f(x) \sim \mathcal{GP}(m(x), \kappa(x, x'))$$
where
$$m(x) = \mathbb{E}[f(x)], \qquad \kappa(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))^T]$$
are the mean and covariance/kernel function respectively.
For a finite set of points, this defines a joint Gaussian
$$p(f \mid X) = \mathcal{N}(f \mid \mu, K)$$
where $\mu = (m(x_1), \ldots, m(x_n))$; usually $m(x) = 0$.
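GP Noise-free and Multivariate Gaussian Refresher
We see a training set $D = \{(x_i, f_i),\ i \in [N]\}$ where $f_i = f(x_i)$.
Given a test set $X_*$ of size $N_* \times D$, we want to predict the outputs $f_*$.
By definition of the GP,
$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$$
Thus
$$p(f_* \mid X_*, X, f) = \mathcal{N}(\hat{\mu}, \hat{\Sigma})$$
$$\hat{\mu} = \mu(X_*) + K_*^T K^{-1} (f - \mu(X))$$
$$\hat{\Sigma} = K_{**} - K_*^T K^{-1} K_*$$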
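The conditioning formulas for $\hat{\mu}$ and $\hat{\Sigma}$ can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes a zero prior mean, uses the squared exponential kernel from the Priors and Kernels slide, and the names `sq_exp_kernel`, `gp_posterior_noise_free` and the tiny `jitter` term are introduced here only for the example.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel matrix between rows of A (n, d) and B (m, d)."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior_noise_free(X, f, X_star, kernel=sq_exp_kernel, jitter=1e-10):
    """Posterior mean and covariance of f_* given noise-free data (zero prior mean)."""
    # jitter is an assumption of this sketch: a numerical safeguard, not part of the model
    K = kernel(X, X) + jitter * np.eye(len(X))
    K_star = kernel(X, X_star)            # shape (N, N_*)
    K_star_star = kernel(X_star, X_star)  # shape (N_*, N_*)
    mu_hat = K_star.T @ np.linalg.solve(K, f)
    Sigma_hat = K_star_star - K_star.T @ np.linalg.solve(K, K_star)
    return mu_hat, Sigma_hat

# Toy data: three noise-free observations of sin(x)
X = np.array([[-1.0], [0.0], [2.0]])
f = np.sin(X[:, 0])
X_star = np.array([[-1.0], [1.0]])  # first test point coincides with a training point
mu_hat, Sigma_hat = gp_posterior_noise_free(X, f, X_star)
```

At a test point that coincides with a training input, the posterior mean reproduces the observed value and the posterior variance collapses to (numerically) zero, which is the interpolation behaviour expected in the noise-free case.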
Priors and Kernels
Samples from a prior $p(f \mid X)$, using the squared exponential / Gaussian / RBF kernel (1D case):
$$\kappa(x, x') = \sigma_f^2 \exp\left\{ -\frac{1}{2\ell^2} (x - x')^2 \right\}$$
$\ell$ controls the horizontal scale of variation, $\sigma_f^2$ controls the vertical variation.
Automatic Relevance Determination (ARD) squared exponential kernel:
$$\kappa(x, x') = \theta_0 \exp\left\{ -\frac{1}{2} (x - x')^T \mathrm{Diag}(\theta)^{-1} (x - x') \right\}, \qquad \theta = [\theta_1^2, \ldots, \theta_D^2]$$
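A small sketch of both kernels may help make the role of the length scales concrete. It assumes that the diagonal of Diag(θ) holds squared length scales, one per input dimension; the function names and the toy inputs are illustrative only.

```python
import math

def se_kernel_1d(x, x_prime, length_scale=1.0, signal_var=1.0):
    """1D squared exponential kernel."""
    return signal_var * math.exp(-0.5 * (x - x_prime) ** 2 / length_scale ** 2)

def ard_se_kernel(x, x_prime, theta0, sq_length_scales):
    """ARD squared exponential kernel; sq_length_scales is the diagonal of Diag(theta)."""
    quad = sum((a - b) ** 2 / s for a, b, s in zip(x, x_prime, sq_length_scales))
    return theta0 * math.exp(-0.5 * quad)

# Two points that differ a little in dimension 1 and a lot in dimension 2
x, x_prime = [0.0, 0.0], [1.0, 5.0]

# Equal length scales: the large gap in dimension 2 makes the points nearly uncorrelated
k_both_relevant = ard_se_kernel(x, x_prime, theta0=1.0, sq_length_scales=[1.0, 1.0])

# Huge length scale in dimension 2: that dimension is effectively switched off
k_dim2_ignored = ard_se_kernel(x, x_prime, theta0=1.0, sq_length_scales=[1.0, 1e12])
```

With a very large length scale on the second dimension, the ARD kernel reduces to the 1D kernel on the first dimension, which is the sense in which ARD determines relevance.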
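Noisy Observations
We actually observe $y$ where $y = f(x) + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, \sigma_y^2)$. Then
$$\mathrm{Cov}(y \mid X) = K + \sigma_y^2 I =: K_y$$
Assume $\mathbb{E}[f(x)] = 0$ (so $\mathbb{E}[y] = 0$ as well). Then, in the case of a single test input $x_*$,
$$\hat{\mu} = k_*^T K_y^{-1} y$$
$$\hat{\Sigma} = k_{**} - k_*^T K_y^{-1} k_*$$
where $k_* = [\kappa(x_*, x_1), \ldots, \kappa(x_*, x_N)]$ and $k_{**} = \kappa(x_*, x_*)$.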
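The single-test-input formulas can be sketched as follows. Using a Cholesky factorization of $K_y$ instead of an explicit inverse is a common numerical choice and an assumption of this sketch, not something the slides prescribe; the kernel, the function names and the toy data are likewise illustrative.

```python
import numpy as np

def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """Squared exponential kernel between 1D input arrays a (n,) and b (m,)."""
    return signal_var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_predict_noisy(x_train, y, x_star, noise_var):
    """Predictive mean and variance of f(x_star) for a zero-mean GP with noisy targets."""
    K_y = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = rbf(x_train, np.array([x_star]))[:, 0]           # vector k_*
    k_star_star = rbf(np.array([x_star]), np.array([x_star]))[0, 0]
    L = np.linalg.cholesky(K_y)                               # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))       # K_y^{-1} y
    v = np.linalg.solve(L, k_star)                            # so v.v = k_*^T K_y^{-1} k_*
    return k_star @ alpha, k_star_star - v @ v

x_train = np.array([-1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, -0.3])

# Predict at a training input under almost no noise and under substantial noise
mean_low, var_low = gp_predict_noisy(x_train, y, 0.0, noise_var=1e-8)
mean_high, var_high = gp_predict_noisy(x_train, y, 0.0, noise_var=1.0)
```

With negligible noise the prediction at a training input essentially interpolates the observed target; with larger noise the mean is pulled toward the zero prior mean and the predictive variance stays well above zero.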
Bayesian Optimization with GP Priors
Setup for Bayesian Optimization ($x$ is the vector of MLA hyperparameters):
1. $x \in \mathcal{X} \subset \mathbb{R}^D$ and $\mathcal{X}$ bounded
2. $f(x)$ is drawn from a GP prior
3. Want to minimize $f(x)$ on $\mathcal{X}$
4. Observations are of the form $\{x_n, y_n\}_{n=1}^N$ with $y_n \sim \mathcal{N}(f(x_n), \nu)$
5. An acquisition function (AF), $a : \mathcal{X} \to \mathbb{R}^+$, is used via $x_{\text{next}} = \arg\max_x a(x)$
$a(x) = a(x; \{x_n, y_n\}, \theta)$ depends on the previous observations and the GP hyperparameters.
It depends on the model solely through the
- predictive mean function $\mu(x; \{x_n, y_n\}, \theta)$
- predictive variance function $\sigma^2(x; \{x_n, y_n\}, \theta)$
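The setup above amounts to a loop: fit the GP to the observations so far, maximize the acquisition function, evaluate $f$ at the chosen point, and repeat. The sketch below is one possible rendering under several assumptions that are not in the slides: a 1D search space discretized into a candidate grid, a zero-mean GP with a fixed squared exponential kernel, expected improvement as the acquisition function, and a cheap toy `objective` standing in for an expensive training run.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, length_scale=0.3):
    """Unit-amplitude squared exponential kernel between 1D arrays."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_mean_std(x_obs, y_obs, x_query, noise_var=1e-6):
    """Predictive mean and standard deviation of a zero-mean GP at x_query."""
    K_y = rbf(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_query)
    mu = K_s.T @ np.linalg.solve(K_y, y_obs)
    var = 1.0 - np.sum(K_s * np.linalg.solve(K_y, K_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))  # clip tiny negative variances

def expected_improvement(mu, sigma, y_best):
    """Expected improvement for minimization."""
    gamma = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(gamma / sqrt(2.0)))  # standard normal cdf
    phi = np.exp(-0.5 * gamma ** 2) / sqrt(2.0 * pi)          # standard normal pdf
    return sigma * (gamma * Phi + phi)

def bayes_opt(f, candidates, n_init=3, n_iter=10, seed=0):
    """Minimize f over a finite candidate grid with GP-based expected improvement."""
    rng = np.random.default_rng(seed)
    x_obs = rng.choice(candidates, size=n_init, replace=False)
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iter):
        mu, sigma = gp_mean_std(x_obs, y_obs, candidates)
        a = expected_improvement(mu, sigma, y_obs.min())
        x_next = candidates[np.argmax(a)]       # x_next = argmax_x a(x)
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, f(x_next))
    return x_obs[np.argmin(y_obs)], y_obs.min()

def objective(x):
    """Toy stand-in for an expensive black-box function (hypothetical)."""
    return (x - 0.65) ** 2

candidates = np.linspace(0.0, 1.0, 101)
best_x, best_y = bayes_opt(objective, candidates)
```

Maximizing over a fixed grid keeps the example short; in higher dimensions the inner maximization of $a(x)$ would need a proper optimizer.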
What is f?
The framework is useful when evaluations of $f$ are expensive.
This is the case when each evaluation requires training a machine learning algorithm.
Thus, we should be smart about where we evaluate next.
Acquisition Functions
Let $\phi$, $\Phi$ be the pdf and cdf of a standard normal, and let $x_{\text{best}} = \arg\min_{x_n} f(x_n)$.
1. Probability of Improvement.
$$a_{PI}(x; \{x_n, y_n\}, \theta) = \Phi(\gamma(x)) = P(N \le \gamma(x)), \qquad N \sim \mathcal{N}(0, 1)$$
$$\gamma(x) = \frac{f(x_{\text{best}}) - \mu(x; \{x_n, y_n\}, \theta)}{\sigma(x; \{x_n, y_n\}, \theta)}$$
Points that have a high probability of being infinitesimally less than $f(x_{\text{best}})$ will be chosen over points that offer larger gains but less certainty.
2. Expected Improvement (over current best) [BCd10]
$$a_{EI}(x; \{x_n, y_n\}, \theta) = \sigma(x; \{x_n, y_n\}, \theta) \left[ \gamma(x) \Phi(\gamma(x)) + \phi(\gamma(x)) \right]$$
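A worked comparison shows how the two criteria can disagree. The sketch below uses the closed-form expected improvement for minimization, $\sigma\,[\gamma\Phi(\gamma) + \phi(\gamma)]$; the two candidate points and their predictive means and standard deviations are made-up numbers chosen only to illustrate the effect.

```python
import math

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gamma(f_best, mu, sigma):
    """Standardized improvement of the predictive mean over the current best."""
    return (f_best - mu) / sigma

def probability_of_improvement(f_best, mu, sigma):
    return std_normal_cdf(gamma(f_best, mu, sigma))

def expected_improvement(f_best, mu, sigma):
    g = gamma(f_best, mu, sigma)
    return sigma * (g * std_normal_cdf(g) + std_normal_pdf(g))

f_best = 1.0
# "Safe" candidate: predicted slightly below the best value, with very little uncertainty
pi_safe = probability_of_improvement(f_best, mu=0.99, sigma=0.01)
ei_safe = expected_improvement(f_best, mu=0.99, sigma=0.01)
# "Bold" candidate: predicted slightly worse on average, but highly uncertain
pi_bold = probability_of_improvement(f_best, mu=1.2, sigma=1.0)
ei_bold = expected_improvement(f_best, mu=1.2, sigma=1.0)
```

Probability of improvement ranks the safe candidate first, because it is very likely to beat the incumbent by a tiny margin, while expected improvement ranks the bold candidate first, because its possible gains are much larger.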
Covariance Function and its Hyperparameters
The ARD squared exponential kernel is too smooth; instead use the ARD Matérn 5/2 kernel:
$$K_{M52}(x, x') = \theta_0 \left( 1 + \sqrt{5 r^2(x, x')} + \frac{5}{3} r^2(x, x') \right) \exp\left\{ -\sqrt{5 r^2(x, x')} \right\}$$
$$r^2(x, x') = \sum_{d=1}^{D} (x_d - x'_d)^2 / \theta_d^2$$
Sample functions are twice differentiable.
$D + 3$ hyperparameters:
- $D$ length scales $\theta_{1:D}$
- amplitude $\theta_0$
- observation noise $\nu$
- constant mean $m$
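As a minimal sketch, the Matérn 5/2 kernel and its scaled squared distance translate directly into code. The function names and the example length scales are illustrative choices, not values from the slides.

```python
import math

def scaled_sq_dist(x, x_prime, length_scales):
    """r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(x, x_prime, length_scales))

def matern52(x, x_prime, theta0, length_scales):
    """ARD Matern 5/2 kernel with amplitude theta0 and one length scale per dimension."""
    r2 = scaled_sq_dist(x, x_prime, length_scales)
    root = math.sqrt(5.0 * r2)
    return theta0 * (1.0 + root + 5.0 * r2 / 3.0) * math.exp(-root)

theta0 = 2.0
length_scales = [1.0, 0.5]   # hypothetical values for a 2D input
origin = [0.0, 0.0]

k_same = matern52(origin, origin, theta0, length_scales)      # equals theta0
k_near = matern52(origin, [0.2, 0.1], theta0, length_scales)
k_far = matern52(origin, [2.0, 1.0], theta0, length_scales)
```

The kernel equals the amplitude $\theta_0$ at zero distance and decays monotonically as the scaled distance grows.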
Integrated Acquisition Function
To be fully Bayesian, we should marginalize over the GP hyperparameters (denoted by $\theta$) by computing the Integrated Acquisition Function (IAF):
$$\hat{a}(x; \{x_n, y_n\}) = \int a(x; \{x_n, y_n\}, \theta)\, p(\theta \mid \{x_n, y_n\})\, d\theta$$
This expectation accounts for the uncertainty in the chosen hyperparameters.
We can blend the $a(\cdot)$ functions arising from the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI).
To do this Monte Carlo estimate, use slice sampling [MP10].
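Given samples $\theta^{(1)}, \ldots, \theta^{(S)}$ from the hyperparameter posterior, the Monte Carlo estimate of the integral is simply the average of the acquisition function over those samples. The sketch below assumes the samples are already available (for instance from a slice sampler, which is not implemented here) and uses a deliberately artificial `toy_acquisition` in place of expected improvement.

```python
import math

def integrated_acquisition(acquisition, x, theta_samples):
    """Monte Carlo estimate of the integrated acquisition: mean over hyperparameter samples."""
    values = [acquisition(x, theta) for theta in theta_samples]
    return sum(values) / len(values)

def toy_acquisition(x, theta):
    """Hypothetical acquisition whose peak location depends on the hyperparameter theta."""
    return math.exp(-(x - theta) ** 2)

# Pretend these were drawn from p(theta | data); in practice they would come from MCMC
theta_samples = [0.0, 1.0]

a_hat_mid = integrated_acquisition(toy_acquisition, 0.5, theta_samples)
a_hat_far = integrated_acquisition(toy_acquisition, 3.0, theta_samples)
```

A point that looks reasonable under every plausible hyperparameter setting scores well under the integrated criterion, even if it is not the optimum for any single setting.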
Costs
- We don't just care about minimizing $f$.
- Evaluating $f$ can result in vastly different execution times depending on the MLA hyperparameters.
- Propose optimizing expected improvement per second.
- We don't know the true $f$, and we also don't know the duration function $c(x) : \mathcal{X} \to \mathbb{R}^+$.
- Solution: model $\ln c(x)$ along with $f$; assuming independence makes the computation easier.
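One simple way to turn this into a score is to divide the expected improvement at a point by its predicted duration. The sketch below assumes a point estimate of the duration obtained by exponentiating the predicted mean of $\ln c(x)$; that particular estimator, the function name and the numbers are assumptions made for illustration.

```python
import math

def ei_per_second(ei, log_cost_mean):
    """Expected improvement divided by the predicted duration exp(E[ln c(x)])."""
    return ei / math.exp(log_cost_mean)

# Candidate A: larger expected improvement, but predicted to take about 10 minutes
score_a = ei_per_second(ei=0.30, log_cost_mean=math.log(600.0))
# Candidate B: smaller expected improvement, but predicted to take about 1 minute
score_b = ei_per_second(ei=0.20, log_cost_mean=math.log(60.0))
```

Here candidate B wins despite its lower expected improvement, because it is predicted to be ten times cheaper to evaluate.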
Parallelization Scheme
Use batch parallelism plus a sequential strategy over the points that have yet to be evaluated, by computing Monte Carlo estimates of the AF over different possible realizations of the pending $y$'s.
- $N$ evaluations have completed: $\{x_n, y_n\}_{n=1}^N$
- $J$ evaluations are pending at locations $\{x_j\}_{j=1}^J$
- Choose a new point based on the expected AF under all possible outcomes of the pending evaluations:
$$\hat{a}(x; \{x_n, y_n\}, \theta, \{x_j\}) = \int_{\mathbb{R}^J} a(x; \{x_n, y_n\}, \theta, \{x_j, y_j\})\, p(\{y_j\}_{j=1}^J \mid \{x_j\}_{j=1}^J, \{x_n, y_n\}_{n=1}^N)\, dy_1 \cdots dy_J$$
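The integral over pending outcomes can be approximated by sampling "fantasy" outcomes for the pending points from the GP posterior, conditioning on each fantasy, and averaging the resulting acquisition values. The sketch below assumes a 1D zero-mean GP with a fixed squared exponential kernel, expected improvement as the AF, near-zero observation noise, and an incumbent that includes the fantasized values; all names and data are illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(a, b, length_scale=0.3):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise_var=1e-6):
    """Zero-mean GP posterior mean vector and covariance matrix at x_query."""
    K_y = rbf(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf(x_obs, x_query)
    mu = K_s.T @ np.linalg.solve(K_y, y_obs)
    cov = rbf(x_query, x_query) - K_s.T @ np.linalg.solve(K_y, K_s)
    return mu, cov

def expected_improvement(mu, sigma, y_best):
    gamma = (y_best - mu) / sigma
    Phi = 0.5 * (1.0 + np.vectorize(erf)(gamma / sqrt(2.0)))
    phi = np.exp(-0.5 * gamma ** 2) / sqrt(2.0 * pi)
    return sigma * (gamma * Phi + phi)

def fantasized_ei(x_cand, x_obs, y_obs, x_pending, n_fantasies=50, seed=0):
    """Monte Carlo estimate of EI averaged over sampled outcomes at pending points."""
    rng = np.random.default_rng(seed)
    mu_p, cov_p = gp_posterior(x_obs, y_obs, x_pending)
    cov_p = cov_p + 1e-8 * np.eye(len(x_pending))   # jitter for sampling stability
    fantasies = rng.multivariate_normal(mu_p, cov_p, size=n_fantasies)
    x_all = np.concatenate([x_obs, x_pending])
    total = np.zeros(len(x_cand))
    for y_pending in fantasies:
        y_all = np.concatenate([y_obs, y_pending])  # pretend the pending runs finished
        mu, cov = gp_posterior(x_all, y_all, x_cand)
        sigma = np.sqrt(np.maximum(np.diag(cov), 1e-12))
        total += expected_improvement(mu, sigma, y_all.min())
    return total / n_fantasies

x_obs = np.array([0.1, 0.9])
y_obs = np.array([0.5, 0.4])
x_pending = np.array([0.5])              # one evaluation still running
x_cand = np.linspace(0.0, 1.0, 11)

mu0, cov0 = gp_posterior(x_obs, y_obs, x_cand)
plain_ei = expected_improvement(mu0, np.sqrt(np.maximum(np.diag(cov0), 1e-12)), y_obs.min())
parallel_ei = fantasized_ei(x_cand, x_obs, y_obs, x_pending)
```

Ignoring the pending run, plain EI still finds the point at 0.5 attractive; after averaging over fantasized outcomes, the acquisition there drops to nearly zero, which steers the next worker away from a location that is already being evaluated.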
Methods and Metrics
- Expected improvement with GP hyperparameter marginalization: GP EI MCMC
- Expected improvement with optimized GP hyperparameters: GP EI Opt
- EI per second: GP EI per Second
- N times parallelized GP EI MCMC: Nx GP EI MCMC
Online LDA
Hyperparameters:
- learning rate $\rho_t = (\tau_0 + t)^{-\kappa}$, with parameters $(\tau_0, \kappa)$
- minibatch size
The cited paper uses an exhaustive search of size $6 \times 6 \times 8$ (288 configurations).
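To make the size of the baseline concrete, the learning rate schedule and an exhaustive grid of the stated shape can be sketched as below. The specific grid values, and which hyperparameter gets 8 values, are placeholders chosen for illustration; the slides only give the grid shape.

```python
from itertools import product

def learning_rate(t, tau0, kappa):
    """Online LDA learning rate rho_t = (tau0 + t)^(-kappa)."""
    return (tau0 + t) ** (-kappa)

# Hypothetical grid values; only the 6 x 6 x 8 shape is taken from the slide
tau0_grid = [1, 4, 16, 64, 256, 1024]
kappa_grid = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
minibatch_grid = [1, 4, 16, 64, 256, 1024, 4096, 16384]

grid = list(product(tau0_grid, kappa_grid, minibatch_grid))
n_configs = len(grid)

# The schedule decays over iterations t
rates = [learning_rate(t, tau0=64, kappa=0.7) for t in (0, 10, 100, 1000)]
```

Every one of the 288 configurations requires a full training run, which is the cost Bayesian optimization tries to avoid by choosing evaluation points adaptively.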
Results
3-layer CNN
Hyperparameters:
- epochs to run the model
- learning rate
- four weight costs (one for each layer and the softmax output weights)
- width, scale and power of the response normalization on the pooling layers
The cited paper uses an exhaustive search of size $6 \times 6 \times 8$ (288 configurations).
Results
References
[BCd10] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. ArXiv e-prints, December 2010.
[MP10] I. Murray and R. Prescott Adams. Slice sampling covariance hyperparameters of latent Gaussian models. ArXiv e-prints, June 2010.
[SLA12] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints, June 2012.