Practical Bayesian Optimization of Machine Learning Algorithms

Practical Bayesian Optimization of Machine Learning Algorithms
CS 294, University of California, Berkeley
Tuesday, April 20, 2016

Motivation

Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. in Bayesian models)
- regularization parameters
- optimization procedure parameters (step size, minibatch size)

Can we automate the optimization of these high-level parameters? With some assumptions and Bayesian magic, yes!

Gaussian Process

Usually we observe inputs $x_i$ and outputs $y_i$. For now, we assume $y_i = f(x_i)$ (no noise) for some unknown function $f$. Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, $p(f \mid X, y)$, and then making predictions via

$$p(y_* \mid x_*, X, y) = \int p(y_* \mid f, x_*)\, p(f \mid X, y)\, df$$

A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.

Gaussian Process

A Gaussian Process is defined so that for any $n \in \mathbb{N}$,

$$\big(f(x_1), \ldots, f(x_n)\big) \sim \mathcal{N}(\mu, K)$$

where $\mu \in \mathbb{R}^n$ and $K_{i,j} = \kappa(x_i, x_j)$ for a positive definite kernel function $\kappa$. Key idea: if $x_i$ and $x_j$ are similar according to the kernel, then the outputs of the function at those points should be similar as well.

Gaussian Process

Let the prior on the regression function be a GP:

$$f(x) \sim \mathcal{GP}\big(m(x), \kappa(x, x')\big)$$

where
$$m(x) = \mathbb{E}[f(x)], \qquad \kappa(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))^T\big]$$
are the mean and covariance/kernel function respectively. For a finite set of points, this defines a joint Gaussian

$$p(f \mid X) = \mathcal{N}(f \mid \mu, K)$$

where $\mu = (m(x_1), \ldots, m(x_n))$; usually $m(x) = 0$.

GP Noise-free and Multivariate Gaussian Refresher

We see a training set $D = \{(x_i, f_i),\ i \in [N]\}$ where $f_i = f(x_i)$. Given a test set $X_*$ of size $N_* \times D$, we want to predict the outputs $f_*$. By the definition of a GP,

$$\begin{pmatrix} f \\ f_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu \\ \mu_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$$

Thus

$$p(f_* \mid X_*, X, f) = \mathcal{N}(\hat{\mu}, \hat{\Sigma})$$
$$\hat{\mu} = \mu(X_*) + K_*^T K^{-1} (f - \mu(X)), \qquad \hat{\Sigma} = K_{**} - K_*^T K^{-1} K_*$$
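The predictive equations above translate almost directly into code. Below is a minimal sketch (not from the talk) of noise-free GP prediction in NumPy, with a zero prior mean and a squared-exponential kernel; the kernel settings and toy data are illustrative.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel matrix between inputs A (n x d) and B (m x d)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dists / ell**2)

def gp_posterior(X, f, X_star, ell=1.0, sigma_f=1.0, jitter=1e-10):
    """mu_hat = K_*^T K^{-1} f and Sigma_hat = K_** - K_*^T K^{-1} K_* (zero prior mean)."""
    K = se_kernel(X, X, ell, sigma_f) + jitter * np.eye(len(X))   # K    (N x N)
    K_star = se_kernel(X, X_star, ell, sigma_f)                   # K_*  (N x N_*)
    K_ss = se_kernel(X_star, X_star, ell, sigma_f)                # K_** (N_* x N_*)
    mu_hat = K_star.T @ np.linalg.solve(K, f)
    Sigma_hat = K_ss - K_star.T @ np.linalg.solve(K, K_star)
    return mu_hat, Sigma_hat

# Toy usage: condition on three noise-free observations of sin(x).
X = np.array([[-2.0], [0.0], [1.5]])
f = np.sin(X).ravel()
X_star = np.linspace(-3, 3, 50)[:, None]
mu_hat, Sigma_hat = gp_posterior(X, f, X_star)
```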

Priors and Kernels

Samples from a prior $p(f \mid X)$, using the squared exponential / Gaussian / RBF kernel (1D case):

$$\kappa(x, x') = \sigma_f^2 \exp\left\{ -\frac{1}{2\ell^2} (x - x')^2 \right\}$$

$\ell$ controls the horizontal scale of variation, $\sigma_f^2$ controls the vertical variation.

Automatic Relevance Determination (ARD) squared exponential kernel:

$$\kappa(x, x') = \theta_0 \exp\left\{ -\frac{1}{2} (x - x')^T \operatorname{Diag}(\theta)^{-1} (x - x') \right\}, \qquad \theta = [\theta_1^2 \cdots \theta_d^2]$$
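As a complement, here is a minimal sketch (not from the talk) of the ARD squared-exponential kernel and of drawing sample functions from the corresponding zero-mean GP prior; `theta[d]` plays the role of $\theta_d^2$ and all settings are illustrative.

```python
import numpy as np

def ard_se_kernel(A, B, theta0=1.0, theta=None):
    """ARD SE kernel: theta0 * exp(-0.5 * sum_d (a_d - b_d)^2 / theta[d])."""
    if theta is None:
        theta = np.ones(A.shape[1])
    diff = A[:, None, :] - B[None, :, :]          # pairwise differences, shape (n, m, d)
    r2 = np.sum(diff**2 / theta, axis=-1)
    return theta0 * np.exp(-0.5 * r2)

# Draw three sample functions from the GP prior on a 1D grid.
rng = np.random.default_rng(0)
X = np.linspace(-5, 5, 200)[:, None]
K = ard_se_kernel(X, X, theta0=1.0, theta=np.array([1.0])) + 1e-10 * np.eye(len(X))
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
```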

Noisy Observations

We actually observe $y$ where $y = f(x) + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0, \sigma_y^2)$, so

$$\operatorname{Cov}(y \mid X) = K + \sigma_y^2 I =: K_y$$

Assume $\mathbb{E}[f(x)] = 0$ (so $\mathbb{E}[y] = 0$ as well); then in the case of a single test input $x_*$,

$$\hat{\mu} = k_*^T K_y^{-1} y, \qquad \hat{\Sigma} = k_{**} - k_*^T K_y^{-1} k_*$$

where $k_* = [\kappa(x_*, x_1), \ldots, \kappa(x_*, x_N)]$ and $k_{**} = \kappa(x_*, x_*)$.
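A minimal sketch (not from the talk) of these noisy predictive equations at a single test point $x_*$; the covariance function is passed in as a callable, e.g. one of the kernels sketched above.

```python
import numpy as np

def gp_predict_noisy(X, y, x_star, kernel, sigma_y=0.1):
    """mu_hat = k_*^T K_y^{-1} y and var_hat = k_** - k_*^T K_y^{-1} k_*."""
    N = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    K_y = K + sigma_y**2 * np.eye(N)                         # K_y = K + sigma_y^2 I
    k_star = np.array([kernel(x_star, X[i]) for i in range(N)])
    mu_hat = k_star @ np.linalg.solve(K_y, y)
    var_hat = kernel(x_star, x_star) - k_star @ np.linalg.solve(K_y, k_star)
    return mu_hat, var_hat

# Toy usage with an isotropic SE kernel on scalar inputs.
se = lambda a, b: np.exp(-0.5 * np.sum((a - b)**2))
X = np.array([-1.0, 0.5, 2.0])
y = np.sin(X)
mu_hat, var_hat = gp_predict_noisy(X, y, 1.0, se)
```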

Bayesian Optimization with GP Priors

Setup for Bayesian optimization ($x$ is the vector of MLA hyperparameters):
1. $x \in \mathcal{X} \subset \mathbb{R}^D$ and $\mathcal{X}$ is bounded
2. $f(x)$ is drawn from a GP prior
3. We want to minimize $f(x)$ on $\mathcal{X}$
4. Observations are of the form $\{x_n, y_n\}_{n=1}^N$ with $y_n \sim \mathcal{N}(f(x_n), \nu)$
5. An acquisition function (AF), $a : \mathcal{X} \to \mathbb{R}^+$, is used via $x_{\text{next}} = \arg\max_x a(x)$

$a(x) = a(x; \{x_n, y_n\}, \theta)$ depends on the previous observations and the GP hyperparameters. It depends on the model solely through:
- the predictive mean function $\mu(x; \{x_n, y_n\}, \theta)$
- the predictive variance function $\sigma^2(x; \{x_n, y_n\}, \theta)$
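A minimal sketch (not the authors' code) of the resulting loop: query a GP predictive model, maximize the acquisition function over random candidates in the bounded domain, evaluate $f$ there, and repeat. `gp_predict` and `acquisition` are assumed callables (for instance the expected improvement sketched after the Acquisition Functions slide), and all other names are illustrative.

```python
import numpy as np

def bayes_opt(f, bounds, gp_predict, acquisition, n_init=5, n_iter=20,
              n_candidates=2000, seed=0):
    """Sequential Bayesian optimization: evaluate f at the argmax of the acquisition."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    X = rng.uniform(lo, hi, size=(n_init, len(lo)))        # initial design in bounded X
    y = np.array([f(x) for x in X])                        # observations y_n
    for _ in range(n_iter):
        cand = rng.uniform(lo, hi, size=(n_candidates, len(lo)))
        mu, sigma = gp_predict(X, y, cand)                 # predictive mean / std at candidates
        x_next = cand[np.argmax(acquisition(mu, sigma, y.min()))]   # x_next = argmax_x a(x)
        X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()                        # best hyperparameters found
```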

What is f?

This framework is useful when evaluations of $f$ are expensive, as is the case when each evaluation requires training a machine learning algorithm. Thus, we should be smart about where we evaluate next.

Acquisition Functions

Let $\phi, \Phi$ be the pdf and cdf of a standard normal, and let $x_{\text{best}} = \arg\min_{x_n} f(x_n)$.

1. Probability of Improvement:
$$a_{PI}(x; \{x_n, y_n\}, \theta) = \Phi(\gamma(x)) = \mathbb{P}(N \le \gamma(x)), \qquad \gamma(x) = \frac{f(x_{\text{best}}) - \mu(x; \{x_n, y_n\}, \theta)}{\sigma(x; \{x_n, y_n\}, \theta)}$$
where $N \sim \mathcal{N}(0, 1)$. Points that have a high probability of being infinitesimally less than $f(x_{\text{best}})$ will be drawn over points that offer larger gains but less certainty.

2. Expected Improvement (over the current best) [BCd10]:
$$a_{EI}(x; \{x_n, y_n\}, \theta) = \sigma(x; \{x_n, y_n\}, \theta)\,\big[\gamma(x)\Phi(\gamma(x)) + \phi(\gamma(x))\big]$$
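Both acquisition functions are easy to express in terms of the GP predictive mean and standard deviation. A minimal sketch (not the authors' code), suitable for plugging into the loop sketched earlier:

```python
import numpy as np
from scipy.stats import norm

def gamma(mu, sigma, f_best):
    """gamma(x) = (f(x_best) - mu(x)) / sigma(x)."""
    return (f_best - mu) / sigma

def prob_of_improvement(mu, sigma, f_best):
    """a_PI(x) = Phi(gamma(x))."""
    return norm.cdf(gamma(mu, sigma, f_best))

def expected_improvement(mu, sigma, f_best):
    """a_EI(x) = sigma(x) * [gamma(x) * Phi(gamma(x)) + phi(gamma(x))]."""
    g = gamma(mu, sigma, f_best)
    return sigma * (g * norm.cdf(g) + norm.pdf(g))
```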

Covariance Function and its Hyperparameters

The ARD SE kernel is too smooth; instead, use the ARD Matérn 5/2 kernel:

$$K_{M52}(x, x') = \theta_0 \left(1 + \sqrt{5 r^2(x, x')} + \frac{5}{3} r^2(x, x')\right) \exp\left\{ -\sqrt{5 r^2(x, x')} \right\}, \qquad r^2(x, x') = \sum_{d=1}^{D} (x_d - x_d')^2 / \theta_d^2$$

This kernel yields sample functions which are twice differentiable.

D + 3 hyperparameters:
- D length scales $\theta_{1:D}$
- amplitude $\theta_0$
- observation noise $\nu$
- constant mean $m$
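A minimal sketch (not the authors' code) of the ARD Matérn 5/2 kernel above; `theta` holds the squared length scales $\theta_d^2$ and `theta0` the amplitude.

```python
import numpy as np

def matern52_ard(A, B, theta0=1.0, theta=None):
    """K_M52(x, x') = theta0 * (1 + sqrt(5 r^2) + (5/3) r^2) * exp(-sqrt(5 r^2))."""
    if theta is None:
        theta = np.ones(A.shape[1])
    diff = A[:, None, :] - B[None, :, :]
    r2 = np.sum(diff**2 / theta, axis=-1)      # r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2
    sqrt5r = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + sqrt5r + (5.0 / 3.0) * r2) * np.exp(-sqrt5r)
```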

Integrated Acquisition Function

To be fully Bayesian, we should marginalize over the hyperparameters (denoted $\theta$) by computing the integrated acquisition function (IAF):

$$\hat{a}(x; \{x_n, y_n\}) = \int a(x; \{x_n, y_n\}, \theta)\, p(\theta \mid \{x_n, y_n\})\, d\theta$$

This expectation is a good generalization for the uncertainty in the chosen parameters. We can blend the $a(\cdot)$ functions arising from the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the integrated expected improvement (IEI). To do this MC, use slice sampling [MP10].
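A minimal sketch (not the authors' code) of that Monte Carlo estimate: average the acquisition over posterior samples of the GP hyperparameters (e.g. produced by slice sampling). `gp_predict_theta` is an assumed helper that returns the predictive mean and standard deviation under a given hyperparameter setting.

```python
import numpy as np

def integrated_acquisition(cand, X, y, theta_samples, gp_predict_theta, acquisition):
    """hat{a}(x) ~ (1/S) * sum_s a(x; {x_n, y_n}, theta_s), theta_s ~ p(theta | data)."""
    total = np.zeros(len(cand))
    for theta in theta_samples:
        mu, sigma = gp_predict_theta(X, y, cand, theta)
        total += acquisition(mu, sigma, y.min())
    return total / len(theta_samples)
```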

Costs

We don't just care about minimizing $f$: evaluating $f$ can result in vastly different execution times depending on the MLA hyperparameters. The authors propose optimizing expected improvement per second. We don't know the true $f$, and we also don't know the duration function $c(x) : \mathcal{X} \to \mathbb{R}^+$. Solution: model $\ln c(x)$ along with $f$; assuming independence makes the computation easier.
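A minimal sketch (not the authors' code) of expected improvement per second: a second GP models $\ln c(x)$, and EI is divided by the predicted duration $\exp(\mu_{\ln c}(x))$. Here `mu_lnc` is assumed to be the predictive mean of that duration GP at the candidate points.

```python
import numpy as np

def ei_per_second(ei_values, mu_lnc):
    """Scale EI by the predicted evaluation duration exp(mu_lnc(x))."""
    return ei_values / np.exp(mu_lnc)
```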

Parallelization Scheme

Use batch parallelism plus a sequential strategy over yet-to-be-evaluated points, by computing MC estimates of the AF over different possible realizations of the pending $y$'s. Suppose $N$ evaluations have completed, yielding $\{x_n, y_n\}_{n=1}^N$, and $J$ evaluations are pending at locations $\{x_j\}_{j=1}^J$. Choose the new point based on the expected AF under all possible outcomes of the pending evaluations:

$$\hat{a}(x; \{x_n, y_n\}, \theta, \{x_j\}) = \int_{\mathbb{R}^J} a(x; \{x_n, y_n\}, \theta, \{x_j, y_j\})\, p(\{y_j\} \mid \{x_j\}, \{x_n, y_n\})\, dy_1 \cdots dy_J$$
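A minimal sketch (not the authors' code) of the Monte Carlo estimate above: outcomes at the pending locations are sampled from the GP predictive distribution, the acquisition is computed as if those outcomes had been observed, and the results are averaged. Helper signatures follow the earlier sketches; sampling the pending outcomes independently is a simplification.

```python
import numpy as np

def acquisition_under_pending(cand, X, y, X_pending, gp_predict, acquisition,
                              n_fantasies=10, seed=0):
    """Average the acquisition over sampled outcomes y_j at the pending locations x_j."""
    rng = np.random.default_rng(seed)
    total = np.zeros(len(cand))
    for _ in range(n_fantasies):
        mu_p, sigma_p = gp_predict(X, y, X_pending)        # predictive at pending points
        y_fantasy = rng.normal(mu_p, sigma_p)              # y_j ~ p(y_j | x_j, {x_n, y_n})
        X_aug = np.vstack([X, X_pending])
        y_aug = np.concatenate([y, y_fantasy])
        mu, sigma = gp_predict(X_aug, y_aug, cand)         # acquisition as if y_j observed
        total += acquisition(mu, sigma, y_aug.min())
    return total / n_fantasies
```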

Methods and Metrics

- Expected improvement with GP hyperparameter marginalization: GP EI MCMC
- Expected improvement with optimized hyperparameters: GP EI Opt
- EI per second: GP EI per Second
- N-times parallelized GP EI MCMC: Nx GP EI MCMC

Online LDA

Hyperparameters:
- learning rate $\rho_t = (\tau_0 + t)^{-\kappa}$, i.e. $(\tau_0, \kappa)$
- minibatch size

The cited paper uses an exhaustive search of size $6 \times 6 \times 8$ (288).
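For concreteness, a minimal sketch (not from the talk) of the online LDA learning-rate schedule $\rho_t = (\tau_0 + t)^{-\kappa}$ and of a $6 \times 6 \times 8$ grid like the one the cited paper searches over; the particular grid values below are illustrative placeholders.

```python
import itertools
import numpy as np

def learning_rate(t, tau0, kappa):
    """Online LDA step size: rho_t = (tau0 + t)^(-kappa)."""
    return (tau0 + t) ** (-kappa)

kappas = np.linspace(0.5, 1.0, 6)            # 6 candidate decay exponents (illustrative)
tau0s = np.linspace(1.0, 1024.0, 6)          # 6 candidate delays (illustrative)
batch_sizes = [2**k for k in range(2, 10)]   # 8 candidate minibatch sizes (illustrative)
grid = list(itertools.product(kappas, tau0s, batch_sizes))
assert len(grid) == 6 * 6 * 8                # 288 configurations
```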

Results

3-layer CNN

Hyperparameters:
- number of epochs to run the model
- learning rate
- four weight costs (one for each layer and the softmax output weights)
- width, scale, and power of the response normalization on the pooling layers

The cited paper uses an exhaustive search of size $6 \times 6 \times 8$ (288).

Results

References

[BCd10] E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. arXiv e-prints, December 2010.

[MP10] I. Murray and R. P. Adams. Slice Sampling Covariance Hyperparameters of Latent Gaussian Models. arXiv e-prints, June 2010.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. arXiv e-prints, June 2012.