Probabilistic numerics for deep learning


Presenter: Shijia Wang, Department of Engineering Science, University of Oxford. Reinforcement Learning Summer School (RLSS), Montreal, 2017.

Outline
1 Introduction: Probabilistic Numerics
2 Components: Probabilistic modeling of functions; Bayesian optimization; Bayesian stochastic optimization; Integration beats Optimization
3 Conclusion: Experiments; Papers


Definition Take the things we are most interested in achieving and apply them to computation itself: apply probability theory to numerics, the computational cores of our algorithms.

Learning Algorithms Treat numerical functions themselves as learning algorithms. The idea is to apply Bayesian probability theory to them.

Rosenbrock The Rosenbrock function, f(x, y) = (1 − x)² + 100 (y − x²)², is easy to graph on a computer, but there is no easy way to find its global minimum, which lies in a long, flat parabolic valley. The minimum f(x, y) = 0 is attained at (x, y) = (1, 1). The reason is computational limits in the optimization problem, not ignorance about the function's form.
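As a minimal illustration (assuming NumPy and SciPy are available; the starting point is an arbitrary choice), the sketch below evaluates the Rosenbrock function and runs a standard optimizer, counting how many function evaluations it spends crawling along the flat valley:

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(v):
    """Rosenbrock function; global minimum 0 at (1, 1)."""
    x, y = v
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

# Nelder-Mead needs many evaluations to traverse the flat parabolic valley.
result = minimize(rosenbrock, x0=np.array([-1.5, 2.0]), method="Nelder-Mead")
print(result.x)     # close to [1, 1]
print(result.nfev)  # number of function evaluations spent
```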

Uncertainty We are epistemically uncertain about the function because we cannot afford the computation to evaluate it everywhere. So we probabilistically model the function and use tools from decision theory to make optimal use of computation.


Probability Probability is an expression of confidence in a proposition. Probability theory extends logic to uncertain propositions, quantifying plausible inference, and the resulting probabilities depend on the agent's prior knowledge.

Gaussian Distribution Allows for the distribution of any variables conditioned on any other observed variables. The multivariate Gaussian distribution with mean µ and covariance matrix Σ has density p(x) = (1/√det(2πΣ)) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)).
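To make the conditioning property concrete, here is a small sketch (the numbers are illustrative, not from the slides) that conditions a bivariate Gaussian on an observation of its second coordinate, using the standard formulas µ₁|₂ = µ₁ + Σ₁₂ Σ₂₂⁻¹ (x₂ − µ₂) and Σ₁|₂ = Σ₁₁ − Σ₁₂ Σ₂₂⁻¹ Σ₂₁:

```python
import numpy as np

# Joint Gaussian over (x1, x2): mean and covariance (illustrative values).
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

x2_observed = 1.5

# Condition x1 on x2 = x2_observed.
mu_cond = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_observed - mu[1])
var_cond = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

print(mu_cond, var_cond)  # 1.2, 0.36
```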

Gaussian Process A Gaussian process is a collection of random variables such that any finite subset of the variables has a multivariate Gaussian distribution. It is defined by a mean function and a covariance function, and generalizes the multivariate Gaussian to a potentially infinite number of variables.

Kernels/Covariance Function Squared exponential kernel: K_SE(x1, x2) = A exp(−(1/2) Σ_{d∈D} (x_{1d} − x_{2d})² / h_d²), where A, the signal variance, describes variation from the mean, and h_d, the lengthscale, describes smoothness in dimension d.

Kernels/Covariance Function Matérn kernels: K_Matérn(3/2)(x1, x2) = A (1 + √3 r) exp(−√3 r) and K_Matérn(5/2)(x1, x2) = A (1 + √5 r + (5/3) r²) exp(−√5 r), with r = √(Σ_{d∈D} (x_{1d} − x_{2d})² / h_d²). As before, A, the signal variance, describes variation from the mean, and h_d, the lengthscale, describes smoothness.
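A minimal sketch of these two kernel families and of drawing functions from the corresponding GP prior (the function names, default hyperparameters, and jitter constant are my own choices; the kernels follow the formulas above, restricted to scalar inputs):

```python
import numpy as np

def k_se(x1, x2, A=1.0, h=0.3):
    """Squared exponential kernel on scalar inputs."""
    return A * np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

def k_matern52(x1, x2, A=1.0, h=0.3):
    """Matern-5/2 kernel on scalar inputs."""
    r = np.sqrt((x1 - x2) ** 2 / h ** 2)
    return A * (1 + np.sqrt(5) * r + 5.0 / 3.0 * r ** 2) * np.exp(-np.sqrt(5) * r)

xs = np.linspace(0, 1, 100)
K = k_se(xs[:, None], xs[None, :])   # Gram matrix; swap in k_matern52 for rougher draws
K += 1e-8 * np.eye(len(xs))          # jitter for numerical stability

# Draw three sample functions from the zero-mean GP prior.
samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)
```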

Mean Posterior estimates marginalize over hyperparameters: m(x | D) = (1/K) Σ_{k=1}^{K} m(x | D, λ_k), where K is the number of draws of the hyperparameter values λ_k made by slice sampling and D is the observed data.
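A sketch of that average (the toy data are illustrative, and the lengthscale samples stand in for draws produced by slice sampling, which is not shown here):

```python
import numpy as np

def k_se(x1, x2, A=1.0, h=0.3):
    return A * np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

# Toy data; the lengthscale samples stand in for slice-sampling draws of lambda_k.
X = np.array([0.1, 0.4, 0.8])
y = np.array([0.5, -0.3, 0.2])
lengthscale_samples = [0.2, 0.3, 0.5]

def posterior_mean(xstar, X, y, h, noise=1e-4):
    """Zero-mean GP posterior mean m(x* | D, lambda_k) with SE kernel."""
    K = k_se(X[:, None], X[None, :], h=h) + noise * np.eye(len(X))
    kstar = k_se(xstar[:, None], X[None, :], h=h)
    return kstar @ np.linalg.solve(K, y)

xstar = np.linspace(0, 1, 50)
# Marginalized posterior mean: average over the hyperparameter draws.
m = np.mean([posterior_mean(xstar, X, y, h) for h in lengthscale_samples], axis=0)
```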

Gaussian Process Model complexity grows with the data, and the approach is robust to overfitting.



Bayesian Optimization Bayesian optimization is the approach of probabilistically modelling the objective f(x, y) and using decision theory to make optimal use of computation. By defining the costs of observation and of uncertainty, we can select evaluations optimally by minimizing the expected loss with respect to a probability distribution. These costs represent the core components: the cost of evaluation and the degree of uncertainty.

Acquisition Function The acquisition function α(x) quantifies how valuable an evaluation at x is expected to be. It is evaluated on the GP rather than on the objective: since working with the GP is far less costly, we can find the acquisition function's global maximum and use that point as the next evaluation of the objective function.
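The overall loop might look like the following sketch; the grid maximization and the specific acquisition rule used here (a GP lower confidence bound) are illustrative choices of mine, not prescribed by these slides:

```python
import numpy as np

def k_se(x1, x2, A=1.0, h=0.2):
    return A * np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

def gp_posterior(xstar, X, y, noise=1e-6):
    """Zero-mean GP posterior mean and variance at points xstar."""
    K = k_se(X[:, None], X[None, :]) + noise * np.eye(len(X))
    ks = k_se(xstar[:, None], X[None, :])
    mean = ks @ np.linalg.solve(K, y)
    var = k_se(xstar, xstar) - np.sum(ks * np.linalg.solve(K, ks.T).T, axis=1)
    return mean, np.maximum(var, 0)

f = lambda x: np.sin(3 * x) + 0.5 * x   # toy stand-in for an expensive objective
grid = np.linspace(0, 2, 200)           # cheap search space for the acquisition
X = np.array([0.3, 1.7])
y = f(X)

for _ in range(10):
    mean, var = gp_posterior(grid, X, y)
    alpha = -(mean - 2.0 * np.sqrt(var))  # LCB: favour low mean and high uncertainty
    x_next = grid[np.argmax(alpha)]       # maximize acquisition on the GP, not on f
    X, y = np.append(X, x_next), np.append(y, f(x_next))
```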

Entropy Search Optimization is viewed as gaining knowledge about the location of the global minimum. The belief about the location of the global minimum of the objective is represented as a probability distribution p(x*), the probability that x* = argmin_x f(x). Entropy search selects points to maximize the expected information gain about x* (equivalently, to move p(x*) as far as possible, in relative entropy, from the uninformative uniform distribution): x_{n+1} = argmax_x ( H[p(x* | D_n)] − E_y[ H[p(x* | D_n, x, y)] ] ), where H[p] = −Σ_i p_i log p_i is the entropy.
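To make p(x*) concrete, the sketch below (an illustration of the representation, not of the full entropy search computation) estimates p(x*) on a grid by drawing sample functions from a GP and recording where each draw attains its minimum, then evaluates the entropy H[p]:

```python
import numpy as np

# Discretize the domain and estimate p(x*) by sampling GP draws.
xs = np.linspace(0, 1, 100)
K = np.exp(-0.5 * (xs[:, None] - xs[None, :]) ** 2 / 0.1 ** 2) + 1e-8 * np.eye(100)
draws = np.random.multivariate_normal(np.zeros(100), K, size=2000)

# p(x*): empirical distribution of the minimizer's location over the draws.
p_star = np.bincount(np.argmin(draws, axis=1), minlength=100) / 2000.0

# Entropy H[p] = -sum_i p_i log p_i (terms with p_i = 0 contribute nothing).
H = -np.sum(p_star[p_star > 0] * np.log(p_star[p_star > 0]))
print(H)
```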

Predictive Entropy Search The acquisition function is the expected information gain about the observation at x_{n+1} from knowing the true location of the global minimum, taken in expectation over x*: α(x_{n+1}) = H[y_{n+1} | D_n, x_{n+1}] − E_{x*}[ H[y_{n+1} | D_n, x_{n+1}, x*] ].

Loss The loss function is the lowest function value found after the algorithm ends. We take a myopic approximation and consider only the next evaluation: the expected loss is then the expected lowest value of the function we have evaluated after the next iteration.

Myopic Loss Consider only one evaluation remaining: the loss of returning value y when the current lowest value is µ is λ(y) = min(y, µ).

Expected Loss The expected loss is the expected lowest value after the candidate evaluation, Λ(x) = E[min(y, µ)], under the predictive distribution for y at x.

Expected Loss Use a Gaussian process as the probability distribution for the objective function: the predictive distribution at each candidate point is then Gaussian, and the expected loss is available in closed form.
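A sketch of that closed form, which follows from the partial expectation of a Gaussian (the formula and names here are my own addition, not shown on the slides): for y ~ N(m, s²), with m and s the GP predictive mean and standard deviation, E[min(y, η)] = η + (m − η)Φ(z) − s φ(z) with z = (η − m)/s:

```python
import numpy as np
from scipy.stats import norm

def expected_loss(m, s, eta):
    """E[min(y, eta)] for y ~ N(m, s^2), with eta the current lowest value.

    Both a low predictive mean (exploitation) and a high predictive
    standard deviation (exploration) can reduce the expected loss.
    """
    z = (eta - m) / s
    return eta + (m - eta) * norm.cdf(z) - s * norm.pdf(z)

# Lower expected loss is better; compare an exploitative and an exploratory point.
print(expected_loss(m=0.1, s=0.05, eta=0.2))  # near the current best, low risk
print(expected_loss(m=0.5, s=1.00, eta=0.2))  # uncertain region, possible big win
```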

Expected Loss An exploitative step: the expected loss is minimized near the current best observation, where the predictive mean is low.

Expected Loss An exploratory step: the expected loss is minimized in a region of high predictive uncertainty.


Noise Using only a subset of the data gives a noisy likelihood evaluation. Bayesian optimization can therefore be used for stochastic learning.

Noise Within Bayesian optimization, noise is not a problem: if there is additional noise in the random variable, we can simply add a noise likelihood to complement the model. We can also encode the cost of an evaluation as a function of the size of the data subset, so that the optimizer intelligently chooses, at runtime, the subset size that best serves the optimization.
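In GP terms, adding a Gaussian noise likelihood just means adding a noise variance to the diagonal of the kernel matrix; a minimal sketch (the data and noise level are illustrative):

```python
import numpy as np

def k_se(x1, x2, A=1.0, h=0.2):
    return A * np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

X = np.array([0.1, 0.5, 0.9])
y_noisy = np.array([0.4, -0.2, 0.3])  # noisy likelihood evaluations
noise_var = 0.05                      # variance of the noise likelihood

# Posterior mean under the Gaussian noise likelihood: use K + sigma_n^2 I.
K = k_se(X[:, None], X[None, :]) + noise_var * np.eye(len(X))
xstar = np.linspace(0, 1, 100)
mean = k_se(xstar[:, None], X[None, :]) @ np.linalg.solve(K, y_noisy)
```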

Bayesian Optimization over batch size: Klein, Falkner, Bartels, Hennig, Hutter (2017); McLeod, Osborne, Roberts (2017).


Integration Normally we want integration rather than optimization: average over the calculated parameters and functions, weighted by their likelihoods. This reduces the uncertainty in the calculated functions. Bayesian quadrature provides the probabilistic treatment of numerical integration.
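A minimal Bayesian quadrature sketch (my own illustration, not taken from the slides): model the integrand with a GP under the squared exponential kernel, use the fact that the kernel's integral against a uniform measure on [a, b] has a closed form via the error function, and read off the posterior mean of the integral as zᵀ K⁻¹ y:

```python
import numpy as np
from scipy.special import erf

A, h = 1.0, 0.3
a, b = 0.0, 1.0

def k_se(x1, x2):
    return A * np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

def kernel_integral(xi):
    """z_i = integral over [a, b] of k_se(x, x_i) dx, in closed form."""
    c = h * np.sqrt(np.pi / 2)
    return A * c * (erf((b - xi) / (h * np.sqrt(2))) - erf((a - xi) / (h * np.sqrt(2))))

f = lambda x: np.sin(3 * x) + 1.0  # toy stand-in for an expensive integrand
X = np.linspace(a, b, 8)           # evaluation nodes
y = f(X)

K = k_se(X[:, None], X[None, :]) + 1e-8 * np.eye(len(X))
z = kernel_integral(X)
estimate = z @ np.linalg.solve(K, y)  # posterior mean of the integral
print(estimate)                       # close to (1 - cos(3))/3 + 1
```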


Model The probabilistic model propagates uncertainty through the computation.

Model The model converges as more computation is spent.


Papers
McLeod, M., Osborne, M. A., & Roberts, S. J. (2017). Practical Bayesian Optimization for Variable Cost Objectives. arxiv.org/abs/1703.04335
Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., & Roberts, S. J. (2014). Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Advances in Neural Information Processing Systems (NIPS).