Doubly Stochastic Inference for Deep Gaussian Processes
Hugh Salimbeni, Department of Computing, Imperial College London
Amazon Berlin, 29/5/2017

Motivation
- DGPs promise much, but are difficult to train.
- Fully factorised VI doesn't work well.
- We seek a variational approach that works and scales.
- Other recently proposed schemes [1, 2, 5] make additional approximations and require more machinery than VI.

Talk outline
1. Summary: Model, Inference, Results
2. Details: Model, Inference, Results
3. Questions

Model
- We use the standard DGP model, with one addition: we include a linear (identity) mean function for all the internal layers (1D example in [4]).

Inference
- We use the model conditioned on the inducing points as the conditional variational posterior.
- We impose Gaussians on the inducing points (independent between layers, but full rank within layers).
- We use sampling to deal with the intractable expectation.
- We never compute N × N matrices (we make no additional simplifications to the variational posterior).

Results
- We show significant improvement over single-layer models on large (~10^6) and massive (~10^9) data.
- Big jump in improvement over a single-layer GP with 5× the number of inducing points.
- On small data we never do worse than the single-layer model, and we often do better.
- We get 98.1% accuracy on MNIST with only 100 inducing points.
- We surpass all permutation-invariant methods on rectangles-images (a benchmark designed to test deep vs shallow architectures).
- Identical model/inference hyperparameters for all our models.

Details: The Model
- We use the standard DGP model, with a linear mean function for all the internal layers: if the dimensions agree we use the identity, otherwise a PCA projection (a sketch follows below).
- Sensible alternative: initialise the latents to the identity (but the linear mean function works better).
- Not-so-sensible alternative: random initialisation. This doesn't work well (the posterior is (very) multimodal).
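To make the mean-function choice concrete, below is a minimal NumPy sketch; it is not taken from the paper's code, and the function name `linear_mean_function` and the exact PCA construction are illustrative assumptions. It returns the identity when the layer's input and output dimensions agree, and otherwise projects onto the top principal directions of the layer inputs.

```python
import numpy as np

def linear_mean_function(X, d_out):
    """Weights W for a linear mean function m(x) = x @ W (illustrative sketch).

    Identity when the input and output dimensions agree; otherwise the top
    d_out principal directions of the layer inputs X (a PCA projection).
    """
    d_in = X.shape[1]
    if d_in == d_out:
        return np.eye(d_in)
    X_centred = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)  # rows = principal directions
    return Vt[:d_out].T                                       # shape (d_in, d_out)

# Example: a 5-dimensional input feeding a 2-dimensional hidden layer
X = np.random.randn(100, 5)
W = linear_mean_function(X, d_out=2)
hidden_mean = X @ W
```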

The DGP: Graphical Model
[Figure: graphical model of a three-layer DGP. Inputs X feed f^1 (inducing inputs Z^0, inducing outputs u^1), producing hidden layer h^1; h^1 feeds f^2 (Z^1, u^2), producing h^2; h^2 feeds f^3 (Z^2, u^3), producing the outputs y.]

The DGP: Density
$$
p\big(y, \{h^l, f^l, u^l\}_{l=1}^{L}\big)
= \underbrace{\prod_{i=1}^{N} p\big(y_i \mid f_i^L\big)}_{\text{likelihood}}
\;\times\;
\underbrace{\prod_{l=1}^{L} p\big(h^l \mid f^l\big)\, p\big(f^l \mid u^l;\, h^{l-1}, Z^{l-1}\big)\, p\big(u^l;\, Z^{l-1}\big)}_{\text{DGP prior}}
$$
with $h^0 = X$.

Factorised Variational Posterior
[Figure: the DGP graphical model with a fully factorised variational posterior: Gaussians $\mathcal{N}(u^l \mid m^l, S^l)$ on the inducing outputs and independent Gaussians $\prod_i \mathcal{N}(h^l_i \mid \mu^l_i, (\sigma^l_i)^2)$ on the hidden-layer values.]

Our Variational Posterior
[Figure: the same graphical model, but with Gaussians $\mathcal{N}(u^l \mid m^l, S^l)$ only on the inducing outputs; the layers $f^l$ and $h^l$ keep the model's own conditionals.]

Recap: GPs for Big Data [3]
$$q(f, u) = p(f \mid u;\, X, Z)\, \mathcal{N}(u \mid m, S)$$
Marginalise $u$ from the variational posterior:
$$\int p(f \mid u;\, X, Z)\, \mathcal{N}(u \mid m, S)\, du = \mathcal{N}(f \mid \mu, \Sigma) =: q(f \mid m, S;\, X, Z) \tag{1}$$
Define the following mean and covariance functions:
$$\mu_{m,Z}(x_i) = m(x_i) + a(x_i)^T \big(m - m(Z)\big),$$
$$\Sigma_{S,Z}(x_i, x_j) = k(x_i, x_j) - a(x_i)^T \big(k(Z, Z) - S\big)\, a(x_j),$$
where $a(x_i) = k(x_i, Z)\, k(Z, Z)^{-1}$.
With these functions, $[\mu]_i = \mu_{m,Z}(x_i)$ and $[\Sigma]_{ij} = \Sigma_{S,Z}(x_i, x_j)$. A sketch of these computations follows below.
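As a concrete illustration of these formulas, here is a minimal NumPy sketch with an RBF kernel and variable names of my own choosing (the code linked later in the talk differs in detail). It computes only the marginal means and variances of q(f), so no N × N matrix is ever formed.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(A, B); an illustrative choice of kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def q_f_marginals(X, Z, m, S, mean_fn=lambda x: np.zeros(len(x))):
    """Marginal mean and variance of q(f | m, S; X, Z), eq. (1).

    mu_i     = m(x_i) + a(x_i)^T (m - m(Z))
    Sigma_ii = k(x_i, x_i) - a(x_i)^T (k(Z, Z) - S) a(x_i),
    where a(x_i) = k(x_i, Z) k(Z, Z)^{-1}.
    """
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))       # jitter for stability
    A = np.linalg.solve(Kzz, rbf(X, Z).T).T       # rows are a(x_i)^T
    mu = mean_fn(X) + A @ (m - mean_fn(Z))
    k_diag = rbf(X[:1], X[:1])[0, 0]              # k(x, x) is constant for an RBF kernel
    var = k_diag - np.einsum('nm,mk,nk->n', A, Kzz - S, A)
    return mu, var

# Example: 5 inducing points, 100 test inputs, 1D
Z = np.linspace(-2, 2, 5)[:, None]
X = np.linspace(-3, 3, 100)[:, None]
m, S = np.zeros(5), 0.1 * np.eye(5)               # variational parameters
mu, var = q_f_marginals(X, Z, m, S)
```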

Recap: GPs for Big Data [3], continued
Key idea: the $f_i$ marginals of $q(f, u) = p(f \mid u;\, X, Z)\, \mathcal{N}(u \mid m, S)$ depend only on the corresponding inputs $x_i$ (and the variational parameters).

Our Variational Posterior
[Figure repeated from above: Gaussians on the inducing outputs only; all other conditionals are taken from the model.]

Our approach
- We can marginalise all the $u^l$ from our posterior.
- The result for the $l$th layer is $q(f^l \mid m^l, S^l;\, h^{l-1}, Z^{l-1})$.
- The fully coupled (both between and within layers) variational posterior is
$$q\big(\{h^l, f^l\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(h^l \mid f^l\big)\, q\big(f^l \mid m^l, S^l;\, h^{l-1}, Z^{l-1}\big).$$

But what about the $i$th marginals?
Since at each layer the $i$th marginal depends only on the $i$th component of the layer below, we have
$$q\big(\{f_i^l, h_i^l\}_{l=1}^{L}\big) = \prod_{l=1}^{L} p\big(h_i^l \mid f_i^l\big)\, q\big(f_i^l \mid m^l, S^l;\, h_i^{l-1}, Z^{l-1}\big). \tag{2}$$

The lower bound
Since our variational posterior matches the model everywhere except at the inducing points, the bound is
$$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(\{f_i^l, h_i^l\}_{l=1}^{L})}\big[\log p(y_i \mid f_i^L)\big] \;-\; \sum_{l=1}^{L} \mathrm{KL}\big(q(u^l)\,\|\,p(u^l)\big).$$
The analytic marginalisation of all the inner layers needed for $q(f_i^L)$ is intractable, but we can draw samples ancestrally. Each KL term is between Gaussians and has a closed form (a sketch follows below).
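Both $q(u^l)$ and $p(u^l)$ are Gaussian, so each KL term has the standard closed form. Below is a minimal NumPy sketch under the assumption of a zero prior mean (with a non-zero prior mean function, replace m with m - m(Z)); the function name is illustrative.

```python
import numpy as np

def gauss_kl(m, S, K):
    """KL( N(m, S) || N(0, K) ) for one layer's inducing outputs.

    Standard multivariate-Gaussian KL; zero prior mean assumed for brevity.
    """
    M = len(m)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, m)                 # so that alpha @ alpha = m^T K^{-1} m
    logdet_K = 2.0 * np.sum(np.log(np.diag(L)))
    logdet_S = np.linalg.slogdet(S)[1]
    trace_term = np.trace(np.linalg.solve(K, S))  # tr(K^{-1} S)
    return 0.5 * (trace_term + alpha @ alpha - M + logdet_K - logdet_S)

# Example with a toy positive-definite prior covariance (in practice K = k(Z, Z))
K = np.eye(5) + 0.1
kl = gauss_kl(np.zeros(5), 0.1 * np.eye(5), K)
```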

Sampling from the variational posterior
- Each layer is Gaussian, given the layer below.
- We draw samples using unit Gaussians $\epsilon \sim \mathcal{N}(0, 1)$ at each layer.
- For $\hat{f}_i^l$, take the mean and variance from $q(f_i^l \mid m^l, S^l;\, \hat{h}_i^{l-1}, Z^{l-1})$:
$$\hat{f}_i^l = \mu_{m^l, Z^{l-1}}\big(\hat{h}_i^{l-1}\big) + \epsilon\,\sqrt{\Sigma_{S^l, Z^{l-1}}\big(\hat{h}_i^{l-1}, \hat{h}_i^{l-1}\big)},$$
where for the first layer $\hat{h}_i^0 = x_i$.
- To get $\hat{h}_i^l$, just add noise to $\hat{f}_i^l$.
- The whole sampling process is differentiable with respect to the variational parameters (a sketch follows below).
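Below is a minimal NumPy sketch of this ancestral sampling for a toy one-dimensional DGP, again with an RBF kernel and zero mean functions (the actual model uses the linear mean functions described earlier); all names and settings here are illustrative, not the paper's code.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; an illustrative choice."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def layer_marginals(H, Z, m, S):
    """Per-point mean and variance of q(f^l | m^l, S^l; h^{l-1}, Z^{l-1})."""
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
    A = np.linalg.solve(Kzz, rbf(H, Z).T).T                 # rows are a(h_i)^T
    mu = A @ m                                              # zero prior mean assumed
    var = rbf(H[:1], H[:1])[0, 0] - np.einsum('nm,mk,nk->n', A, Kzz - S, A)
    return mu, np.maximum(var, 1e-12)

def sample_dgp(X, layers, noise_var=1e-2, seed=0):
    """Reparameterised ancestral sample of f_i^L for every input x_i.

    `layers` is a list of (Z, m, S) tuples, one per layer; 1D layers only,
    for clarity. hat{h}^0_i = x_i, and noise is added between layers.
    """
    rng = np.random.default_rng(seed)
    H = X
    for l, (Z, m, S) in enumerate(layers):
        mu, var = layer_marginals(H, Z, m, S)
        eps = rng.standard_normal(len(H))                   # epsilon ~ N(0, 1)
        F = mu + eps * np.sqrt(var)                         # hat{f}^l_i
        if l < len(layers) - 1:                             # inner layers: add noise
            H = (F + np.sqrt(noise_var) * rng.standard_normal(len(F)))[:, None]
    return F

# Example: a 2-layer DGP on 1D inputs, 5 inducing points per layer
X = np.linspace(-3, 3, 50)[:, None]
layers = [(np.linspace(-2, 2, 5)[:, None], np.zeros(5), 0.1 * np.eye(5))
          for _ in range(2)]
f_L_sample = sample_dgp(X, layers)
```

Because the sample is an explicit, differentiable function of the variational parameters $(m^l, S^l)$ and the noise $\epsilon$, gradients of a Monte Carlo estimate of the bound flow back to those parameters (the reparameterisation trick).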

The second source of stochasticity
- We sample the bound in minibatches.
- This gives linear scaling in N.
- It can be used when only streaming is possible (e.g. ~50 GB datasets); a sketch of the resulting estimator follows below.
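A minimal sketch of how the two sources of stochasticity combine into an unbiased estimate of the bound, assuming per-point Monte Carlo log-likelihood estimates are already available (e.g. from ancestral samples as above); the function name and numbers are illustrative.

```python
import numpy as np

def minibatch_elbo_estimate(loglik_batch, N, kl_terms):
    """Doubly stochastic estimate of the bound.

    loglik_batch: Monte Carlo estimates of E[log p(y_i | f_i^L)] for the
                  points in the current minibatch (first source of noise).
    N:            total number of training points.
    kl_terms:     KL(q(u^l) || p(u^l)), one value per layer.

    The likelihood term is rescaled by N / |B|, so the estimate is unbiased
    with respect to the minibatch choice (second source of noise).
    """
    B = len(loglik_batch)
    return (N / B) * np.sum(loglik_batch) - np.sum(kl_terms)

# Illustrative numbers only
elbo_hat = minibatch_elbo_estimate(np.array([-1.2, -0.8, -1.5]),
                                   N=1_000_000, kl_terms=[3.1, 2.4])
```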

Inference recap
- We use the full model as the variational posterior, conditioned on the inducing points.
- We use Gaussians for the inducing points.
- The lower bound requires only the posterior marginals.
- We estimate the bound with Monte Carlo samples from the posterior marginals.

Results (1): UCI
[Figure: results on the UCI benchmark datasets.]

Code Demo
https://github.com/icl-sml/doubly-stochastic-dgp/blob/master/demos/demo_regression_uci.ipynb

Results (2): Large and Massive Data

Test RMSE:

            N        D    SGP     SGP 500   DGP 2   DGP 3   DGP 4   DGP 5
  year      463810   90   10.67   9.89      9.58    8.98    8.93    8.87
  airline   700K     8    25.6    25.1      24.6    24.3    24.2    24.1
  taxi      1B       9    337.5   330.7     281.4   270.4   268.0   266.4

Thanks for listening. Questions?

References
[1] T. D. Bui, D. Hernández-Lobato, Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Deep Gaussian Processes for Regression using Approximate Expectation Propagation. ICML, 2016.
[2] K. Cutajar, E. V. Bonilla, P. Michiardi, and M. Filippone. Practical Learning of Deep Gaussian Processes via Random Fourier Features. arXiv preprint arXiv:1610.04386, 2016.
[3] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. Uncertainty in Artificial Intelligence, pages 282-290, 2013.
[4] M. Lázaro-Gredilla. Bayesian Warped Gaussian Processes. Advances in Neural Information Processing Systems, pages 1619-1627, 2012.
[5] Y. Wang, M. Brubaker, B. Chaib-Draa, and R. Urtasun. Sequential Inference for Deep Gaussian Process. Artificial Intelligence and Statistics, 2016.