Afternoon Meeting on Bayesian Computation 2018 University of Reading

Transcription:

Gabriele Abbati (1), Alessandra Tosi (2), Seth Flaxman (3), Michael A. Osborne (1). (1) University of Oxford, (2) Mind Foundry Ltd, (3) Imperial College London. Afternoon Meeting on Bayesian Computation 2018, University of Reading.

High-dimensional Problems. Both gradient-based optimization and MCMC suffer from issues arising from high dimensionality: non-convexity, strong correlations, and multimodality.

Related Work. Gradient-based optimization: AdaGrad, AdaDelta, Adam, RMSProp. MCMC: Hamiltonian Monte Carlo, Particle Monte Carlo, stochastic gradient Langevin dynamics. All of these methods focus on computing clever updates for optimization algorithms or for Markov chains. Novelty: to the best of our knowledge, no dimensionality reduction approach has been applied in this direction before.

The Manifold Idea. After t steps of optimization or sampling, we assume the points obtained in the parameter space to be lying on a manifold. We then feed them to a dimensionality reduction method to find a lower-dimensional representation. 3D example: if the sampler or optimizer keeps returning proposals on the surface of a sphere, that information might be used to our advantage. Can we perform better if the algorithm acts with knowledge of the manifold?

Latent Variable Models. Latent variable models describe a set $\Theta$ through a lower-dimensional latent set $\Omega$ via a map $f$: $\Theta = \{\theta_1, \dots, \theta_N\} \subset \mathbb{R}^D$, $\Omega = \{\omega_1, \dots, \omega_N\} \subset \mathbb{R}^Q$, with $Q < D$, where $\theta$ are the observed variables/parameters, $\omega$ the latent variables, $f$ the mapping, and $D$, $Q$ the dimensionalities of $\Theta$ and $\Omega$ respectively.

Latent Variable Models. Latent variable models describe a set $\Theta = \{\theta_1, \dots, \theta_N\} \subset \mathbb{R}^D$ through a lower-dimensional latent set $\Omega = \{\omega_1, \dots, \omega_N\} \subset \mathbb{R}^Q$ with $Q < D$, via the mapping $\theta = f(\omega) + \eta$ with $\eta \sim \mathcal{N}(0, \beta^{-1} I)$. Dimensionality reduction as manifold identification: the lower-dimensional manifold on which the samples lie is characterized through the latent set.

Gaussian Process Latent Variable Model. The choice of dimensionality reduction method fell on the Gaussian Process Latent Variable Model (GPLVM) [1], which places a Gaussian Process prior over the mapping $f$ in $\theta = f(\omega) + \eta$. Motivation: an analytically sound mathematical tool; a full distribution over the mapping $f$; a full distribution over the derivatives of the mapping $f$. [1] Lawrence, N., Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research (2005).

Gaussian Process. A Gaussian Process [2] is a collection of random variables, any finite number of which have a joint Gaussian distribution. If a real-valued stochastic process $f$ is a GP, it is denoted $f(\cdot) \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$. A Gaussian Process is fully specified by a mean function $m(\cdot)$ and a covariance function $k(\cdot, \cdot)$, where $m(\omega) = \mathbb{E}[f(\omega)]$ and $k(\omega, \omega') = \mathbb{E}[(f(\omega) - m(\omega))(f(\omega') - m(\omega'))]$. (Figure: GP regression example showing training data, the GP fit, and the real function.) [2] Rasmussen, C. E., Williams, C. K. I., Gaussian Processes for Machine Learning, The MIT Press (2006).
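To make the GP machinery concrete, here is a minimal NumPy sketch of GP regression with an RBF kernel; the lengthscale, variance and noise values are illustrative choices, not values from the talk.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF covariance k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy usage: noisy observations of a sine function.
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
mu, Sigma = gp_posterior(X, y, np.linspace(0, 5, 100)[:, None])
```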

Gaussian Process Latent Variable Model. GPLVM: a Gaussian Process prior over the mapping $f$ in $\theta = f(\omega) + \eta$. The likelihood of the data $\Theta$ given the latent $\Omega$ is obtained by (1) marginalizing the mapping and (2) optimizing the latent variables. Resulting likelihood: $p(\Theta \mid \Omega, \beta) = \prod_{j=1}^{D} \mathcal{N}(\theta_{:,j} \mid 0, K + \beta^{-1} I) = \prod_{j=1}^{D} \mathcal{N}(\theta_{:,j} \mid 0, \tilde{K})$, with $\tilde{K} = K + \beta^{-1} I$ and the resulting noise model $\theta_{i,j} = k(\omega_i, \Omega)\, \tilde{K}^{-1} \Theta_{:,j} + \eta_j$.
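As an illustration of the likelihood above, the sketch below evaluates $\log p(\Theta \mid \Omega, \beta)$ column by column with an RBF kernel on the latent points. It is only a sketch: a full GPLVM would optimize $\Omega$ and the kernel hyperparameters jointly, and the default values here are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gplvm_log_likelihood(Theta, Omega, beta=100.0, lengthscale=1.0, variance=1.0):
    """log p(Theta | Omega, beta) = sum_j log N(theta_:,j | 0, K + (1/beta) I),
    with K the RBF kernel evaluated on the latent points Omega."""
    N, D = Theta.shape
    sq = np.sum(Omega**2, 1)[:, None] + np.sum(Omega**2, 1)[None, :] - 2 * Omega @ Omega.T
    K = variance * np.exp(-0.5 * sq / lengthscale**2) + np.eye(N) / beta
    return sum(multivariate_normal.logpdf(Theta[:, j], mean=np.zeros(N), cov=K)
               for j in range(D))
```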

Gaussian Process Latent Variable Model. GPLVM: a Gaussian Process prior over the mapping $f$ in $\theta = f(\omega) + \eta$. For differentiable kernels $k(\cdot, \cdot)$, the Jacobian $J$ of the mapping $f$, $J_{ij} = \partial f_i / \partial \omega_j$, can be computed analytically. Moreover, as noted above, the GPLVM yields the full (Gaussian) distribution over the Jacobian. If the rows of $J$ are assumed to be independent: $p(J \mid \Omega, \beta) = \prod_{i=1}^{D} \mathcal{N}\!\left( J_{i,:} \mid \mu_{J_{i,:}}, \Sigma_J \right)$.
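A minimal sketch of the mean of this Jacobian for an RBF kernel, obtained by differentiating the GP posterior mean $k(\omega^*, \Omega)\,\tilde{K}^{-1}\Theta$ with respect to $\omega^*$; the hyperparameter defaults are again placeholders, and the function name is mine, not the authors'.

```python
import numpy as np

def mean_jacobian(omega_star, Omega, Theta, beta=100.0, lengthscale=1.0, variance=1.0):
    """Mean Jacobian mu_J (D x Q) of the GP mapping f at omega_star,
    i.e. the derivative of k(omega*, Omega) Ktilde^-1 Theta w.r.t. omega*."""
    N, Q = Omega.shape
    sq = np.sum(Omega**2, 1)[:, None] + np.sum(Omega**2, 1)[None, :] - 2 * Omega @ Omega.T
    Ktilde = variance * np.exp(-0.5 * sq / lengthscale**2) + np.eye(N) / beta
    alpha = np.linalg.solve(Ktilde, Theta)                          # (N, D)
    diff = omega_star[None, :] - Omega                              # (N, Q)
    k_star = variance * np.exp(-0.5 * np.sum(diff**2, 1) / lengthscale**2)  # (N,)
    dk = -(diff / lengthscale**2) * k_star[:, None]                 # (N, Q): d k(omega*, omega_i)/d omega*
    return alpha.T @ dk                                             # (D, Q): J[d, q] = d m_d / d omega_q
```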

Recap. (1) After t iterations, the optimization or sampling algorithm has yielded a set of observed points $\Theta = \{\theta_1, \dots, \theta_N\} \subset \mathbb{R}^D$ in the parameter space. (2) A GPLVM is trained on $\Theta$ in order to build a latent space $\Omega$ that describes the lower-dimensional manifold on which the optimization/sampling is allegedly taking place. We can then: move from the latent space $\Omega$ to the observed space $\Theta$ via $\theta = f(\omega) + \eta$ (but not vice versa, since $f$ is not invertible); and bring the gradients of a generic function $g: \Theta \to \mathbb{R}$ from the observed space $\Theta$ to the latent space $\Omega$ via $\nabla_\omega g(f(\omega)) = \mu_J^{\top} \nabla_\theta g(\theta)$. In this case a point estimate of $J$ is given by the mean of its distribution.

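The two recap operations can be written in a few lines. The sketch below, building on mean_jacobian above (function and variable names are mine, not the talk's), maps a latent point forward through the GP posterior mean and pulls a parameter-space gradient back into the latent space.

```python
import numpy as np

def latent_to_observed(omega, Omega, Theta, beta=100.0, lengthscale=1.0, variance=1.0):
    """Map a latent point to parameter space via the GP posterior mean (noise-free)."""
    N = Omega.shape[0]
    sq = np.sum(Omega**2, 1)[:, None] + np.sum(Omega**2, 1)[None, :] - 2 * Omega @ Omega.T
    Ktilde = variance * np.exp(-0.5 * sq / lengthscale**2) + np.eye(N) / beta
    diff = omega[None, :] - Omega
    k_star = variance * np.exp(-0.5 * np.sum(diff**2, 1) / lengthscale**2)
    return k_star @ np.linalg.solve(Ktilde, Theta)                  # (D,)

def pullback_gradient(grad_theta, mu_J):
    """Bring a gradient from parameter space to latent space: grad_omega = mu_J^T grad_theta."""
    return mu_J.T @ grad_theta                                      # (Q,)
```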

AdaGeo Gradient-based Optimization. Minimization problem: $\theta^* = \arg\min_\theta g(\theta)$. Iterative scheme (e.g. (stochastic) gradient descent): $\theta_{t+1} = \theta_t - \Delta\theta_t(\nabla_\theta g)$, where $\Delta\theta_t(\cdot)$ is the update computed by the optimizer from the gradient. We propose, after having learned a latent representation with the GPLVM, to move the problem onto the latent space $\Omega$. Minimization problem: $\omega^* = \arg\min_\omega g(f(\omega))$. Iterative scheme (e.g. (stochastic) gradient descent): $\omega_{t+1} = \omega_t - \Delta\omega_t(\nabla_\omega g)$.

AdaGeo Gradient-based Optimization. Algorithm 1: AdaGeo gradient-based optimization (minimization).
1: while convergence is not reached do
2:   Perform $T_\theta$ iterations with classic updates on the parameter space $\Theta$: compute $\Delta\theta_t$ from $\nabla_\theta g(\theta_t)$ and set $\theta_{t+1} = \theta_t - \Delta\theta_t$.
3:   Train the GPLVM on the samples $\Theta = \{\theta_1, \dots, \theta_{T_\theta}\}$.
4:   Continue with $T_\omega$ iterations using the AdaGeo optimizer: compute $\Delta\omega_t$ from $\nabla_\omega g(f(\omega_t))$, set $\omega_{t+1} = \omega_t - \Delta\omega_t$, and move back to the parameter space with $\theta_{t+1} = f(\omega_{t+1})$.
5: end while
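A compact Python sketch of this loop follows, only as an illustration of Algorithm 1 under simplifying assumptions: plain gradient descent in both spaces, latent gradients obtained through the pullback $\nabla_\omega g = \mu_J^\top \nabla_\theta g$, and a hypothetical fit_gplvm callback (a stand-in for whatever GPLVM implementation is used) that returns the latent points together with the forward map and the mean Jacobian.

```python
import numpy as np

def adageo_minimize(g, grad_g, theta0, fit_gplvm, step=1e-2,
                    T_theta=20, T_omega=30, n_outer=5):
    """Sketch of Algorithm 1: alternate classic gradient steps in parameter space
    with gradient steps in a GPLVM latent space.

    fit_gplvm(samples) is assumed to return (Omega, f_map, jac_map), where Omega
    are the latent points, f_map(omega) maps a latent point back to parameter
    space, and jac_map(omega) returns the mean Jacobian mu_J (D x Q)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_outer):
        # Phase 1: classic updates in the parameter space.
        samples = []
        for _ in range(T_theta):
            theta = theta - step * grad_g(theta)
            samples.append(theta.copy())
        # Phase 2: learn the latent space and continue the descent there.
        Omega, f_map, jac_map = fit_gplvm(np.array(samples))
        omega = Omega[-1]                                   # start from the latest latent point
        for _ in range(T_omega):
            grad_omega = jac_map(omega).T @ grad_g(f_map(omega))  # pullback of the gradient
            omega = omega - step * grad_omega
        theta = f_map(omega)                                # move back to the parameter space
    return theta
```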

Experiment: Logistic Regression on MNIST. A single-layer neural network implementing logistic regression on MNIST, trained with SGD. Dimension of parameter space: D = 7850; dimension of latent space: Q = 9; iterations: $T_\theta = 20$, $T_\omega = 30$. (Figure: NN training loss versus number of epochs, comparing SGD and AdaGeo-SGD.)

Experiment: Gaussian Process Training. Concrete compressive strength dataset [3]: a regression task with 8 real variables, fitted with a composite kernel (RBF, Matérn, linear and bias). Dimension of parameter space: D = 9; dimension of latent space: Q = 3; iterations: $T_\theta = 15$, $T_\omega = 15$. (Figure: GP negative log-likelihood versus number of iterations, comparing gradient descent and AdaGeo-gradient descent.) [3] Lichman, M., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2013).

AdaGeo Bayesian Sampling. Bayesian sampling framework: a dataset $X = \{x_1, \dots, x_N\}$ is given, and $X$ is modeled with a generative model whose likelihood is $p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$, parameterized by the vector $\theta \in \mathbb{R}^D$, with prior $p(\theta)$. Performing statistical inference means getting insights on the posterior distribution $p(\theta \mid X) = p(X \mid \theta)\, p(\theta) / p(X)$, either analytically or approximately through sampling.

AdaGeo Bayesian Sampling. Bayesian sampling framework: a dataset $X = \{x_1, \dots, x_N\}$ is given, and $X$ is modeled with a generative model whose likelihood is $p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$, parameterized by the vector $\theta \in \mathbb{R}^D$, with prior $p(\theta)$. Unfortunately the denominator is often intractable. One possible approach is to approximate the integral $p(X) = \int_\Theta p(X, \theta)\, d\theta = \int_\Theta p(X \mid \theta)\, p(\theta)\, d\theta$ through Markov Chain Monte Carlo or similar methods.
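As a reminder of what "approximately through sampling" means in practice, here is a minimal random-walk Metropolis-Hastings sketch; it only needs the unnormalized log-posterior $\log p(X \mid \theta) + \log p(\theta)$, never $p(X)$. Function names and step size are illustrative, not from the talk.

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_samples=1000, step=0.1, rng=None):
    """Random-walk Metropolis-Hastings targeting exp(log_post), which only needs
    to be known up to the (intractable) normalizing constant p(X)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    current_lp = log_post(theta)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        proposal_lp = log_post(proposal)
        if np.log(rng.random()) < proposal_lp - current_lp:    # accept with prob min(1, ratio)
            theta, current_lp = proposal, proposal_lp
        samples.append(theta.copy())
    return np.array(samples)
```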

Stochastic Gradient Langevin Dynamics. Stochastic gradient Langevin dynamics [4] combines stochastic optimization and the physical concept of Langevin dynamics to build a posterior sampler. At each time $t$ a mini-batch of size $n$ is extracted and the parameters are updated as $\theta_{t+1} = \theta_t + \Delta\theta_t$, with $\Delta\theta_t = \frac{\epsilon_t}{2}\left( \nabla_\theta \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla_\theta \log p(x_{ti} \mid \theta_t) \right) + \eta_t$, $\eta_t \sim \mathcal{N}(0, \epsilon_t I)$, and with the learning rate $\epsilon_t$ satisfying $\sum_{t=1}^{\infty} \epsilon_t = \infty$ and $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$. [4] Welling, M., Teh, Y. W., Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML) (2011).
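A minimal SGLD sketch following the update rule above; the mini-batch handling, the polynomial step-size schedule and the function names are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, X, theta0, n_iters=1000,
         batch_size=32, eps0=1e-3, decay=0.55, rng=None):
    """Stochastic gradient Langevin dynamics with step size eps_t = eps0 * (t+1)^(-decay);
    any decay in (0.5, 1] satisfies the divergence/summability conditions."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(X)
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for t in range(n_iters):
        eps_t = eps0 * (t + 1) ** (-decay)
        batch = X[rng.choice(N, size=batch_size, replace=False)]
        grad = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(x, theta) for x in batch)
        noise = np.sqrt(eps_t) * rng.standard_normal(theta.shape)
        theta = theta + 0.5 * eps_t * grad + noise
        samples.append(theta.copy())
    return np.array(samples)
```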

AdaGeo - Stochastic Gradient Langevin Dynamics. Analogously to before, we propose to: (1) pick your favourite sampler and produce the first t samples to build the set $\Theta = \{\theta_1, \dots, \theta_N\}$; (2) train a GPLVM on $\Theta$ to learn the latent space $\Omega$; (3) move the updates onto the latent space with AdaGeo stochastic gradient Langevin dynamics: $\omega_{t+1} = \omega_t + \Delta\omega_t$, with $\Delta\omega_t = \frac{\epsilon_t}{2}\left( \nabla_\omega \log p(f(\omega_t)) + \frac{N}{n} \sum_{i=1}^{n} \nabla_\omega \log p(x_{ti} \mid f(\omega_t)) \right) + \eta_t$, $\eta_t \sim \mathcal{N}(0, \epsilon_t I)$.
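The sketch below adapts the SGLD code above to the latent space, using GPLVM maps as in the earlier sketches; f_map and jac_map are assumed to come from a fitted GPLVM and are placeholders for whatever implementation is used.

```python
import numpy as np

def adageo_sgld(grad_log_prior, grad_log_lik, X, omega0, f_map, jac_map,
                n_iters=1000, batch_size=32, eps0=1e-3, decay=0.55, rng=None):
    """AdaGeo-SGLD sketch: run the SGLD update in the latent space, pulling the
    parameter-space gradient back through the mean Jacobian mu_J of the GPLVM."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(X)
    omega = np.asarray(omega0, dtype=float)
    latent_samples, param_samples = [], []
    for t in range(n_iters):
        eps_t = eps0 * (t + 1) ** (-decay)
        theta = f_map(omega)                          # current point in parameter space
        batch = X[rng.choice(N, size=batch_size, replace=False)]
        grad_theta = grad_log_prior(theta) + (N / batch_size) * sum(
            grad_log_lik(x, theta) for x in batch)
        grad_omega = jac_map(omega).T @ grad_theta    # pullback: mu_J^T grad_theta
        noise = np.sqrt(eps_t) * rng.standard_normal(omega.shape)
        omega = omega + 0.5 * eps_t * grad_omega + noise
        latent_samples.append(omega.copy())
        param_samples.append(f_map(omega))
    return np.array(latent_samples), np.array(param_samples)
```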

Experiment: Sampling from the Banana Distribution. The banana distribution has the density $p(\theta) \propto \exp\!\left( -\frac{\theta_1^2}{200} - \frac{1}{2}\left(\theta_2 - b\theta_1^2 + 100b\right)^2 - \frac{1}{2}\sum_{j=3}^{D} \theta_j^2 \right)$. $\theta_1$ and $\theta_2$ present the banana-shaped interaction shown in the figure, while the other variables produce Gaussian noise. (Figure: samples in the $(\theta_1, \theta_2)$ plane.)
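A small sketch of the (unnormalized) log-density above and its gradient, which is all a sampler such as MH or SGLD needs; the bananicity b = 0.1 is only an illustrative choice, the talk does not state its value.

```python
import numpy as np

def banana_log_density(theta, b=0.1):
    """Unnormalized log p(theta) for the D-dimensional banana distribution."""
    theta = np.asarray(theta, dtype=float)
    twist = theta[1] - b * theta[0] ** 2 + 100.0 * b
    return -theta[0] ** 2 / 200.0 - 0.5 * twist ** 2 - 0.5 * np.sum(theta[2:] ** 2)

def banana_grad_log_density(theta, b=0.1):
    """Gradient of the unnormalized log-density with respect to theta."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    twist = theta[1] - b * theta[0] ** 2 + 100.0 * b
    grad[0] = -theta[0] / 100.0 + 2.0 * b * theta[0] * twist
    grad[1] = -twist
    grad[2:] = -theta[2:]
    return grad
```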

Experiment: Sampling from the Banana Distribution. A Metropolis-Hastings sampler returns the first 100 samples drawn from a 50-dimensional banana distribution; AdaGeo-SGLD is then employed to sample from a 5-dimensional latent space. (Figure: samples from the MH sampler and the AdaGeo sampler, shown in the $(\theta_1, \theta_2)$ and $(\theta_3, \theta_{15})$ planes.)

Bonus Round: Riemannian Extensions (theory only). If the covariance function of a Gaussian Process is differentiable, then it is straightforward to show that the mapping $f$ is also differentiable. Under this assumption we can compute the latent metric tensor $G$, which gives further information about the geometry of the latent space (distances, geodesics, etc.). If $J$ is the Jacobian of the mapping $f$, then $G = J^{\top} J$. This yields a distribution over the metric tensor [5]: $G \sim \mathcal{W}_Q\!\left(D, \Sigma_J, \mathbb{E}[J]^{\top}\mathbb{E}[J]\right)$, and a point estimate can be obtained with $\mathbb{E}\!\left[ J^{\top} J \right] = \mathbb{E}[J]^{\top}\mathbb{E}[J] + D\,\Sigma_J$. [5] Tosi, A., Hauberg, S., Vellido, A., Lawrence, N. D., Metrics for probabilistic geometries. Uncertainty in Artificial Intelligence (2014).
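The point estimate above is a one-liner given the GPLVM outputs; in the sketch below mu_J is the D x Q mean Jacobian and Sigma_J the shared Q x Q row covariance (variable names are mine).

```python
import numpy as np

def expected_metric(mu_J, Sigma_J):
    """Expected latent metric tensor E[G] = E[J]^T E[J] + D * Sigma_J."""
    D = mu_J.shape[0]
    return mu_J.T @ mu_J + D * Sigma_J
```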

Bonus Round: Stochastic Gradient Riemannian Langevin Dynamics. Stochastic gradient Riemannian Langevin dynamics [6] puts together the advantages of exploiting a known Riemannian geometry with the scalability of stochastic optimization approaches: $\theta_{t+1} = \theta_t + \Delta\theta_t$, $\Delta\theta_t = \frac{\epsilon_t}{2}\mu(\theta_t) + G^{-1/2}(\theta_t)\,\eta_t$, $\eta_t \sim \mathcal{N}(0, \epsilon_t I)$, where $\mu(\theta)_j = \left( G^{-1}(\theta)\left( \nabla_\theta \log p(\theta) + \frac{N}{n}\sum_{i=1}^{n} \nabla_\theta \log p(x_{ti} \mid \theta) \right) \right)_j - 2\sum_{k=1}^{D}\left( G^{-1}(\theta)\,\frac{\partial G(\theta)}{\partial \theta_k}\, G^{-1}(\theta) \right)_{jk} + \sum_{k=1}^{D}\left( G^{-1}(\theta) \right)_{jk} \mathrm{Tr}\!\left( G^{-1}(\theta)\,\frac{\partial G(\theta)}{\partial \theta_k} \right)$. [6] Patterson, S., Teh, Y. W., Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems (2013).

Bonus Round: AdaGeo - Stochastic Gradient Riemannian Langevin Dynamics. Analogously to the SGLD case, we can now move the update into the latent space with AdaGeo stochastic gradient Riemannian Langevin dynamics: $\omega_{t+1} = \omega_t + \Delta\omega_t$, $\Delta\omega_t = \frac{\epsilon_t}{2}\mu(\omega_t) + G_\omega^{-1/2}(\omega_t)\,\eta_t$, $\eta_t \sim \mathcal{N}(0, \epsilon_t I)$, where $\mu(\omega)_j = \left( G_\omega^{-1}(\omega)\left( \nabla_\omega \log p(f(\omega)) + \frac{N}{n}\sum_{i=1}^{n} \nabla_\omega \log p(x_{ti} \mid f(\omega)) \right) \right)_j - 2\sum_{k=1}^{Q}\left( G_\omega^{-1}(\omega)\,\frac{\partial G_\omega(\omega)}{\partial \omega_k}\, G_\omega^{-1}(\omega) \right)_{jk} + \sum_{k=1}^{Q}\left( G_\omega^{-1}(\omega) \right)_{jk} \mathrm{Tr}\!\left( G_\omega^{-1}(\omega)\,\frac{\partial G_\omega(\omega)}{\partial \omega_k} \right)$.

Conclusions. We develop a generic framework for combining dimensionality reduction techniques with sampling and optimization methods. We contribute to gradient-based optimization methods by coupling them with appropriate dimensionality reduction techniques; in particular, we improve the performance of gradient descent and stochastic gradient descent when training, respectively, a Gaussian Process and a neural network. We contribute to Markov Chain Monte Carlo by developing an AdaGeo version of stochastic gradient Langevin dynamics, in which the information gathered through the latent space is employed to compute the steps of the Markov chain. We extend the approach to stochastic gradient Riemannian Langevin dynamics, thanks to the metric tensor naturally recovered by the GPLVM.

Thank you. Reference: Abbati, G., Tosi, A., Flaxman, S., Osborne, M. A. (2018). AdaGeo: Adaptive Geometric Learning for Optimization and Sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), to appear.