AdaGeo: Adaptive Geometric Learning for Optimization and Sampling
Gabriele Abbati¹, Alessandra Tosi², Seth Flaxman³, Michael A. Osborne¹
¹ University of Oxford, ² Mind Foundry Ltd, ³ Imperial College London
Afternoon Meeting on Bayesian Computation 2018, University of Reading
High-dimensional Problems
Gradient-based optimization and MCMC both face issues arising from high dimensionality:
- non-convexity
- strong correlations
- multimodality
Related Work
Gradient-based optimization: AdaGrad, AdaDelta, Adam, RMSProp
MCMC: Hamiltonian Monte Carlo, particle Monte Carlo, stochastic gradient Langevin dynamics
All of these methods focus on computing clever updates for optimization algorithms or for Markov chains.
Novelty: to the best of our knowledge, no dimensionality-reduction approach has been applied in this direction before.
The Manifold Idea
After t steps of optimization or sampling, we assume the points obtained in the parameter space lie on a manifold. We then feed them to a dimensionality-reduction method to find a lower-dimensional representation.
3D example: if the sampler/optimizer keeps returning proposals on the surface of a sphere, that information can be used to our advantage.
Can we perform better if the algorithm acts with knowledge of the manifold?
Latent Variable Models
Latent variable models describe a set Θ through a lower-dimensional latent set Ω via a mapping f:
Θ = {θ_1, …, θ_N} ⊂ R^D,  Ω = {ω_1, …, ω_N} ⊂ R^Q,  with Q < D,
where:
- θ: observed variables/parameters
- ω: latent variables
- f: mapping from Ω to Θ
- D, Q: dimensionalities of Θ and Ω respectively
Latent Variable Models
Mapping: θ = f(ω) + η, with η ∼ N(0, β⁻¹ I).
- Dimensionality reduction
- Manifold identification: the lower-dimensional manifold on which the samples lie is characterized through the latent set.
Gaussian Process Latent Variable Model
As dimensionality-reduction method we choose the Gaussian Process Latent Variable Model (GPLVM) [1].
GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η.
Motivation:
- analytically sound mathematical tool
- full distribution over the mapping f
- full distribution over the derivatives of the mapping f
[1] Lawrence, N., Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research (2005)
Gaussian Process
A Gaussian Process [2] is a collection of random variables, any finite number of which have a joint Gaussian distribution. If a real-valued stochastic process f is a GP, it is denoted as
f(·) ∼ GP(m(·), k(·,·)).
A Gaussian Process is fully specified by a mean function m(·) and a covariance function k(·,·), where
m(ω) = E[f(ω)],  k(ω, ω′) = E[(f(ω) − m(ω))(f(ω′) − m(ω′))].
[Figure: GP regression example — training data, GP regression, real function]
[2] Rasmussen, C. E., Williams, C. K. I., Gaussian Processes for Machine Learning, the MIT Press (2006)
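A minimal numpy sketch of GP regression with an RBF kernel, to make the mean and covariance functions above concrete; the kernel choice, hyperparameters and function names are illustrative assumptions, not part of the slides.

```python
# Minimal GP regression sketch with an RBF kernel (illustrative only).
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and covariance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```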
Gaussian Process Latent Variable Model
GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η.
The likelihood of the data Θ given the latent Ω is obtained by
1. marginalizing the mapping f
2. optimizing the latent variables Ω
Resulting likelihood:
p(Θ | Ω, β) = ∏_{j=1}^{D} N(θ_{:,j} | 0, K + β⁻¹ I) = ∏_{j=1}^{D} N(θ_{:,j} | 0, K̃),
with the resulting noise model being
θ_{i,j} = k(ω_i, Ω) K̃⁻¹ Θ_{:,j} + η_j.
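A hedged sketch of fitting a GPLVM to the observed parameter set with the GPy library; the stand-in data, latent dimensionality and attribute names are assumptions made for illustration, and the API calls should be checked against the GPy documentation.

```python
# Hedged sketch: fitting a GPLVM to the observed parameter set Theta with GPy.
import numpy as np
import GPy

Theta = np.random.randn(100, 50)      # stand-in for N=100 samples in D=50 dimensions
Q = 3                                 # chosen latent dimensionality

kernel = GPy.kern.RBF(input_dim=Q, ARD=True)
gplvm = GPy.models.GPLVM(Theta, input_dim=Q, kernel=kernel)
gplvm.optimize(messages=False)        # maximizes the likelihood p(Theta | Omega, beta)

Omega = np.array(gplvm.X)             # learned latent representation (N x Q)
```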
Gaussian Process Latent Variable Model
GPLVM: Gaussian Process prior over the mapping f in θ = f(ω) + η.
For differentiable kernels k(·,·), the Jacobian J of the mapping f can be computed analytically:
J_{ij} = ∂f_i / ∂ω_j.
Moreover, as noted above, the GPLVM yields the full (Gaussian) distribution over the Jacobian. If the rows of J are assumed to be independent:
p(J | Ω, β) = ∏_{i=1}^{D} N(J_{i,:} | μ_{J_{i,:}}, Σ_J).
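A hedged numpy sketch of the posterior mean Jacobian μ_J of the GPLVM mapping at a latent point, assuming an RBF kernel; names, shapes and hyperparameters are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch: posterior mean of the Jacobian of the GPLVM mapping at omega_star.
import numpy as np

def rbf(A, B, ell=1.0, var=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return var * np.exp(-0.5 * d2 / ell**2)

def mean_jacobian(omega_star, Omega, Theta, ell=1.0, var=1.0, beta=100.0):
    """mu_J (D x Q): derivative of the GP posterior mean f(omega) at omega_star."""
    N, Q = Omega.shape
    K = rbf(Omega, Omega, ell, var) + np.eye(N) / beta
    k_star = rbf(omega_star[None, :], Omega, ell, var).ravel()        # (N,)
    # d k(omega_star, omega_n) / d omega_star = -k * (omega_star - omega_n) / ell^2
    dk = -(k_star[:, None] * (omega_star[None, :] - Omega)) / ell**2  # (N, Q)
    alpha = np.linalg.solve(K, Theta)                                 # (N, D)
    return (dk.T @ alpha).T                                           # (D, Q)
```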
Recap
1. After t iterations the optimization or sampling algorithm has yielded a set of observed points Θ = {θ_1, …, θ_N} ⊂ R^D in the parameter space.
2. A GPLVM is trained on Θ in order to build a latent space Ω that describes the lower-dimensional manifold on which the optimization/sampling is allegedly taking place.
We can:
- move from the latent space Ω to the observed space Θ (but not vice versa, since f is not invertible):
  θ = f(ω) + η    (Ω → Θ)
- bring the gradients of a generic function g : Θ → R from the observed space Θ to the latent space Ω:
  ∇_ω g(f(ω)) = μ_J^⊤ ∇_θ g(θ)    (Θ → Ω)
In this case a point estimate of J is given by the mean of its distribution (see the sketch below).
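A small illustrative helper, building on the Jacobian sketch above, that pulls an observed-space gradient back to the latent space via the chain rule; it is a sketch of the recap formula, not the authors' implementation.

```python
# Hedged sketch: pull the gradient of an objective g from the observed space Theta
# back to the latent space Omega via the mean Jacobian (chain rule).
# mean_jacobian() is the illustrative helper defined in the previous sketch.
import numpy as np

def latent_gradient(omega, Omega, Theta, grad_g_theta):
    """grad_omega g(f(omega)) = mu_J^T grad_theta g(theta), with theta = f(omega)."""
    mu_J = mean_jacobian(omega, Omega, Theta)   # (D, Q) posterior mean of the Jacobian
    return mu_J.T @ grad_g_theta                # (Q,)
```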
AdaGeo Gradient-based Optimization
Minimization problem: θ* = arg min_θ g(θ)
Iterative solution (e.g. (stochastic) gradient descent): θ_{t+1} = θ_t − Δθ_t(∇_θ g)
We propose, after having learned a latent representation with the GPLVM, to move the problem onto the latent space Ω.
Minimization problem: ω* = arg min_ω g(f(ω))
Iterative solution (e.g. (stochastic) gradient descent): ω_{t+1} = ω_t − Δω_t(∇_ω g)
AdaGeo Gradient-based Optimization
Algorithm 1: AdaGeo gradient-based optimization (minimization)
1: while convergence is not reached do
2:   Perform T_θ iterations with classic updates on the parameter space Θ:
       Δθ_t = Δθ_t(∇_θ g(θ)),  θ_{t+1} = θ_t − Δθ_t
3:   Train the GP-LVM model on the samples Θ = {θ_1, …, θ_{T_θ}}
4:   Perform T_ω further iterations using the AdaGeo optimizer:
       Δω_t = Δω_t(∇_ω g(f(ω))),  ω_{t+1} = ω_t − Δω_t,
     moving back to the parameter space with θ_{t+1} = f(ω_{t+1})
5: end while
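A hedged end-to-end sketch of Algorithm 1 using plain gradient-descent updates; `fit_gplvm` is an illustrative helper built on GPy, and `mean_jacobian` is the numpy function from the earlier Jacobian sketch. Hyperparameters and names are assumptions, not the authors' code.

```python
# Hedged sketch of Algorithm 1: gradient descent in Theta, then in the latent space.
import numpy as np
import GPy

def fit_gplvm(Theta, Q):
    """Illustrative GPLVM fit; returns latent points and the posterior-mean mapping f."""
    kern = GPy.kern.RBF(input_dim=Q, ARD=True)
    model = GPy.models.GPLVM(Theta, input_dim=Q, kernel=kern)
    model.optimize(messages=False)
    decode = lambda omega: model.predict(omega[None, :])[0].ravel()  # mean of f(omega)
    return np.array(model.X), decode

def adageo_minimize(grad_g, theta0, T_theta=20, T_omega=30, lr=1e-2, Q=3):
    # Phase 1: T_theta classic gradient-descent steps in the parameter space Theta.
    theta, history = theta0.copy(), [theta0.copy()]
    for _ in range(T_theta):
        theta = theta - lr * grad_g(theta)
        history.append(theta.copy())
    Theta = np.stack(history)
    # Phase 2: learn the latent space Omega and continue the descent there.
    Omega, decode = fit_gplvm(Theta, Q)
    omega = Omega[-1].copy()
    for _ in range(T_omega):
        theta = decode(omega)                                        # theta = f(omega)
        grad_omega = mean_jacobian(omega, Omega, Theta).T @ grad_g(theta)
        omega = omega - lr * grad_omega
    return decode(omega)
```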
Experiment: Logistic Regression on MNIST
Neural network with a single hidden layer implementing logistic regression on MNIST.
- Dimension of parameter space: D = 7850
- Dimension of latent space: Q = 9
- Iterations: T_θ = 20, T_ω = 30
[Figure: NN training loss vs. number of epochs — SGD vs. AdaGeo-SGD]
Experiment: Gaussian Process Training
Concrete compressive strength dataset [3]: regression task with 8 real variables.
- Composite kernel (RBF, Matérn, linear, bias)
- Dimension of parameter space: D = 9
- Dimension of latent space: Q = 3
- Iterations: T_θ = 15, T_ω = 15
[Figure: GP negative log-likelihood vs. number of iterations — gradient descent vs. AdaGeo-gradient descent]
[3] Lichman, M., UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (2013)
AdaGeo Bayesian Sampling
Bayesian sampling framework: a dataset X = {x_1, …, x_N} is given.
X is modeled with a generative model whose likelihood is
p(X | θ) = ∏_{i=1}^{N} p(x_i | θ),
parameterized by the vector θ ∈ R^D, with prior p(θ).
Performing statistical inference means getting insights on the posterior distribution
p(θ | X) = p(X | θ) p(θ) / p(X),
analytically or approximately through sampling.
AdaGeo Bayesian Sampling
Bayesian sampling framework: a dataset X = {x_1, …, x_N} is given.
X is modeled with a generative model whose likelihood is p(X | θ) = ∏_{i=1}^{N} p(x_i | θ), parameterized by the vector θ ∈ R^D, with prior p(θ).
Unfortunately the denominator p(X) is often intractable. One possible approach is to approximate the integral
p(X) = ∫_Θ p(X, θ) dθ = ∫_Θ p(X | θ) p(θ) dθ
through Markov Chain Monte Carlo or similar methods.
Stochastic Gradient Langevin Dynamics
Stochastic gradient Langevin dynamics [4] combines stochastic optimization and the physical concept of Langevin dynamics to build a posterior sampler.
At each time t a mini-batch is extracted and the parameters are updated as:
θ_{t+1} = θ_t + Δθ_t,
Δθ_t = (ϵ_t / 2) (∇_θ log p(θ_t) + (N/n) ∑_{i=1}^{n} ∇_θ log p(x_{ti} | θ_t)) + η_t,
η_t ∼ N(0, ϵ_t I),
with the learning rate ϵ_t satisfying ∑_{t=1}^{∞} ϵ_t = ∞ and ∑_{t=1}^{∞} ϵ_t² < ∞.
[4] Welling, M., Teh, Y. W., Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML) (2011)
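A minimal sketch of one SGLD update, assuming user-supplied gradient functions for the log-prior and per-datum log-likelihood; it follows the update rule above but is not the authors' code.

```python
# Hedged sketch of one SGLD step (Welling & Teh, 2011) in plain numpy.
import numpy as np

def sgld_step(theta, minibatch, N, eps_t, log_prior_grad, log_lik_grad, rng):
    """theta_{t+1} = theta_t + (eps_t/2) * stochastic gradient + N(0, eps_t I) noise."""
    n = len(minibatch)
    grad = log_prior_grad(theta) + (N / n) * sum(log_lik_grad(x, theta) for x in minibatch)
    noise = rng.normal(0.0, np.sqrt(eps_t), size=theta.shape)
    return theta + 0.5 * eps_t * grad + noise
```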
AdaGeo - Stochastic Gradient Langevin Dynamics
Analogously to before, we propose to:
1. Pick your favourite sampler and produce the first t samples to build the set Θ = {θ_1, …, θ_N}
2. Train a GPLVM on Θ to learn the latent space Ω
3. Move the updates onto the latent space with AdaGeo - stochastic gradient Langevin dynamics:
ω_{t+1} = ω_t + Δω_t,
Δω_t = (ϵ_t / 2) (∇_ω log p(f(ω_t)) + (N/n) ∑_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω_t))) + η_t,
η_t ∼ N(0, ϵ_t I)
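The corresponding latent-space step, sketched under the same assumptions: the gradient is computed at θ = f(ω) and pulled back through the mean Jacobian before the Langevin update. `decode` (the GPLVM mean mapping) and `mean_jacobian` are the illustrative helpers from the earlier sketches.

```python
# Hedged sketch of the AdaGeo-SGLD step: Langevin update taken in the latent space.
import numpy as np

def adageo_sgld_step(omega, minibatch, N, eps_t, log_prior_grad, log_lik_grad,
                     Omega, Theta, decode, rng):
    n = len(minibatch)
    theta = decode(omega)                                    # theta = f(omega)
    grad_theta = log_prior_grad(theta) + (N / n) * sum(
        log_lik_grad(x, theta) for x in minibatch)
    mu_J = mean_jacobian(omega, Omega, Theta)                # (D, Q) mean Jacobian
    grad_omega = mu_J.T @ grad_theta                         # chain rule pull-back
    noise = rng.normal(0.0, np.sqrt(eps_t), size=omega.shape)
    return omega + 0.5 * eps_t * grad_omega + noise
```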
Experiment: Sampling from the Banana Distribution
The banana distribution has density
p(θ) ∝ exp( −θ_1²/200 − (θ_2 − bθ_1² + 100b)²/2 − (1/2) ∑_{j=3}^{D} θ_j² ).
θ_1 and θ_2 present the interaction shown in the figure, while the other variables produce Gaussian noise.
[Figure: samples from the banana distribution in the (θ_1, θ_2) plane]
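A small sketch of the unnormalized banana log-density as reconstructed above; the value of b and the sign convention inside the squared term are assumptions.

```python
# Hedged sketch of the unnormalized banana log-density (b = 0.1 is an assumed value).
import numpy as np

def banana_log_density(theta, b=0.1):
    theta = np.asarray(theta, dtype=float)
    quad = -theta[0]**2 / 200.0                               # -theta_1^2 / 200
    bend = -0.5 * (theta[1] - b * theta[0]**2 + 100.0 * b)**2  # banana-shaped interaction
    rest = -0.5 * np.sum(theta[2:]**2)                         # remaining Gaussian noise dims
    return quad + bend + rest
```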
Experiment: Sampling from the Banana Distribution
A Metropolis-Hastings sampler returns the first 100 samples drawn from a 50-dimensional banana distribution. AdaGeo-SGLD is then employed to sample from a 5-dimensional latent space.
[Figure: samples in the (θ_1, θ_2) and (θ_3, θ_15) planes — MH sampler vs. AdaGeo sampler]
Bonus Round: Riemannian Extensions (theory only)
If the covariance function of a Gaussian Process is differentiable, then it is straightforward to show that the mapping f is also differentiable. Under this assumption we can compute the latent metric tensor G, which gives further information about the geometry of the latent space (distances, geodesics, etc.).
If J is the Jacobian of the mapping f, then G = J^⊤ J.
This yields a distribution over the metric tensor [5]:
G ∼ W_Q(D, Σ_J, E[J]^⊤ E[J]),
and a point estimate can be obtained with
E[J^⊤ J] = E[J]^⊤ E[J] + D Σ_J.
[5] Tosi, A., Hauberg, S., Vellido, A., Lawrence, N. D., Metrics for probabilistic geometries. Uncertainty in Artificial Intelligence (2014)
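A one-function sketch of the expected metric tensor implied by the formula above; the shapes (mean Jacobian D×Q, shared row covariance Q×Q) are assumptions consistent with the Jacobian model on the earlier slide.

```python
# Hedged sketch: expected latent metric tensor E[J^T J] = E[J]^T E[J] + D * Sigma_J.
import numpy as np

def expected_metric(mu_J, Sigma_J):
    """mu_J: (D, Q) mean Jacobian; Sigma_J: (Q, Q) shared row covariance."""
    D = mu_J.shape[0]
    return mu_J.T @ mu_J + D * Sigma_J
```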
Bonus Round: Stochastic Gradient Riemannian Langevin Dynamics
Stochastic gradient Riemannian Langevin dynamics [6] puts together the advantages of exploiting a known Riemannian geometry with the scalability of stochastic optimization approaches:
θ_{t+1} = θ_t + Δθ_t,
Δθ_t = (ϵ_t / 2) μ(θ_t) + G^{-1/2}(θ_t) η_t,
η_t ∼ N(0, ϵ_t I),
where
μ(θ)_j = (G⁻¹(θ) (∇_θ log p(θ) + (N/n) ∑_{i=1}^{n} ∇_θ log p(x_{ti} | θ)))_j
         − 2 ∑_{k=1}^{D} (G⁻¹(θ) ∂G(θ)/∂θ_k G⁻¹(θ))_{jk}
         + ∑_{k=1}^{D} (G⁻¹(θ))_{jk} Tr(G⁻¹(θ) ∂G(θ)/∂θ_k)
[6] Patterson, S., Teh, Y. W., Stochastic gradient Riemannian Langevin dynamics on the probability simplex. Advances in Neural Information Processing Systems (2013)
Bonus Round: AdaGeo - Stochastic Gradient Riemannian Langevin Dynamics
Analogously to the SGLD case, we can now move the update into the latent space with AdaGeo - stochastic gradient Riemannian Langevin dynamics:
ω_{t+1} = ω_t + Δω_t,
Δω_t = (ϵ_t / 2) μ(ω_t) + G_ω^{-1/2}(ω_t) η_t,
η_t ∼ N(0, ϵ_t I),
where
μ(ω)_j = (G_ω⁻¹(ω) (∇_ω log p(f(ω)) + (N/n) ∑_{i=1}^{n} ∇_ω log p(x_{ti} | f(ω))))_j
         − 2 ∑_{k=1}^{Q} (G_ω⁻¹(ω) ∂G_ω(ω)/∂ω_k G_ω⁻¹(ω))_{jk}
         + ∑_{k=1}^{Q} (G_ω⁻¹(ω))_{jk} Tr(G_ω⁻¹(ω) ∂G_ω(ω)/∂ω_k)
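A hedged sketch of the drift term μ(ω) in the latent update above, using finite differences for ∂G_ω/∂ω_k; this is purely illustrative (the metric function, step size h and stochastic gradient are assumed inputs), not the authors' implementation.

```python
# Hedged sketch of the SGRLD drift term mu(omega), with finite-difference dG/d omega_k.
import numpy as np

def sgrld_drift(omega, grad_log_post, metric, h=1e-5):
    """metric(omega) -> (Q, Q) tensor G_omega; grad_log_post is the stochastic gradient."""
    Q = omega.shape[0]
    G = metric(omega)
    G_inv = np.linalg.inv(G)
    mu = G_inv @ grad_log_post                                  # first term of mu(omega)
    for k in range(Q):
        e = np.zeros(Q)
        e[k] = h
        dG_k = (metric(omega + e) - metric(omega - e)) / (2 * h)  # dG_omega / d omega_k
        mu -= 2.0 * (G_inv @ dG_k @ G_inv)[:, k]                  # second term
        mu += G_inv[:, k] * np.trace(G_inv @ dG_k)                # third term
    return mu
```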
Conclusions
- We develop a generic framework for combining dimensionality-reduction techniques with sampling and optimization methods.
- We contribute to gradient-based optimization methods by coupling them with appropriate dimensionality-reduction techniques. In particular, we improve the performance of gradient descent and stochastic gradient descent when training, respectively, a Gaussian Process and a neural network.
- We contribute to Markov Chain Monte Carlo by developing an AdaGeo version of stochastic gradient Langevin dynamics; the information gathered through the latent space is employed to compute the steps of the Markov chain.
- We extend the approach to stochastic gradient Riemannian Langevin dynamics, thanks to the metric tensor naturally recovered by the GP-LVM model.
Thank you
Reference: Abbati, G., Tosi, A., Flaxman, S., Osborne, M. A. (2018). AdaGeo: Adaptive Geometric Learning for Optimization and Sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), to appear.