Sparse Approximations for Non-Conjugate Gaussian Process Regression

Thang Bui and Richard Turner
Computational and Biological Learning Lab, Department of Engineering, University of Cambridge

November 11, 2014

Abstract

Note: this report only shows some preliminary work on scaling Gaussian process models that use non-Gaussian likelihoods. As there are recently arXived papers on a similar idea [1, 2], this report will stay as is; please consult the two papers above for a proper discussion and experiments.

1 Introduction

Gaussian Processes (GPs) are an important tool for probabilistic modelling in both supervised and unsupervised settings. Typically, GPs are used as priors over unknown continuous functions which govern the trajectory of the observed variables. When the likelihood term is conjugate to this prior, e.g. a Gaussian likelihood, the posterior distribution is analytically tractable. It is, however, not tractable for non-conjugate likelihoods, such as a Poisson likelihood in Poisson count regression or a logistic likelihood in binary classification. Approximate inference schemes such as the Laplace approximation or Expectation Propagation can be applied in these cases [3]. Critically, these GP-based models suffer from a high computational complexity, typically O(N^3) for learning and O(N^2) for prediction. When dealing with large datasets, sparse approximation techniques that reduce this computational demand are preferred. However, most of these approximations were developed for regression problems with a Gaussian likelihood, and hence often cannot be applied in a straightforward way to problems with other forms of likelihood. This report introduces a variational inference formulation that

1. uses inducing points to augment the model, as in current approximations for the Gaussian likelihood,
2. uses the variational free energy approach as a basis to find a generic lower bound for many non-Gaussian likelihoods,
3. can perform joint optimisation of variational parameters and model hyperparameters,
4. offers a small computational complexity.

2 Review of GPs for regression

p(f) = N(f; 0, K_ff)                                                        (1)
p(y | f) = \prod_{n=1}^N p(y_n | f_n)                                       (2)

Type            | Distribution          | p(y_n | f_n)
----------------|-----------------------|-----------------------------------------------------
Continuous      | Gaussian              | p(y_n | f_n) = N(y_n; f_n, \sigma_n^2)
Binary          | Bernoulli logistic    | p(y_n = 1 | f_n) = \sigma(f_n)
Categorical     | Multinomial logistic  | p(y_n = k | f_i) = exp(f_{i,k} - lse(f_i))
Ordinal         | Cumulative logistic   | p(y_n \le k | f_n) = \sigma(\phi_k - f_n)
Count           | Poisson               | p(y_n = k | f_n) = exp(k f_n - exp(f_n)) / k!
Continuous      | Laplace               | p(y_n | f_n) = (1 / 2b) exp(-|y_n - f_n| / b)
Positive, real  | Exponential           | p(y_n | f_n) = exp(-y_n / exp(f_n)) / exp(f_n)

Table 1: Type of observed variables, possible likelihood functions and their forms.
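For concreteness, a few of the likelihoods in Table 1 can be written as log-density functions of the latent value f_n. The short Python/NumPy sketch below is purely illustrative; the function names and signatures are assumptions rather than anything prescribed by the report.

import numpy as np
from scipy.special import gammaln, logsumexp

# Illustrative log-likelihoods from Table 1, as functions of the latent value f_n.

def log_poisson(y, f):
    # p(y | f) = exp(y f - exp(f)) / y!
    return y * f - np.exp(f) - gammaln(y + 1.0)

def log_bernoulli_logistic(y, f):
    # p(y = 1 | f) = sigma(f); log p(y | f) = y f - log(1 + exp(f)), y in {0, 1}
    return y * f - np.logaddexp(0.0, f)

def log_multinomial_logistic(y, f):
    # p(y = k | f) = exp(f_k - lse(f)); f is a vector of K latent values, y an integer class index
    return f[y] - logsumexp(f)

def log_exponential(y, f):
    # p(y | f) = exp(-y / exp(f)) / exp(f), y > 0
    return -y * np.exp(-f) - f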

3 Variational inference and learning

3.1 Formulation

Augment the model with inducing variables u [4-7]. Choose a variational distribution for the joint, q(f, u) = q(f | u) q(u), with q(f | u) = p(f | u) [6]. It can be shown that the lower bound obtained with this augmented variational distribution is no tighter than the bound obtained without the auxiliary variables u [7]. The log marginal likelihood:

L = log p(y)                                                                (3)
  = log \int du df p(f, u) p(y | f)                                         (4)
  = log \int du df q(f, u) [ p(f, u) p(y | f) / q(f, u) ]                   (5)
  \ge \int du df q(f, u) log [ p(f, u) p(y | f) / q(f, u) ]                 (6)
  = \int du df q(f, u) log [ p(f | u) p(u) p(y | f) / (p(f | u) q(u)) ]     (7)
  = \int du q(u) log [ p(u) / q(u) ] + \int du df q(f, u) log p(y | f) = F  (8)

The first term in the lower bound above is the negative KL divergence between the distributions q(u) and p(u). The second term can be expressed as a sum of N terms, F_2 = \sum_{n=1}^N F_{2,n}, where

F_{2,n} = \int du q(u) \int df_n p(f_n | u) log p(y_n | f_n)                (9)
        = \int df_n log p(y_n | f_n) \int du q(u) p(f_n | u).               (10)

Therefore, the lower bound on the log marginal likelihood can be expressed as

F = -KL(q(u) || p(u)) + \sum_{n=1}^N \int df_n q(f_n) log p(y_n | f_n),     (11)

where q(f_n) = \int du q(u) p(f_n | u). In the next sections, we discuss the choice of the variational distribution q(u) such that the KL term and q(f_n) can be computed analytically, and how to find tight or exact bounds on the expectation of the local likelihood under q(f_n). We discuss several likelihood classes for which this approximation can be used.

3.2 Choice of variational distribution

If we choose a Gaussian variational distribution, q(u) = N(u; m, C), the first term in the lower bound, the negative KL divergence, can be computed exactly,

F_1 = -KL(q(u) || p(u))
    = -1/2 [ tr(K_uu^{-1} C) + (m - \mu_u)^T K_uu^{-1} (m - \mu_u) - M - log(det C / det K_uu) ],   (12)

where p(u) = N(u; \mu_u, K_uu) is the prior on the inducing variables. Typically \mu_u = 0, which leads to a trivial simplification of the above equation,

F_1 = -KL(q(u) || p(u)) = -1/2 [ tr(K_uu^{-1} C) + m^T K_uu^{-1} m - M - log(det C / det K_uu) ].   (13)
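As a sketch of how the KL term (13) might be evaluated in practice, the following Python/NumPy fragment computes F_1 for a single Gaussian q(u) with \mu_u = 0; the Cholesky-based implementation and the variable names are illustrative assumptions, not code from the report.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def neg_kl_gaussian(m, C, Kuu):
    # F_1 = -KL( N(m, C) || N(0, K_uu) ), equation (13), assuming mu_u = 0.
    M = Kuu.shape[0]
    L, lower = cho_factor(Kuu, lower=True)
    Kinv_C = cho_solve((L, lower), C)                 # K_uu^{-1} C
    Kinv_m = cho_solve((L, lower), m)                 # K_uu^{-1} m
    logdet_Kuu = 2.0 * np.sum(np.log(np.diag(L)))
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (np.trace(Kinv_C) + m @ Kinv_m - M - (logdet_C - logdet_Kuu))

In practice C would typically be parameterised through its Cholesky factor so that it stays positive definite during optimisation, which also connects to the parameter-count concern discussed next.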

The distribution q(u) is parameterised by its mean m and covariance matrix C, which means there are M + M^2 parameters in total. For a large M this may not be desirable and could lead to problems with the optimisation of the lower bound. TODO: 1. Is there any transformation we could do here to reduce the number of parameters? 2. Could use a sparse precision parameterisation as in [8].

Choosing a single Gaussian distribution to represent the posterior over u may be too coarse, and hence a multimodal distribution such as a mixture of Gaussians may be a better choice for a multimodal posterior distribution [9, 10] (TODO: this significantly reduces the number of variational parameters). We could use a mixture of (uniformly) weighted Gaussians with isotropic covariances,

q(u) = (1/K) \sum_{k=1}^K N(u; m_k, \sigma_k^2 I).                          (14)

However, this distribution leads to an intractability in computing the KL term, as it involves the log of a sum of distributions. Fortunately, we can bound this using Jensen's inequality [9, 11] as follows,

F_1 = -KL(q(u) || p(u))                                                     (15)
    = -\int du q(u) log [ q(u) / p(u) ]                                     (16)
    = -\int du q(u) log q(u) + \int du q(u) log p(u)                        (17)
    \ge -(1/K) \sum_{k=1}^K log \int du q(u) N(u; m_k, \sigma_k^2 I)
        + (1/K) \sum_{k=1}^K \int du N(u; m_k, \sigma_k^2 I) log p(u),      (18)

where we have used the inequality to bound the first (entropy) term above. TODO: R. Turner's comment: there has been limited success using mixtures of Gaussians to parameterise the variational distribution, perhaps because this bound is not tight! The first term in the bound above involves a convolution of two Gaussian distributions and hence

F_1a = -(1/K) \sum_{k=1}^K log \int du q(u) N(u; m_k, \sigma_k^2 I)         (19)
     = -(1/K) \sum_{k=1}^K log [ (1/K) \sum_{k'=1}^K N(m_{k'}; m_k, (\sigma_k^2 + \sigma_{k'}^2) I) ].   (20)

The second term is due to the assumption about the variational distribution,

F_1b = (1/K) \sum_{k=1}^K -1/2 [ M log(2\pi) + log det K_uu
       + (\mu_u - m_k)^T K_uu^{-1} (\mu_u - m_k) + tr(K_uu^{-1} C_k) ],     (21)

where C_k = \sigma_k^2 I. The expressions presented above are tractable to compute, and their gradients with respect to the variational parameters {m_k, \sigma_k}_{k=1}^K can be computed in closed form. Note that when K = 1 the bound above is exact, and we recover the result of the single Gaussian case presented at the beginning of this section. In what follows we present the results for the single Gaussian case, but the derivation for the mixture is straightforward.

3.3 Choice of local bounds

We have discussed the choice of q(u) and how to deal with the first term of the bound F. In this section we provide a treatment of the second term,

F_2 = \sum_{n=1}^N \int df_n q(f_n) log p(y_n | f_n),                       (22)

where

q(f_n) = \int du q(u) p(f_n | u)                                            (23)
       = \int du N(u; m, C) N(f_n; a_n u, b_n)                              (24)
       = N(f_n; a_n m, b_n + a_n C a_n^T),                                  (25)

and a_n = K_{f_n u} K_uu^{-1}, b_n = k_{f_n f_n} - K_{f_n u} K_uu^{-1} K_{u f_n} in the equations above. Let \mu_n = a_n m and v_n = b_n + a_n C a_n^T. We note that F_2 involves one-dimensional integrals, each being the expectation of a local likelihood term with respect to the distribution q(f_n). Since q(f_n) is Gaussian, the local integral can either be computed exactly or be bounded, so that a tractable lower bound on F_2 can be used instead of the original intractable integrals.

3.3.1 Poisson likelihood for count data regression

Consider

p(y_n = k | f_n) = exp(y_n f_n - exp(f_n)) / y_n!,                          (26)

which means

log p(y_n | f_n) = y_n f_n - exp(f_n) - log y_n!.                           (27)

Therefore,

F_{2,n} = \int df_n q(f_n) log p(y_n | f_n)                                 (28)
        = \int df_n N(f_n; \mu_n, v_n) [ y_n f_n - exp(f_n) - log y_n! ]    (29)
        = y_n \mu_n - exp(\mu_n + v_n / 2) - log y_n!                       (30)
        = y_n a_n m - exp(a_n m + (b_n + a_n C a_n^T) / 2) - log y_n!.      (31)

In words, F_2 can be computed analytically when the likelihood is a Poisson distribution. We have used the exponential inverse-link function here; see [2] for a treatment of a different inverse-link function.
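To make equations (25) and (30) concrete, the sketch below computes the marginal moments (\mu_n, v_n) of q(f_n) and the closed-form Poisson term; it is an illustrative NumPy fragment, with assumed inputs K_{fu}, diag(K_{ff}), K_uu, m and C.

import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.special import gammaln

def marginal_moments(Kfu, kff_diag, Kuu, m, C):
    # q(f_n) = N(mu_n, v_n) with mu_n = a_n m and v_n = b_n + a_n C a_n^T, equation (25).
    L = cho_factor(Kuu, lower=True)
    A = cho_solve(L, Kfu.T).T                       # rows are a_n = K_{f_n u} K_uu^{-1}
    mu = A @ m
    b = kff_diag - np.sum(A * Kfu, axis=1)          # b_n = k_{f_n f_n} - K_{f_n u} K_uu^{-1} K_{u f_n}
    v = b + np.sum((A @ C) * A, axis=1)
    return mu, v

def poisson_F2(y, mu, v):
    # Equation (30): E_q[ y f - exp(f) - log y! ] = y mu - exp(mu + v/2) - log y!, summed over n.
    return np.sum(y * mu - np.exp(mu + 0.5 * v) - gammaln(y + 1.0))

The same marginal moments (\mu_n, v_n) feed the per-likelihood terms in the following subsections.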

3.3.2 Bernoulli logistic likelihood for logit regression or binary classification

Bounds on the log partition function can be found using a Taylor approximation [12] or the Fenchel inequality [13]. These bounds have been used extensively, but it has recently been shown that they are not tight, especially when converted back to the logistic function [14]. Motivated by the performance of the piecewise linear/quadratic bound in [14, 15], we apply this technique to Bernoulli logistic and ordinal regression problems, in this section and the next section respectively.

Consider the logistic function for binary classification or logit regression,

p(y_n = 1 | f_n) = 1 / (1 + exp(-f_n)),  so that  p(y_n | f_n) = exp(y_n f_n) / (1 + exp(f_n)).   (32)

The log likelihood is

log p(y_n | f_n) = y_n f_n - log(1 + exp(f_n)).                             (33)

Using the piecewise bound in [14], we obtain

F_{2,n} = \int df_n q(f_n) [ y_n f_n - log(1 + exp(f_n)) ]                  (34)
        \ge y_n \mu_n - \int df_n q(f_n) bound(log(1 + exp(f_n)))           (35)
        = y_n \mu_n - \sum_{r=1}^R \int_{h_r} df_n q(f_n) (a_{n,r} f_n^2 + b_{n,r} f_n + c_{n,r})   (36)
        = y_n \mu_n - \sum_{r=1}^R [ a_{n,r} E_{h_r}[f_n^2 | \mu_n, v_n] + b_{n,r} E_{h_r}[f_n | \mu_n, v_n] + c_{n,r} E_{h_r}[f_n^0 | \mu_n, v_n] ],   (37)

where E_l^h[f^m | \mu, \sigma^2] = \int_l^h df N(f; \mu, \sigma^2) f^m. Importantly, the gradients of the expectations in the equations above with respect to the mean and variance {\mu_n, v_n} of the integrating distribution can be computed, which means the gradients with respect to the variational parameters and the inducing inputs x_u can also be computed. The forms of the truncated Gaussian moments can be found in, e.g., [16, pp. 144-145].

3.3.3 Cumulative logistic likelihood for ordinal regression

The following is an extension of [16, Ch. 6, pp. 116-117]. Consider the cumulative logistic likelihood,

p(y_n \le k | f_n) = \sigma(\phi_k - f_n),                                  (38)

which means

p(y_n = k | f_n) = \sigma(\phi_k - f_n) - \sigma(\phi_{k-1} - f_n)          (39)
                 = e^{f_n} (e^{-\phi_{k-1}} - e^{-\phi_k}) / ( [1 + e^{-(\phi_{k-1} - f_n)}][1 + e^{-(\phi_k - f_n)}] ),   (40)

hence its log,

log p(y_n = k | f_n) = f_n + log(e^{-\phi_{k-1}} - e^{-\phi_k}) - llp(f_n - \phi_k) - llp(f_n - \phi_{k-1}),   (41)

where llp(x) = log(1 + exp(x)). Similar to the previous section, we can obtain a bound on the llp function using R linear/quadratic pieces. The expectations of these pieces under a one-dimensional Gaussian, and their corresponding gradients, can also be obtained as in the sections above.
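The piecewise bounds of Sections 3.3.2 and 3.3.3 only require truncated Gaussian moments E_l^h[f^m | \mu, \sigma^2] for m = 0, 1, 2. The SciPy sketch below is illustrative; the piecewise coefficients (l_r, h_r, a_r, b_r, c_r) are assumed to be supplied externally, e.g. precomputed as in [14].

import numpy as np
from scipy.stats import norm

def _zphi(z):
    # z * N(z; 0, 1), taken to be 0 at z = +/- inf (outer pieces have infinite endpoints)
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    ok = np.isfinite(z)
    out[ok] = z[ok] * norm.pdf(z[ok])
    return out

def truncated_moments(l, h, mu, v):
    # E_l^h[f^m | mu, v] = int_l^h N(f; mu, v) f^m df, for m = 0, 1, 2.
    s = np.sqrt(v)
    alpha, beta = (l - mu) / s, (h - mu) / s
    E0 = norm.cdf(beta) - norm.cdf(alpha)
    E1 = mu * E0 + s * (norm.pdf(alpha) - norm.pdf(beta))
    E2 = (mu**2 + v) * E0 + 2.0 * mu * s * (norm.pdf(alpha) - norm.pdf(beta)) \
         + v * (_zphi(alpha) - _zphi(beta))
    return E0, E1, E2

def piecewise_expectation(y, mu, v, pieces):
    # Equation (37): y*mu - sum_r [ a_r E2 + b_r E1 + c_r E0 ] over pieces (l_r, h_r, a_r, b_r, c_r)
    total = 0.0
    for (l, h, a, b, c) in pieces:
        E0, E1, E2 = truncated_moments(l, h, mu, v)
        total += a * E2 + b * E1 + c * E0
    return y * mu - total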

3.3.4 Laplace likelihood

Consider the Laplace likelihood,

p(y_n | f_n) = (1 / 2b) exp(-|y_n - f_n| / b).                              (42)

Taking the log gives

log p(y_n | f_n) = -log(2b) - |y_n - f_n| / b                               (43)
                 = { -log(2b) - (y_n - f_n) / b   if y_n \ge f_n
                   { -log(2b) - (f_n - y_n) / b   if y_n < f_n.             (44)

Hence the integral of log p(y_n | f_n) under q(f_n) is analytically tractable and involves evaluating the Gaussian cumulative distribution function... TODO.

3.3.5 Exponential likelihood

Consider the exponential likelihood,

p(y_n | f_n) = exp(-y_n / exp(f_n)) / exp(f_n).                             (45)

Taking the log gives

log p(y_n | f_n) = -y_n / exp(f_n) - f_n,                                   (46)

which means the integral will be tractable, as in the case of the Poisson likelihood with the exponential inverse-link function above.

3.3.6 Multinomial logistic likelihood for categorical regression or multiclass classification

Consider the likelihood for multinomial logistic regression,

p(y_n = k | f_i) = exp(f_{i,k} - lse(f_i)).                                 (47)

TODO: [15-18]

3.3.7 Other likelihood classes that cannot be analytically bounded

If a bound on the log-likelihood term g(f_n) = log p(y_n | f_n) cannot be found analytically, we can use stochastic optimisation to obtain an unbiased estimate of F_{2,n,k} and its gradients with respect to the mean and variance of the distribution q_k(f_n). That is,

F_{2,n,k} \approx (1/T) \sum_{t=1}^T g(f_t),   f_t \sim q_k(f_n).           (48)

Furthermore, if q_k(f_n) is parameterised by its mean \mu_{n,k} and variance v_{n,k}, the gradient of the expectation can be expressed as an expectation of a gradient. This is due to Price's theorem [19] and Bonnet's theorem [20], which have been employed recently for deep models [8]. That is,

d F_{2,n,k} / d\mu_{n,k} = d E_{q_k(f_n)}[g(f_n)] / d\mu_{n,k} = E_{q_k(f_n)}[ d g(f_n) / d f_n ],             (49-50)
d F_{2,n,k} / d v_{n,k} = d E_{q_k(f_n)}[g(f_n)] / d v_{n,k} = (1/2) E_{q_k(f_n)}[ d^2 g(f_n) / d f_n^2 ].     (51)

The above equations mean that the gradients can be estimated using Monte Carlo summation. The condition for these theorems to hold is that log p(y_n | f_n) is twice differentiable with respect to f_n. Two potential issues with this approach are:

- Samples drawn from q_k(u) may not be important, as they can fall into regions in which g_n(u) is too small.
- The variance of the unbiased estimates may be large. TODO: variance reduction techniques?: Rao-Blackwellisation, change of variables?, read [8, 21].

However, the problem may not be as bad as it sounds, as the expectation above is only one-dimensional and therefore the number of samples needed is much smaller than in similar approaches in high dimensions [8, 22].
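For Section 3.3.7, a minimal Monte Carlo sketch of the estimator (48) and the gradient identities (49)-(51) is given below; it assumes the user supplies g together with its first and second derivatives (e.g. obtained by hand or via automatic differentiation), and all names are illustrative.

import numpy as np

def mc_local_term_and_grads(g, dg, d2g, mu, v, T=100, rng=None):
    # Unbiased Monte Carlo estimates of F_{2,n} = E_{N(f; mu, v)}[g(f)] and of its
    # gradients wrt mu and v, using equations (48)-(51):
    #   dF/dmu = E[ g'(f) ],   dF/dv = 1/2 E[ g''(f) ].
    rng = np.random.default_rng() if rng is None else rng
    f = mu + np.sqrt(v) * rng.standard_normal(T)    # samples f_t ~ q(f_n) = N(mu, v)
    F = np.mean(g(f))
    dF_dmu = np.mean(dg(f))
    dF_dv = 0.5 * np.mean(d2g(f))
    return F, dF_dmu, dF_dv

# Example: Bernoulli-logistic term with y_n = 1, where
#   g(f) = f - log(1 + exp(f)), g'(f) = 1 - sigma(f), g''(f) = -sigma(f)(1 - sigma(f)).
sigma = lambda f: 1.0 / (1.0 + np.exp(-f))
F, dmu, dv = mc_local_term_and_grads(
    g=lambda f: f - np.logaddexp(0.0, f),
    dg=lambda f: 1.0 - sigma(f),
    d2g=lambda f: -sigma(f) * (1.0 - sigma(f)),
    mu=0.3, v=0.5, T=2000)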

3.4 Complexity

O(NM^2).

4 Can this be applied to non-conjugate GPLVMs?

The answer is no if you are looking for an analytic solution, because of the non-linear dependence on X in p(f | U), but stochastic approximation could be used. Again, getting this to work in high dimensions could be challenging.

References

[1] J. Hensman, A. Matthews, and Z. Ghahramani, Scalable variational Gaussian process classification, 2014.
[2] C. Lloyd, T. Gunter, M. A. Osborne, and S. J. Roberts, Variational inference for Gaussian process modulated Poisson processes, 2014.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[4] J. Quiñonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, The Journal of Machine Learning Research, vol. 6, pp. 1939-1959, 2005.
[5] E. Snelson and Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in Advances in Neural Information Processing Systems 19, pp. 1257-1264, MIT Press, 2006.
[6] M. K. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in International Conference on Artificial Intelligence and Statistics, pp. 567-574, 2009.
[7] T. Salimans, Markov chain Monte Carlo and variational inference: Bridging the gap, 2014.
[8] D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in Proceedings of the 31st International Conference on Machine Learning, pp. 1278-1286, 2014.
[9] S. Gershman, M. Hoffman, and D. M. Blei, Nonparametric variational inference, in Proceedings of the 29th International Conference on Machine Learning, pp. 663-670, 2012.
[10] T. Nguyen and E. Bonilla, Automated variational inference for Gaussian process models, in Advances in Neural Information Processing Systems (N. Lawrence and C. Cortes, eds.), vol. 26, 2014.
[11] M. Huber, T. Bailey, H. Durrant-Whyte, and U. Hanebeck, On entropy approximation for Gaussian mixture random vectors, in Multisensor Fusion and Integration for Intelligent Systems (MFI 2008), IEEE International Conference on, pp. 181-188, Aug 2008.
[12] D. Böhning, Multinomial logistic regression algorithm, Annals of the Institute of Statistical Mathematics, vol. 44, no. 1, pp. 197-200, 1992.
[13] T. S. Jaakkola and M. I. Jordan, Bayesian parameter estimation via variational methods, Statistics and Computing, vol. 10, no. 1, pp. 25-37, 2000.
[14] B. M. Marlin, M. E. Khan, and K. P. Murphy, Piecewise bounds for estimating Bernoulli-logistic latent Gaussian models, in Proceedings of the 28th International Conference on Machine Learning, 2011.
[15] M. E. Khan, S. Mohamed, and K. P. Murphy, Fast Bayesian inference for non-conjugate Gaussian process regression, in Advances in Neural Information Processing Systems, pp. 3140-3148, 2012.
[16] M. E. Khan, Variational learning for latent Gaussian models of discrete data. PhD thesis, The University of British Columbia, 2012.
[17] G. Bouchard, Efficient bounds for the softmax function and applications to inference in hybrid models, in NIPS 2007 Workshop on Approximate Bayesian Inference in Continuous/Hybrid Systems, 2007.
[18] D. Blei and J. Lafferty, Correlated topic models, Advances in Neural Information Processing Systems, vol. 18, p. 147, 2006.
[19] R. Price, A useful theorem for nonlinear devices having Gaussian inputs, IRE Transactions on Information Theory, vol. 4, pp. 69-72, June 1958.

[20] G. Bonnet, Transformations des signaux aléatoires à travers les systèmes non linéaires sans mémoire, Annales des Télécommunications, vol. 19, no. 9-10, pp. 203-220, 1964.
[21] R. Ranganath, S. Gerrish, and D. Blei, Black box variational inference, in Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 814-822, 2014.
[22] M. Titsias and M. Lázaro-Gredilla, Doubly stochastic variational Bayes for non-conjugate inference, in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1971-1979, 2014.