Non-Gaussian likelihoods for Gaussian Processes


1 Non-Gaussian likelihoods for Gaussian Processes. Alan Saul, University of Sheffield.

2 Outline: Motivation; Laplace approximation; KL method; Expectation Propagation; Comparing approximations.

3 GP regression. Model the observations, y, as a distorted version of the process, f: y_i ~ N(f(x_i), σ²). f is a non-linear function; in our case we assume it is latent, and it is assigned a Gaussian process prior. [Figure: realizations of f(x), observations, and 95% credible intervals for p(y*|y).]

4 GP regression setting. So far we have assumed that the latent values, f, have been corrupted by Gaussian noise. Everything remains analytically tractable. Gaussian prior: f ~ GP(0, K_ff), so p(f) = N(f | 0, K_ff). Gaussian likelihood: p(y | f) = N(y | f, σ²I) = ∏_{i=1}^n p(y_i | f_i). Gaussian posterior: p(f | y) ∝ N(y | f, σ²I) N(f | 0, K_ff).
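
In the conjugate case this posterior is available in closed form. A minimal numpy sketch, where the RBF kernel, lengthscale, noise level, and toy data are all illustrative assumptions:

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

# Toy inputs and observations (assumed for illustration).
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
sigma2 = 0.1 ** 2                              # Gaussian noise variance

Kff = rbf(X, X)
# Closed-form posterior p(f | y) = N(mu, C):
#   mu = Kff (Kff + sigma2 I)^{-1} y
#   C  = Kff - Kff (Kff + sigma2 I)^{-1} Kff
A = np.linalg.solve(Kff + sigma2 * np.eye(len(X)), np.eye(len(X)))
mu = Kff @ A @ y
C = Kff - Kff @ A @ Kff
```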

5 Likelihood. p(y | f) is the probability of the observed data given the latent function values f. It can also be seen as the likelihood that the latent function values, f, would give rise to the observed data, y. So far we have assumed that the distortion of the underlying latent function, f, that gives rise to the observed data, y, is independent and Gaussian. This is often not the case: count data, binary data, etc.

6 Binary example. Binary outcomes for y_i, with y_i ∈ {0, 1}. Model the probability of y_i = 1 with a transformation of a GP. The probability of a 1 must lie between 0 and 1, so use a squashing transformation, λ(f_i) = Φ(f_i), where Φ is the standard normal CDF. Then y_i = 1 with probability λ(f_i), and y_i = 0 with probability 1 − λ(f_i). [Figure: realizations of Φ(f(x)), realizations of f(x), and observations.]
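
A small generative sketch of this model: draw f from a GP prior, squash it through Φ, and sample Bernoulli outcomes. The kernel and inputs are assumptions for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50)
# Squared-exponential prior covariance (assumed kernel; jitter for stability).
Kff = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(len(X))
f = rng.multivariate_normal(np.zeros(len(X)), Kff)   # one realization of f

p_one = norm.cdf(f)            # lambda(f_i) = Phi(f_i), squashed into (0, 1)
y = rng.binomial(1, p_one)     # y_i = 1 with probability Phi(f_i), else 0
```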

7 Count data example. Non-negative, discrete values only for y_i: y_i ∈ ℕ. Model the rate (intensity), λ, of events with a transformation of a Gaussian process. The rate parameter must remain positive, so use a transformation that maintains positivity, λ(f_i) = exp(f_i) or λ(f_i) = f_i². Then y_i ~ Poisson(y_i | λ_i = λ(f_i)), with Poisson(y_i | λ_i) = λ_i^{y_i} e^{−λ_i} / y_i!. [Figure: realizations of exp(f(x)), observations, and 95% credible intervals for p(y*|y).]
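
The analogous generative sketch for count data, using the exp link; kernel and inputs are again illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 50)
# Squared-exponential prior covariance (assumed kernel; jitter for stability).
Kff = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2) + 1e-8 * np.eye(len(X))
f = rng.multivariate_normal(np.zeros(len(X)), Kff)   # one realization of f

rate = np.exp(f)               # lambda(f_i) = exp(f_i) keeps the rate positive
y = rng.poisson(rate)          # y_i ~ Poisson(y_i | lambda(f_i))
```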

8 Non-Gaussian posteriors. Exact computation of the posterior is no longer analytically tractable, because the Gaussian process prior is not conjugate to the non-Gaussian likelihood p(y | f):

p(f | y) = p(f) ∏_{i=1}^n p(y_i | f_i) / ∫ p(f) ∏_{i=1}^n p(y_i | f_i) df.

Various methods exist to make a Gaussian approximation, q(f) ≈ p(f | y). This allows simple predictions:

p(f* | y) = ∫ p(f* | f) p(f | y) df ≈ ∫ p(f* | f) q(f) df.
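
Because q(f) is Gaussian and p(f* | f) is a Gaussian conditional under the GP prior, the approximate predictive ∫ p(f* | f) q(f) df is itself Gaussian. A sketch under that assumption, taking the kernel matrices and a fitted q(f) = N(m, C) as given inputs:

```python
import numpy as np

def predict_latent(Kff, Ksf, Kss, m, C):
    """Return mean and covariance of int p(f* | f) q(f) df for Gaussian q(f) = N(m, C).

    Kff: train/train covariance, Ksf: test/train covariance, Kss: test/test covariance.
    """
    A = np.linalg.solve(Kff, Ksf.T).T           # A = Ksf Kff^{-1}
    mean_star = A @ m                           # predictive mean of the latent f*
    cov_star = Kss - A @ Ksf.T + A @ C @ A.T    # prior conditional + posterior uncertainty
    return mean_star, cov_star
```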

9 Laplace approximation. Find the mode of the true log posterior via Newton's method. Use a second-order Taylor expansion around this modal value, i.e. obtain the curvature at this point. Form a Gaussian approximation by setting the mean equal to the posterior mode, f̂, and matching the curvature:

p(f | y) ≈ q(f | µ, C) = N(f | f̂, (K_ff^{-1} + W)^{-1}), where W = −∇∇ log p(y | f̂) is the negative Hessian of the log likelihood at f̂.

For factorizing likelihoods (most), W is diagonal.
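
A minimal sketch of the Newton iteration for finding f̂ and the Laplace covariance, assuming a Poisson likelihood with exp link so that the gradient and W have simple forms; this is the naive (non Cholesky-stabilized) version:

```python
import numpy as np

def laplace_approx_poisson(Kff, y, max_iter=100, tol=1e-8):
    """Laplace approximation for a GP with Poisson likelihood and exp link (a sketch).

    Maximizes log p(y|f) - 0.5 f^T Kff^{-1} f by Newton's method, then returns the
    mode fhat and covariance (Kff^{-1} + W)^{-1} with W = -d^2 log p(y|f)/df^2 at fhat.
    """
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        grad = y - np.exp(f)                 # d log p(y|f) / df for Poisson + exp link
        W = np.diag(np.exp(f))               # negative Hessian of log p(y|f), diagonal
        # Newton step: f_new = (Kff^{-1} + W)^{-1} (W f + grad) = Kff (I + W Kff)^{-1} (W f + grad)
        f_new = Kff @ np.linalg.solve(np.eye(n) + W @ Kff, W @ f + grad)
        if np.max(np.abs(f_new - f)) < tol:
            f = f_new
            break
        f = f_new
    W = np.diag(np.exp(f))
    C = np.linalg.inv(np.linalg.inv(Kff) + W)   # Laplace covariance at the mode
    return f, C
```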

10–20 Visualization of Laplace. [Figure sequence: the prior, likelihood, and (non-Gaussian) posterior over f are plotted and the approximation is built up step by step; the curvature is evaluated at the posterior mode and the resulting Laplace (Gaussian) approximation is overlaid on the posterior.]

21 KL-method. Make a Gaussian approximation, q(f) = N(f | µ, C), as similar as possible to the true posterior, p(f | y). Treat µ and C as variational parameters, affecting the quality of the approximation. Define a divergence measure between the two distributions, the KL divergence KL(q(f) ‖ p(f | y)). Minimize this divergence between the two distributions (Nickisch and Rasmussen, 2008).

22 KL divergence. General for any two distributions q(x) and p(x). KL(q(x) ‖ p(x)) is the average additional amount of information required to specify the value of x as a result of using an approximate distribution q(x) instead of the true distribution, p(x):

KL(q(x) ‖ p(x)) = ⟨log q(x) − log p(x)⟩_{q(x)} = ∫ q(x) log [q(x) / p(x)] dx.

It is always zero or positive, and it is not symmetric. Let's look at how it changes in response to changes in the approximating distribution.
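
The sweeps on the following slides use the closed-form KL between univariate Gaussians, which also makes the asymmetry easy to check numerically; the particular parameter values below are illustrative:

```python
import numpy as np

def kl_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for univariate Gaussians."""
    return 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # 0.0: identical distributions
print(kl_gauss(2.0, 1.0, 0.0, 1.0))   # grows as the means move apart
print(kl_gauss(0.0, 0.3, 0.0, 1.0),   # asymmetry: these two values differ
      kl_gauss(0.0, 1.0, 0.0, 0.3))
```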

23–27 KL varying mean. [Figure sequence: q(x) = N(µ, 1.0) with µ varied across several values against a fixed unit-variance p(x); the right panel traces KL[q(x) ‖ p(x)] as a function of µ.]

28–32 KL varying variance. [Figure sequence: q(x) = N(µ, σ²) with σ² varied from 0.3 to 2.0 against a fixed unit-variance p(x); the right panel traces KL[q(x) ‖ p(x)] as a function of σ².]

33 KL-method derivation. Assume a Gaussian approximate posterior, q(f) = N(f | µ, C). The true posterior, using Bayes' rule, is p(f | y) = p(y | f) p(f) / p(y). We cannot compute the KL divergence directly, as we cannot compute the true posterior, p(f | y):

KL(q(f) ‖ p(f | y)) = ⟨log q(f) − log p(f | y)⟩_{q(f)}
= ⟨log q(f) − log p(y | f) − log p(f) + log p(y)⟩_{q(f)}
= KL(q(f) ‖ p(f)) − ⟨log p(y | f)⟩_{q(f)} + log p(y).

Rearranging: log p(y) = ⟨log p(y | f)⟩_{q(f)} − KL(q(f) ‖ p(f)) + KL(q(f) ‖ p(f | y)).

34 KL-method derivation (continued).

log p(y) = ⟨log p(y | f)⟩_{q(f)} − KL(q(f) ‖ p(f)) + KL(q(f) ‖ p(f | y)) ≥ ⟨log p(y | f)⟩_{q(f)} − KL(q(f) ‖ p(f)).

The tractable terms give a lower bound on log p(y), since KL(q(f) ‖ p(f | y)) is always positive. Adjust the variational parameters µ and C to make the tractable terms as large as possible, and thereby make KL(q(f) ‖ p(f | y)) as small as possible. With a factorizing likelihood, ⟨log p(y | f)⟩_{q(f)} can be computed as a series of n one-dimensional integrals. In practice, the number of variational parameters can be reduced by reparameterizing C = (K_ff^{-1} + 2Λ)^{-1} with Λ diagonal, noting that the likelihood term of the bound involves C only through its diagonal.
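
One way to evaluate those one-dimensional integrals is Gauss-Hermite quadrature. A sketch for a Poisson likelihood with exp link (the likelihood choice and quadrature order are assumptions; other factorizing likelihoods would only change the log_lik line):

```python
import numpy as np
from scipy.special import gammaln

def expected_log_lik_poisson(y, means, variances, order=20):
    """Approximate sum_i E_{q(f_i)}[log p(y_i | f_i)] by Gauss-Hermite quadrature,
    with q(f_i) = N(means[i], variances[i]) and p(y_i | f_i) = Poisson(y_i | exp(f_i))."""
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    # Change of variables f = mean + sqrt(2 var) * node turns each Gaussian expectation
    # into a weighted sum over the quadrature nodes.
    f = means[:, None] + np.sqrt(2.0 * variances[:, None]) * nodes[None, :]
    log_lik = y[:, None] * f - np.exp(f) - gammaln(y[:, None] + 1.0)
    return np.sum(log_lik @ weights) / np.sqrt(np.pi)
```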

35 Expectation Propagation.

p(f | y) ∝ p(f) ∏_{i=1}^n p(y_i | f_i)
q(f | y) = (1 / Z_EP) p(f) ∏_{i=1}^n t_i(f_i | Z̃_i, µ̃_i, σ̃²_i) = N(f | µ, Σ), where t_i(f_i | Z̃_i, µ̃_i, σ̃²_i) = Z̃_i N(f_i | µ̃_i, σ̃²_i).

The individual likelihood terms, p(y_i | f_i), are replaced by independent local likelihoods, t_i. An iterative algorithm is used to update the t_i's.

36 Expectation Propagation.
1. From the current posterior, q(f | y), leave out one of the local likelihoods, t_i, then marginalize out f_{j≠i}, giving rise to the cavity distribution, q_{−i}(f_i).
2. Combine the cavity distribution, q_{−i}(f_i), with the exact likelihood contribution, p(y_i | f_i), giving a non-Gaussian, un-normalized distribution, q̂(f_i) ∝ p(y_i | f_i) q_{−i}(f_i).
3. Choose an un-normalized Gaussian approximation to this distribution, Ẑ_i N(f_i | µ̂_i, σ̂²_i), by finding the moments of q̂(f_i).
4. Replace the parameters of t_i with those that produce the same moments as this approximation.
5. Choose another i and start again. Repeat to convergence.

37 Expectation Propagation - in math.
Step 1. First choose a marginal, i, to focus on. The current approximation is q(f | y) ∝ p(f) ∏_{j=1}^n t_j(f_j); removing site i gives p(f) ∏_{j≠i} t_j(f_j) = p(f) ∏_{j=1}^n t_j(f_j) / t_i(f_i), and marginalizing gives the cavity q_{−i}(f_i) ∝ ∫ p(f) ∏_{j≠i} t_j(f_j) df_{j≠i}.
Step 2. p(y_i | f_i) q_{−i}(f_i) ∝ q̂(f_i), with normalizer Ẑ_i.
Step 3. q̂(f_i) ≈ N(f_i | µ̂_i, σ̂²_i).
Step 4. Compute the parameters of t_i(f_i | Z̃_i, µ̃_i, σ̃²_i) making the moments of q(f_i) match those of Ẑ_i N(f_i | µ̂_i, σ̂²_i).
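
A sketch of one moment-matching site update for a probit likelihood, using the ±1 label convention and the standard closed-form moments of Φ times a Gaussian; the cavity parameters are assumed to have already been computed from the current q(f | y):

```python
import numpy as np
from scipy.stats import norm

def ep_site_update_probit(y_i, cav_mu, cav_var):
    """One EP site update for p(y_i | f_i) = Phi(y_i f_i), y_i in {-1, +1} (a sketch).

    Returns the new site natural parameters (precision, precision * mean)."""
    z = y_i * cav_mu / np.sqrt(1.0 + cav_var)
    ratio = norm.pdf(z) / norm.cdf(z)
    # Steps 2-3: moments of qhat(f_i) proportional to Phi(y_i f_i) N(f_i | cav_mu, cav_var).
    mu_hat = cav_mu + y_i * cav_var * ratio / np.sqrt(1.0 + cav_var)
    var_hat = cav_var - cav_var ** 2 * ratio * (z + ratio) / (1.0 + cav_var)
    # Step 4: site parameters chosen so that cavity * site reproduces these moments.
    site_precision = 1.0 / var_hat - 1.0 / cav_var
    site_precision_mean = mu_hat / var_hat - cav_mu / cav_var
    return site_precision, site_precision_mean
```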

38 Comparing posterior approximations. [Figure: two panels over (f_1, f_2), the prior p(f_1, f_2) and the likelihood p(y = 1 | f_1, f_2).] Gaussian prior between two function values {f_1, f_2}, at inputs {x_1, x_2} respectively. Bernoulli likelihood, with y_1 = 1 and y_2 = 1.

39 Comparing posterior approximations. [Figure: two panels over (f_1, f_2), the true posterior and the Laplace approximation.] The true posterior is non-Gaussian. Laplace approximates it with a Gaussian centred at the mode of the posterior.

40 Comparing posterior approximations. [Figure: two panels over (f_1, f_2), the true posterior and the KL approximation.] The true posterior is non-Gaussian. The KL method approximates it with the Gaussian that has minimal KL divergence, KL(q(f) ‖ p(f | y)). This leads to distributions that avoid regions in which p(f | y) is small: there is a large penalty for assigning density where there is none.

41 Comparing posterior approximations. [Figure: two panels over (f_1, f_2), the true posterior and the EP approximation.] The true posterior is non-Gaussian. EP tends to put density where p(f | y) is large, and cares less about assigning density where there is none. This contrasts with the KL method.

42 Comparing posterior marginal approximations. [Figure: marginals for f_2 compared; legend: true posterior, Laplace, EP, KL.] Laplace: poor approximation. KL: avoids assigning density to areas where there is none, at the expense of areas where there is some (right tail). EP: assigns density to areas with density, at the expense of areas where there is none (left tail).

43 Pros - Cons - When - Laplace. Laplace approximation. Pros: very fast. Cons: poor approximation if the mode does not describe the posterior well, for example with a Bernoulli (probit) likelihood. When: when the posterior is well characterized by its mode, for example Poisson.

44 Pros - Cons - When - KL. KL method. Pros: principled, in that we are directly optimizing a measure of divergence between the approximation and the true distribution. Can be relatively quick, and lends itself to sparse approximations (Hensman et al., 2015). Cons: requires factorizing likelihoods to avoid an n-dimensional integral. When: when the Laplace approximation is poor or the likelihood is not Bernoulli.

45 Pros - Cons - When - EP. EP method. Pros: very effective for certain likelihoods (classification). Cons: slow, though possible to extend to the sparse case. Convergence issues for certain likelihoods. Must be able to match moments. When: binary data (Nickisch and Rasmussen, 2008; Kuß, 2006), perhaps with a truncated likelihood (censored data) (Vanhatalo et al., 2015).

46 Pros - Cons - When - MCMC. MCMC methods. Pros: in the theoretical limit, gives the true distribution. Cons: can be slow when the data is large. When: if time is not an issue, but exact accuracy is. If you are unsure whether a different approximation is appropriate, it can be used as a ground truth.

47 Questions Thanks for listening. Any questions?

48 References
Hensman, J., Matthews, A. G. D. G., and Ghahramani, Z. (2015). Scalable variational Gaussian process classification. In 18th International Conference on Artificial Intelligence and Statistics, pages 1-9, San Diego, California, USA.
Kuß, M. (2006). Gaussian Process Models for Robust Regression, Classification, and Reinforcement Learning. PhD thesis, TU Darmstadt.
Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9.
Vanhatalo, J., Riihimäki, J., Hartikainen, J., Jylänki, P., Tolvanen, V., and Vehtari, A. (2015). GPstuff.
