State Space Gaussian Processes with Non-Gaussian Likelihoods
Hannes Nickisch¹, Arno Solin², Alexander Grigorievskiy²,³
¹Philips Research, ²Aalto University, ³Silo.AI
ICML 2018, July 13, 2018
Outline
- Gaussian processes
- Temporal GPs as stochastic differential equations (SDEs)
- Learning and inference with Gaussian likelihoods
- Speeding up computation of the state space model parameters
- Non-Gaussian likelihoods
- Approximate inference algorithms
- Computational primitives and how to compute them
- Experiments
Def: Gaussian processes (GPs)
A Gaussian process (GP) is a stochastic process where, for any inputs t, the corresponding outputs y are jointly distributed as y ~ N(m(t), K(t, t'; θ)). Denoted: f(t) ~ GP(m(t), k(t, t'; θ)).
- Used as a prior over continuous functions in statistical models
- Properties (e.g. smoothness) are determined by the covariance function k(t, t'; θ)
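The definition can be exercised numerically. A minimal sketch in NumPy, using a squared-exponential covariance with hypothetical hyperparameters (magn2, ell); any finite set of inputs yields a joint Gaussian from which we can draw a prior sample:

```python
import numpy as np

def k_se(t1, t2, magn2=1.0, ell=1.0):
    """Squared-exponential covariance k(t, t'; theta), theta = (magn2, ell)."""
    tau = t1[:, None] - t2[None, :]
    return magn2 * np.exp(-0.5 * (tau / ell) ** 2)

# Any finite set of inputs t gives a joint Gaussian N(m(t), K(t, t; theta)).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
K = k_se(t, t)
m = np.zeros_like(t)                                        # zero mean function
f = rng.multivariate_normal(m, K + 1e-9 * np.eye(len(t)))   # one draw from the prior
```

The small jitter on the diagonal is a standard numerical safeguard when sampling.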
Temporal Gaussian processes
- Input data is 1-D, usually time
- Fully probabilistic (Bayesian) approach
- Structural components are conveniently combined by covariance operations
Challenges:
1. Large datasets
2. Non-Gaussian likelihoods
3. Applicability to unevenly sampled data
GP as a stochastic differential equation (SDE): addressing challenge 1
Given a 1-D time series {y_i, t_i}_{i=1}^N, the Gaussian process model is
f(t) ~ GP(m(t), k(t, t'))   (GP prior)
y | f ~ ∏_{i=1}^N P(y_i | f(t_i))   (likelihood)
Equivalent stochastic differential equation (SDE) [3]:
df(t)/dt = F f(t) + L w(t),   f_0 ~ N(0, P∞)
y | f ~ ∏_{i=1}^N P(y_i | H f(t_i)),   f(t) = H f(t)
- w(t) is multidimensional white noise
- F, L, H, P∞ are determined from the covariance function k [3]
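For the Matérn-3/2 covariance the matrices F, L, H, P∞ are known in closed form. A sketch of that conversion, with the stationary Lyapunov equation F P∞ + P∞ Fᵀ + L q Lᵀ = 0 as a built-in sanity check (hyperparameter names magn2, ell are placeholders):

```python
import numpy as np

def matern32_to_ss(magn2=1.0, ell=1.0):
    """Exact state space form of the Matern-3/2 kernel
    k(tau) = magn2 * (1 + lam*|tau|) * exp(-lam*|tau|), with lam = sqrt(3)/ell."""
    lam = np.sqrt(3.0) / ell
    F = np.array([[0.0, 1.0],
                  [-lam**2, -2.0 * lam]])     # drift (feedback) matrix
    L = np.array([[0.0], [1.0]])              # noise effect matrix
    H = np.array([[1.0, 0.0]])                # measurement model: f(t) = H f(t)
    Qc = 4.0 * lam**3 * magn2                 # white-noise spectral density q
    Pinf = np.diag([magn2, lam**2 * magn2])   # stationary state covariance
    return F, L, H, Qc, Pinf

F, L, H, Qc, Pinf = matern32_to_ss()
# Pinf solves F Pinf + Pinf F^T + L Qc L^T = 0, and H Pinf H^T = k(0) = magn2.
```

Higher-order Matérn and sums/products of kernels have analogous (exact or approximate) state space forms.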
Inference and learning with a Gaussian likelihood
Gaussian likelihood: P(y_i | f(t_i)) = N(y_i | f(t_i), σ_n²)
Posterior parameters: W = σ_n⁻² I,   α = (K + W⁻¹)⁻¹ (y − m)
Evidence:
log Z_GPR = −(1/2) αᵀ(y − m) − (1/2) log|I + σ_n⁻² K| − (N/2) log(2π σ_n²)
The naïve approach has O(N³) complexity.
Solving the SDE between time points gives an equivalent discrete-time model:
f_i = A_{i−1} f_{i−1} + q_{i−1},   q_{i−1} ~ N(0, Q_{i−1})
y_i = H f_i + ε_i,   ε_i ~ N(0, σ_n²)
Parameters of the discrete model: A_i = A[Δt_i] = e^{Δt_i F},   Q_i = P∞ − A_i P∞ A_iᵀ
Inference and learning by the Kalman filter (KF) and Rauch-Tung-Striebel (RTS) smoother in O(N) complexity.
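The O(N) recursion can be sketched as follows, assuming the exact Matérn-3/2 state space form, a zero mean function, and synthetic data; this is an illustrative sketch, not the GPML implementation. Summing the log densities of the Kalman filter innovations gives exactly the GP evidence log Z_GPR:

```python
import numpy as np
from scipy.linalg import expm

def log_Z_kalman(t, y, magn2, ell, sn2):
    """O(N) GP-regression evidence via Kalman filtering of the exact
    Matern-3/2 state space model (zero mean function assumed)."""
    lam = np.sqrt(3.0) / ell
    F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
    H = np.array([[1.0, 0.0]])
    Pinf = np.diag([magn2, lam**2 * magn2])
    m, P, logZ = np.zeros((2, 1)), Pinf.copy(), 0.0
    for i in range(len(t)):
        if i > 0:                                  # predict step
            A = expm((t[i] - t[i-1]) * F)          # A_i = e^{dt_i F}
            Q = Pinf - A @ Pinf @ A.T              # Q_i = Pinf - A_i Pinf A_i^T
            m, P = A @ m, A @ P @ A.T + Q
        v = y[i] - (H @ m)[0, 0]                   # innovation
        S = (H @ P @ H.T)[0, 0] + sn2              # innovation variance
        k = P @ H.T / S                            # Kalman gain
        m, P = m + k * v, P - k @ (H @ P)          # update step
        logZ += -0.5 * (np.log(2 * np.pi * S) + v**2 / S)
    return logZ

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 10, 200))
y = np.sin(t) + 0.1 * rng.standard_normal(200)
logZ = log_Z_kalman(t, y, magn2=1.0, ell=1.0, sn2=0.01)
```

Because the Markov representation is exact for this kernel, the result matches the naïve O(N³) evidence to numerical precision.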
Fast computation of A_i and Q_i by interpolation
Problem: when there are many distinct Δt_i, computing the state space model parameters A_i = e^{Δt_i F} and Q_i can be slow.
Solution: the mapping ψ: s ↦ e^{sF} is smooth, hence interpolate (similar to KISS-GP [4]). Evaluate ψ on an equispaced grid s_1, s_2, ..., s_K, where s_j = s_0 + j Δs, and use 4-point interpolation:
A ≈ c_1 A_{j−1} + c_2 A_j + c_3 A_{j+1} + c_4 A_{j+2}
The coefficients {c_i}_{i=1}^4 are efficiently computable.

Direct Kullback-Leibler minimization (KL):
log Z_KL = −(1/2) [αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i l_KL(f_i) + 2 ρ_KL]
where the remainder ρ_KL = tr(∂ld_K(W)/∂W · W) can be computed using computational primitive 4.

Assumed density filtering (ADF), a.k.a. single-sweep expectation propagation (EP): in EP (Minka, 2001), the non-Gaussian likelihoods P(y_i | f_i) are replaced by unnormalized Gaussians t_i(f_i) = exp(b_i f_i − W_ii f_i²/2), and their parameters (b_i, W_ii) are iteratively (in multiple passes) updated such that Q_{−i}(f_i) P(y_i | f_i) and Q_{−i}(f_i) t_i(f_i) have identical moments; here Q_{−i}(f_i) ∝ ∫ N(f | m, K) ∏_{j≠i} t_j(f_j) df_{−i} denotes the cavity distribution. Unlike full state space EP using forward and backward passes (Heskes & Zoeter, 2002), there is a single-pass variant doing only one forward sweep that is known as assumed density filtering (ADF). It is very simple to implement in the GP setting. The marginal likelihood approximation takes the form
log Z_ADF = −(1/2) [αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i log z_i + 2 ρ_ADF]
where the remainder ρ_ADF collects a number of scalar terms depending on (b, W, α, m).

Figure 1: Relative differences in log Z with different approximation grid sizes K for A_i and Q_i when solving a GP regression problem. Results calculated over 20 independent repetitions, mean ± min/max errors visualized.
Figure 2: Empirical computational times of GP prediction using the GPML toolbox implementation as a function of the number of training inputs, n, and the degree of approximation, K (naïve vs. state space with K = 2000 and K = 10). For all four methods the maximum absolute error in predicted means was 10⁻⁹. Results calculated over ten independent runs.
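The grid-interpolation idea can be sketched as follows. The Catmull-Rom coefficients below are one concrete choice of 4-point scheme, and the drift matrix, grid spacing, and test step are assumptions for illustration:

```python
import numpy as np
from scipy.linalg import expm

# Precompute A(s_j) = e^{s_j F} on an equispaced grid, then approximate
# e^{dt F} for arbitrary dt by 4-point (cubic) interpolation.
lam = np.sqrt(3.0)                                  # Matern-3/2 drift, ell = 1
F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
s0, ds, K_grid = 0.0, 0.005, 401                    # grid s_j = s0 + j*ds over [0, 2]
A_grid = np.stack([expm((s0 + j * ds) * F) for j in range(K_grid)])

def A_interp(dt):
    """Approximate e^{dt F} as c1 A_{j-1} + c2 A_j + c3 A_{j+1} + c4 A_{j+2}."""
    j = int((dt - s0) / ds)                         # left grid index
    u = (dt - s0) / ds - j                          # fractional position in [0, 1)
    c = np.array([-0.5*u**3 + u**2 - 0.5*u,         # Catmull-Rom coefficients
                   1.5*u**3 - 2.5*u**2 + 1.0,
                  -1.5*u**3 + 2.0*u**2 + 0.5*u,
                   0.5*u**3 - 0.5*u**2])
    return np.einsum('c,cij->ij', c, A_grid[j-1:j+3])

dt = 0.4567                                         # an off-grid time step
A_exact, A_approx = expm(dt * F), A_interp(dt)
```

Each query then costs four small matrix additions instead of a fresh matrix exponential, which is where the speed-up on data with many distinct Δt_i comes from.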
Non-Gaussian likelihoods: addressing challenge 2
The posterior is approximated by a Gaussian: Q(f | D) = N(f | m + Kα, (K⁻¹ + W)⁻¹)
- Laplace approximation (LA)
- Variational Bayes (VB)
- Direct Kullback-Leibler minimization (KL)
- Assumed density filtering (ADF), a.k.a. single-sweep expectation propagation (EP)
Laplace approximation:
log P(f | D) ∝ log P(y | f) + log P(f | θ)
- Find the mode f̂ of this function by Newton's method
- The negative Hessian at the mode f̂ gives the precision: W = −∇² log P(y | f̂)
log Z_LA = −(1/2) [αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i log P(y_i | f̂_i)]
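A sketch of the Newton mode search behind the Laplace approximation, assuming a Poisson likelihood (the log Gaussian Cox setting used later in the talk) and synthetic data; all hyperparameters are placeholders:

```python
import numpy as np

# Mode finding for the Laplace approximation: Poisson counts
# y_i | f_i ~ Poisson(exp(f_i)) under a GP prior f ~ N(0, K).
rng = np.random.default_rng(2)
t = np.linspace(0, 4, 40)
tau = np.abs(t[:, None] - t[None, :])
K = (1 + np.sqrt(3) * tau) * np.exp(-np.sqrt(3) * tau)   # Matern-3/2, unit scales
y = rng.poisson(np.exp(np.sin(t)))

f = np.zeros_like(t)
for _ in range(100):                       # Newton iterations
    g = y - np.exp(f)                      # gradient of log P(y | f)
    W = np.exp(f)                          # W = -Hessian of log P(y | f), diagonal
    b = W * f + g
    # f_new = (K^{-1} + W)^{-1} b = K (I + W K)^{-1} b  (avoids forming K^{-1})
    f = K @ np.linalg.solve(np.eye(len(t)) + W[:, None] * K, b)
# At the mode, the stationarity condition f_hat = K * grad log P(y | f_hat) holds.
```

Because the Poisson likelihood is log-concave, the objective is concave and the Newton iteration converges to the unique mode.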
Computational primitives
The following computational primitives allow casting the covariance approximation in more generic terms:
1. Linear system solving: solve_K(W, r) := (K + W⁻¹)⁻¹ r
2. Matrix-vector multiplication: mvm_K(r) := K r
3. Log-determinants: ld_K(W) := log|B|, with well-conditioned B = I + W^{1/2} K W^{1/2}
4. Predictions need the latent mean E[f*] and variance V[f*]
Tackling the computational primitives using the state space form of temporal GPs
- SpInGP: the first two computational primitives are computed with the SpInGP [5] approach. The idea: using the state space form, compose the inverse of the covariance matrix, which turns out to be block tridiagonal.
- KF and RTS smoothing: the last two primitives are solved by Kalman filtering and RTS smoothing.
- Predictions are computed by primitive 4 and then by propagation through the likelihood.
Comments:
- Derivatives of the computational primitives, required for learning, are computed in a similar way.
- SpInGP involves computations with block-tridiagonal matrices; these computations are similar to KF and RTS smoothing (see the appendix of [1]).
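The block-tridiagonal structure that SpInGP exploits can be checked numerically. A sketch assuming a Matérn-3/2 model and arbitrary (hypothetical) time steps: the joint precision of the stacked Markov states has zero blocks beyond the first off-diagonal:

```python
import numpy as np
from scipy.linalg import expm

# For a Markov (state space) GP, the joint precision of the stacked states
# f_0, ..., f_{n-1} is block tridiagonal.
lam = np.sqrt(3.0)
F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])    # Matern-3/2 drift
Pinf = np.diag([1.0, lam**2])
dts = np.array([0.3, 0.5, 0.2, 0.4])                 # steps between 5 states
n, d = len(dts) + 1, 2

# Stationary state sequence: Cov(f_i, f_j) = Phi_{i,j} Pinf for i >= j,
# where Phi_{i,j} is the product of transition matrices from t_j to t_i.
A = [expm(dt * F) for dt in dts]
Sigma = np.zeros((n * d, n * d))
for i in range(n):
    for j in range(i + 1):
        Phi = np.eye(d)
        for k in range(j, i):
            Phi = A[k] @ Phi
        Sigma[i*d:(i+1)*d, j*d:(j+1)*d] = Phi @ Pinf
        Sigma[j*d:(j+1)*d, i*d:(i+1)*d] = (Phi @ Pinf).T
Lambda = np.linalg.inv(Sigma)                        # block-tridiagonal precision
```

Solving with this sparse precision instead of the dense covariance is what makes primitives 1 and 2 linear in n.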
Experiments 2-3
The experiments are designed to support the paper's findings and claims:
1. A robust regression study (Student's t likelihood) with n = 34,154 observations
2. Numerical effects in non-Gaussian likelihoods
Experiment 4
- An interesting new dataset: commercial aircraft accident dates scraped from Wikipedia [6]
- Accidents over a time span of 100 years, n = 35,959 days
- We model the accident intensity as a log Gaussian Cox process (Poisson likelihood)
- The GP prior is set up as: k(t, t') = k_Mat.(t, t') + k_per.(t, t') k_Mat.(t, t')
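A sketch of such a composite covariance (Matérn plus a periodic component damped by another Matérn); the hyperparameters below are placeholders, not the values learned in the paper:

```python
import numpy as np

def k_mat32(tau, magn2=1.0, ell=1.0):
    """Matern-3/2 covariance as a function of the lag tau."""
    lam = np.sqrt(3.0) / ell
    return magn2 * (1 + lam * np.abs(tau)) * np.exp(-lam * np.abs(tau))

def k_per(tau, magn2=1.0, ell=1.0, period=1.0):
    """Periodic (exp-sine-squared) covariance."""
    return magn2 * np.exp(-2.0 * np.sin(np.pi * np.abs(tau) / period)**2 / ell**2)

def k_total(tau):
    # k(t, t') = k_Mat(t, t') + k_per(t, t') * k_Mat(t, t')
    return k_mat32(tau, ell=50.0) + k_per(tau, period=1.0) * k_mat32(tau, ell=20.0)

t = np.linspace(0.0, 100.0, 150)
K = k_total(t[:, None] - t[None, :])   # sums and products of kernels remain PSD
```

The long-lengthscale Matérn captures the slow trend, while the damped periodic term captures a quasi-periodic (here yearly-style) pattern whose phase can drift.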
Conclusions
- This paper brings together research on state space GPs and non-Gaussian approximate inference
- We improve stability and provide additional speed-up by fast computation of the state space model parameters
- We provide unifying code for all approaches in the GPML toolbox v4.2 [7]
Visit our poster: #151
References
[1] H. Nickisch, A. Solin, and A. Grigorievskiy (2018). State Space Gaussian Processes with Non-Gaussian Likelihood. In ICML.
[2] C. E. Rasmussen and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.
[3] J. Hartikainen and S. Särkkä (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In MLSP.
[4] A. G. Wilson and H. Nickisch (2015). Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In ICML.
[5] A. Grigorievskiy, N. Lawrence, and S. Särkkä (2017). Parallelizable Sparse Inverse Formulation Gaussian Processes (SpInGP). In MLSP.
[6] Wikipedia (2018). URL https://en.wikipedia.org/wiki/list_of_accidents_and_incidents_involving_commercial_aircraft
[7] C. E. Rasmussen and H. Nickisch (2010). Gaussian Processes for Machine Learning (GPML) Toolbox. In JMLR.