State Space Gaussian Processes with Non-Gaussian Likelihoods

State Space Gaussian Processes with Non-Gaussian Likelihoods
Hannes Nickisch¹, Arno Solin², Alexander Grigorievskiy²,³
¹ Philips Research, ² Aalto University, ³ Silo.AI
ICML 2018, July 13, 2018

Outline
- Gaussian Processes
- Temporal GPs as stochastic differential equations (SDEs)
- Learning and inference with Gaussian likelihoods
- Speeding up computation of state space model parameters
- Non-Gaussian likelihoods
- Approximate inference algorithms
- Computational primitives and how to compute them
- Experiments

Def: Gaussian Processes (GPs)
A Gaussian process (GP) is a stochastic process where, for any inputs t, the corresponding outputs y are jointly distributed as y ~ N(m(t), K(t, t' | θ)). Denoted: f(t) ~ GP(m(t), k(t, t' | θ))
- Used as a prior over continuous functions in statistical models
- Properties (e.g. smoothness) are determined by the covariance function k(t, t' | θ)
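As a concrete illustration of the covariance function determining the properties of the prior, here is a minimal NumPy sketch (not from the talk) that builds a Matérn-3/2 covariance matrix and draws a few functions from the zero-mean GP prior; the jitter term is an added assumption for numerical stability.

```python
import numpy as np

def k_matern32(t1, t2, magn2=1.0, ell=1.0):
    """Matern-3/2 covariance k(t, t') used as an example GP prior covariance."""
    r = np.abs(t1[:, None] - t2[None, :])
    s = np.sqrt(3.0) * r / ell
    return magn2 * (1.0 + s) * np.exp(-s)

# Draw a few sample functions from the zero-mean GP prior f(t) ~ GP(0, k)
t = np.linspace(0.0, 10.0, 200)
K = k_matern32(t, t) + 1e-9 * np.eye(t.size)   # jitter for numerical stability
samples = np.random.multivariate_normal(np.zeros(t.size), K, size=3)
```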

Temporal Gaussian Processes
- Input data is 1-D, usually time
- Fully probabilistic (Bayesian) approach
- Structural components are conveniently combined by covariance operations
Challenges:
1. Large datasets
2. Non-Gaussian likelihoods
3. Applicability to unevenly sampled data

GP as a Stochastic Differential Equation (SDE) — Addressing challenge 1
Given a 1-D time series {y_i, t_i}_{i=1}^N, the Gaussian process model is
  GP prior: f(t) ~ GP(m(t), k(t, t'))
  Likelihood: y | f ~ ∏_{i=1}^n P(y_i | f(t_i))
Equivalent stochastic differential equation (SDE) [3]:
  d𝐟(t)/dt = F 𝐟(t) + L w(t),   𝐟(0) ~ N(0, P∞)
  y | f ~ ∏_{i=1}^n P(y_i | H 𝐟(t_i))
- The latent function is recovered as f(t) = H 𝐟(t); w(t) is multidimensional white noise
- F, L, H, P∞ are determined by the covariance function k [3]
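For the Matérn-3/2 covariance the matrices F, L, H, P∞ have a well-known closed form (Hartikainen & Särkkä [3]). The sketch below is a hedged Python illustration, not the authors' code: it constructs the state space model and discretizes the SDE over a gap Δt via A = exp(F Δt), Q = P∞ − A P∞ Aᵀ.

```python
import numpy as np
from scipy.linalg import expm

def matern32_to_ss(magn2=1.0, ell=1.0):
    """State space form (F, L, H, Pinf, qc) of the Matern-3/2 covariance."""
    lam = np.sqrt(3.0) / ell
    F = np.array([[0.0, 1.0], [-lam**2, -2.0 * lam]])
    L = np.array([[0.0], [1.0]])
    H = np.array([[1.0, 0.0]])
    qc = 4.0 * lam**3 * magn2                      # white-noise spectral density
    Pinf = np.diag([magn2, lam**2 * magn2])        # stationary state covariance
    return F, L, H, Pinf, qc

def discretize(F, Pinf, dt):
    """Discrete-time transition A = exp(F dt) and noise Q = Pinf - A Pinf A^T."""
    A = expm(F * dt)
    Q = Pinf - A @ Pinf @ A.T
    return A, Q
```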

Inference and Learning with a Gaussian Likelihood
Gaussian likelihood: P(y_i | f(t_i)) = N(y_i | f(t_i), σ_n²)
Posterior parameters: W = σ_n⁻² I,   α = (K + W⁻¹)⁻¹ (y − m)
Evidence: log Z_GPR = −½ [ αᵀ(y − m) + log|I + σ_n⁻² K| + N log(2π σ_n²) ]
The naïve approach has O(N³) complexity.
Solve the SDE between time points (equivalent discrete-time model):
  𝐟_i = A_{i−1} 𝐟_{i−1} + q_{i−1},   q_{i−1} ~ N(0, Q_{i−1})
  y_i = H 𝐟_i + ε_i,   ε_i ~ N(0, σ_n²)
Parameters of the discrete model: A_i = A[Δt_i] = exp(Δt_i F),   Q_i = P∞ − A_i P∞ A_iᵀ
Inference and learning by the Kalman filter (KF) and Rauch–Tung–Striebel (RTS) smoother in O(N) complexity
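A minimal Kalman-filter sketch of the O(N) evidence computation for the Gaussian-likelihood case is given below. It reuses the `matern32_to_ss`/`discretize` helpers from the previous sketch, assumes a zero prior mean and scalar observations, and omits the RTS smoothing pass and hyperparameter gradients; the data names in the usage comment are hypothetical.

```python
import numpy as np

def kalman_gp_loglik(t, y, F, H, Pinf, sigma_n, discretize):
    """O(N) GP-regression log marginal likelihood via Kalman filtering of the
    equivalent discrete-time state space model (zero prior mean, scalar y_i)."""
    d = F.shape[0]
    m = np.zeros((d, 1))
    P = Pinf.copy()
    logZ = 0.0
    t_prev = t[0]
    for ti, yi in zip(t, y):
        # Predict: propagate the state over the (possibly uneven) gap dt
        dt = ti - t_prev
        if dt > 0:
            A, Q = discretize(F, Pinf, dt)
            m = A @ m
            P = A @ P @ A.T + Q
        t_prev = ti
        # Update with observation y_i = H f_i + eps_i, eps_i ~ N(0, sigma_n^2)
        v = yi - float(H @ m)                  # innovation
        s = float(H @ P @ H.T) + sigma_n**2    # innovation variance
        k = (P @ H.T) / s                      # Kalman gain
        m = m + k * v
        P = P - k @ (H @ P)
        logZ += -0.5 * (np.log(2 * np.pi * s) + v**2 / s)
    return logZ

# Example usage (hypothetical data arrays t, y):
# F, L, H, Pinf, qc = matern32_to_ss(magn2=1.0, ell=2.0)
# lp = kalman_gp_loglik(t, y, F, H, Pinf, sigma_n=0.1, discretize=discretize)
```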

Fast Computation of A_i and Q_i by Interpolation
Problem: when there are many distinct Δt_i, computing the state space model parameters can be slow.
Solution: the mapping ψ : Δt ↦ e^{Δt F} is smooth, hence interpolation (similar to KISS-GP [4]):
- Evaluate ψ on an equispaced grid s_1, ..., s_K, where s_j = s_0 + j Δs
- Use 4-point interpolation: A ≈ c_1 A_{j−1} + c_2 A_j + c_3 A_{j+1} + c_4 A_{j+2}
- The coefficients {c_i}_{i=1}^4 are efficiently computable

[Figure 1: Relative absolute difference in log Z as a function of the approximation grid size K (for A_i and Q_i) when solving a GP regression problem. Results over 20 independent repetitions, mean ± min/max errors.]
[Figure 2: Empirical computation times of GP prediction (GPML toolbox implementation) as a function of the number of training inputs n and the degree of approximation K: naïve, state space, state space (K = 2000), state space (K = 10). For all four methods the maximum absolute error in predicted means was 10⁻⁹. Results over ten independent runs.]

Direct KL minimization yields the marginal likelihood approximation
  log Z_KL = −½ [ αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i ℓ_KL(f_i) − 2 ρ_KL ],
where the remainder ρ_KL = tr(W ∂ld_K(W)/∂W) can be computed using computational primitive 4.

Assumed density filtering (ADF), a.k.a. single-sweep expectation propagation (EP): in EP (Minka, 2001) the non-Gaussian likelihoods P(y_i | f_i) are replaced by unnormalized Gaussians t_i(f_i) = exp(b_i f_i − W_ii f_i²/2), whose parameters (b_i, W_ii) are iteratively (in multiple passes) updated so that Q_{−i}(f_i) P(y_i | f_i) and Q_{−i}(f_i) t_i(f_i) have identical moments z_i^k = ∫ f_i^k Q_{−i}(f_i) t_i(f_i) df_i for k = 0, 1, 2. Here Q_{−i}(f_i) ∝ ∫ N(f | m, K) ∏_{j≠i} t_j(f_j) df_{−i} denotes the cavity distribution. Unlike full state space EP with forward and backward passes (Heskes & Zoeter, 2002), there is a single-pass variant doing only one forward sweep, known as assumed density filtering (ADF). It is very simple to implement in the GP setting; in fact, ADF is readily implemented by Algorithm 2 of the paper when the ADF flag is switched on. The marginal likelihood approximation takes the form
  log Z_ADF = −½ [ αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i log z_i⁰ − 2 ρ_ADF ],
where the remainder ρ_ADF collects a number of scalar terms depending on (b, W, α, m).

The experiments focus on showing that the state space formulation delivers the exactness of the full naïve solution, but with appealing computational benefits and wide applicability.
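The following sketch illustrates the grid idea for the transition matrices. The talk only states that the four interpolation coefficients are efficiently computable, so cubic-convolution (Catmull-Rom) weights are used here purely as an illustrative assumption, and boundary handling is simplified by clamping the grid index.

```python
import numpy as np
from scipy.linalg import expm

def precompute_grid(F, s0, ds, K):
    """Precompute A_j = exp(s_j * F) on the equispaced grid s_j = s0 + j*ds."""
    return [expm((s0 + j * ds) * F) for j in range(K)]

def interp_transition(A_grid, s0, ds, dt):
    """4-point interpolation A(dt) ~ c1*A_{j-1} + c2*A_j + c3*A_{j+1} + c4*A_{j+2}.
    Catmull-Rom weights are an illustrative assumption, not the paper's choice."""
    x = (dt - s0) / ds
    j = int(np.clip(np.floor(x), 1, len(A_grid) - 3))  # assumes the grid covers dt
    u = x - j                                          # fractional grid position
    c1 = -0.5 * u**3 + u**2 - 0.5 * u                  # weight for A_{j-1}
    c2 =  1.5 * u**3 - 2.5 * u**2 + 1.0                # weight for A_j
    c3 = -1.5 * u**3 + 2.0 * u**2 + 0.5 * u            # weight for A_{j+1}
    c4 =  0.5 * u**3 - 0.5 * u**2                      # weight for A_{j+2}
    return (c1 * A_grid[j - 1] + c2 * A_grid[j]
            + c3 * A_grid[j + 1] + c4 * A_grid[j + 2])
```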

Non-Gaussian Likelihoods — Addressing challenge 2
Posterior as a Gaussian approximation: Q(f | D) = N(f | m + Kα, (K⁻¹ + W)⁻¹)
- Laplace approximation (LA)
- Variational Bayes (VB)
- Direct Kullback–Leibler minimization (KL)
- Assumed density filtering (ADF), a.k.a. single-sweep expectation propagation (EP)
Laplace approximation:
  log P(f | D) = log P(y | f) + log P(f | t) + const
- Find the mode f̂ of this function by Newton's method
- The negative Hessian of the log likelihood at the mode f̂ gives the precision W = −∇² log P(y | f̂)
  log Z_LA = −½ [ αᵀ mvm_K(α) + ld_K(W) − 2 Σ_i log P(y_i | f̂_i) ]
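Below is a dense-K sketch of the Laplace mode search (Newton iterations in the numerically stable parametrization of Rasmussen & Williams [2], Algorithm 3.1), with a Poisson likelihood as an example. It assumes a zero prior mean, omits step-size control and convergence checks, and is not the state space implementation of the paper.

```python
import numpy as np

def laplace_mode(K, y, log_lik_grad_hess, n_iter=20):
    """Newton iterations for the Laplace-approximation mode f_hat of a GP posterior
    with a non-Gaussian likelihood (dense-K sketch, zero prior mean).
    log_lik_grad_hess(f, y) returns (grad, w) of sum_i log P(y_i | f_i),
    where w is the diagonal of W = -d^2/df^2 log P(y | f)."""
    n = y.size
    f = np.zeros(n)
    for _ in range(n_iter):
        g, w = log_lik_grad_hess(f, y)
        W_sqrt = np.sqrt(w)
        B = np.eye(n) + (W_sqrt[:, None] * K) * W_sqrt[None, :]   # I + W^1/2 K W^1/2
        b = w * f + g
        a = b - W_sqrt * np.linalg.solve(B, W_sqrt * (K @ b))     # stable Newton step
        f = K @ a
    return f, a

# Example: Poisson likelihood (log Gaussian Cox on a grid), log P(y|f) = y f - exp(f) - log(y!)
def poisson_grad_hess(f, y):
    mu = np.exp(f)
    return y - mu, mu
```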

Computational Primitives
The following computational primitives allow casting the covariance approximation in more generic terms:
1. Linear system solving: solve_K(W, r) := (K + W⁻¹)⁻¹ r
2. Matrix-vector multiplications: mvm_K(r) := K r
3. Log-determinants: ld_K(W) := log|B|, with well-conditioned B = I + W^{1/2} K W^{1/2}
4. Predictions, which need the latent mean E[f*] and variance V[f*]
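For reference, here are naive dense-matrix versions of the first three primitives (with W represented by its diagonal w). These cost O(N³)/O(N²); the point of the paper is that the state space form delivers the same quantities in O(N).

```python
import numpy as np

def solve_K(K, w, r):
    """Primitive 1: (K + W^{-1})^{-1} r, with W = diag(w)."""
    return np.linalg.solve(K + np.diag(1.0 / w), r)

def mvm_K(K, r):
    """Primitive 2: K r."""
    return K @ r

def ld_K(K, w):
    """Primitive 3: log|B| with well-conditioned B = I + W^{1/2} K W^{1/2}."""
    ws = np.sqrt(w)
    B = np.eye(K.shape[0]) + (ws[:, None] * K) * ws[None, :]
    return np.linalg.slogdet(B)[1]
```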

Tackling the Computational Primitives Using the State Space Form of Temporal GPs
SpInGP: the first two computational primitives are computed with the SpInGP approach [5]. The idea: use the state space form to compose the inverse of the covariance matrix, which turns out to be block-tridiagonal.
KF and RTS smoothing: the last two primitives are solved by Kalman filtering and RTS smoothing. Predictions are computed by primitive 4 and then by propagating through the likelihood.
Comments:
- Derivatives of the computational primitives, required for learning, are computed in a similar way
- SpInGP involves computations with block-tridiagonal matrices; these computations are similar to KF and RTS smoothing (see the appendix of [1])
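The block-tridiagonal structure exploited by SpInGP can be seen from a small dense sketch of the joint state prior: with G block bidiagonal (identity blocks on the diagonal, −A_i below) and block-diagonal process noise covariance, the precision Gᵀ blkdiag(P∞, Q_1, …)⁻¹ G is block-tridiagonal. The assembly below is dense and for illustration only; an actual implementation works directly with the sparse blocks.

```python
import numpy as np

def blocktridiag_precision(A_list, Q_list, Pinf):
    """Joint prior precision over the stacked states f_1..f_N:
    G^T blkdiag(Pinf, Q_1, ..., Q_{N-1})^{-1} G, with G block bidiagonal
    (I on the diagonal, -A_i on the subdiagonal). Dense, for illustration."""
    d = Pinf.shape[0]
    N = len(A_list) + 1
    G = np.eye(N * d)
    for i, A in enumerate(A_list):                       # -A_i on the subdiagonal block
        G[(i + 1) * d:(i + 2) * d, i * d:(i + 1) * d] = -A
    Qinv = np.zeros((N * d, N * d))
    Qinv[:d, :d] = np.linalg.inv(Pinf)
    for i, Q in enumerate(Q_list):
        Qinv[(i + 1) * d:(i + 2) * d, (i + 1) * d:(i + 2) * d] = np.linalg.inv(Q)
    return G.T @ Qinv @ G                                # block-tridiagonal precision
```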

Experiments 2–3
The experiments are designed to emphasize the paper's findings and statements:
1. A robust regression (Student's t likelihood) case study with n = 34,154 observations
2. Numerical effects in non-Gaussian likelihoods

Experiment 4
- A new, interesting data set of commercial airline accident dates scraped from Wikipedia [6]
- Accidents over a time span of 100 years, n = 35,959 days
- We model the accident intensity as a log Gaussian Cox process (Poisson likelihood)
- The GP prior is set up as: k(t, t') = k_Mat.(t, t') + k_per.(t, t') k_Mat.(t, t')
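A sketch of this covariance composition (a Matérn trend plus a quasi-periodic component formed by a periodic kernel damped by a Matérn kernel) is given below; the hyperparameter names and values in `theta` are placeholders, not the ones learned in the paper.

```python
import numpy as np

def k_matern32(r, magn2, ell):
    s = np.sqrt(3.0) * np.abs(r) / ell
    return magn2 * (1.0 + s) * np.exp(-s)

def k_periodic(r, magn2, ell, period):
    return magn2 * np.exp(-2.0 * np.sin(np.pi * np.abs(r) / period) ** 2 / ell ** 2)

def k_accidents(t1, t2, theta):
    """Prior of the form k = k_Mat. + k_per. * k_Mat.; hyperparameters are placeholders."""
    r = t1[:, None] - t2[None, :]
    trend = k_matern32(r, theta["trend_magn2"], theta["trend_ell"])
    quasi = (k_periodic(r, theta["per_magn2"], theta["per_ell"], theta["period"])
             * k_matern32(r, 1.0, theta["decay_ell"]))
    return trend + quasi
```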

Conclusions
- This paper brings together research done in state space GPs and non-Gaussian approximate inference
- We improve stability and provide additional speed-up by fast computation of the state space model parameters
- We provide unifying code for all approaches in the GPML toolbox v4.2 [7]
Visit our poster: #151

References
[1] H. Nickisch, A. Solin, and A. Grigorievskiy (2018). State Space Gaussian Processes with Non-Gaussian Likelihood. In ICML.
[2] C. E. Rasmussen and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.
[3] J. Hartikainen and S. Särkkä (2010). Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In MLSP.
[4] A. G. Wilson and H. Nickisch (2015). Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP). In ICML.
[5] A. Grigorievskiy, N. Lawrence, and S. Särkkä (2017). Parallelizable Sparse Inverse Formulation Gaussian Processes (SpInGP). In MLSP.
[6] Wikipedia (2018). URL https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_commercial_aircraft
[7] C. E. Rasmussen and H. Nickisch (2010). Gaussian Processes for Machine Learning (GPML) Toolbox. In JMLR.