ICML 2015
Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes
Machine Learning Research Group and Oxford-Man Institute, University of Oxford
July 8, 2015
Point Processes

Roughly speaking, a point process is a collection of points in a domain whose number and positions are random; the Poisson process is the canonical example. The likelihood of a draw $D = \{s_i\}_{i \in [1,n]}$ from a Poisson process with intensity function $\lambda : S \to \mathbb{R}_+$ is

$L(\lambda; D) = \exp\left(-\int_S \lambda(s)\,ds\right) \prod_{i=1}^n \lambda(s_i), \qquad S \subseteq \mathbb{R}^d.$
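As a quick numerical illustration of this likelihood, here is a minimal Python sketch; the toy intensity and domain below are made up for the example and are not from the talk.

```python
import numpy as np

def poisson_process_log_likelihood(lambda_at_points, integral_over_S):
    """Log-likelihood of an inhomogeneous Poisson process draw.

    lambda_at_points: intensity values lambda(s_i) at the observed points.
    integral_over_S:  value of the integral of lambda over the domain S.
    """
    return -integral_over_S + np.sum(np.log(lambda_at_points))

# Toy example: lambda(s) = 2 + sin(s) on S = [0, 2*pi], with three observed points.
observed = np.array([0.5, 2.0, 4.5])
lam_vals = 2.0 + np.sin(observed)
integral = 2.0 * 2 * np.pi          # integral of (2 + sin s) over [0, 2*pi]
print(poisson_process_log_likelihood(lam_vals, integral))
```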
Existing Bayesian Approaches

GP-based methods:
- Approximate the GP as piecewise constant on a grid. Easy to implement, with complexity in O(n log n) for uniform grids, but there is a trade-off between accuracy, speed, and numerical stability.
- Introduce latent/thinning points to remove the integral from the likelihood. Inference is exact, but time complexity is cubic and memory requirement quadratic in the total number of points.

Alternative Bayesian approach:
- Infer the normalized intensity function $\lambda(s) / \int_S \lambda(s)\,ds$ (which can be regarded as a pdf) using Dirichlet process mixtures of Betas. Exact inference and linear complexity, but harder to implement, does not generalize to d > 2, and makes it harder to express prior knowledge such as periodicity and smoothness.
Our Contribution

We show that fully tractable, exact MCMC inference can be achieved without introducing thinning points. Our approach has linear time complexity and linear memory requirement in the number of data points. Our method is easy to implement and is empirically shown to be more accurate and orders of magnitude faster than competing models.
A New Perspective

Paradigm shift: there is no need to put a functional prior on the intensity function. The likelihood depends solely on $(\lambda(s_1), \dots, \lambda(s_n), \int_S \lambda(s)\,ds)$, that is, on n + 1 variables ($\int_S \lambda(s)\,ds$ is simply another variable). When a functional prior is put on $\lambda$, the only piece of information that contributes to the learning procedure is the implied distribution over these n + 1 variables. So why not put a finite-dimensional prior on them directly?

Link with Cox processes: a Cox process is a point process that, conditionally on some stochastic process, is a Poisson process.

Question: for a given (n + 1)-dimensional probability distribution $\pi$ with support in $\mathbb{R}_+^{n+1}$, is there always a positive-valued stochastic process $\lambda$ indexed on S such that $(\lambda(s_1), \dots, \lambda(s_n), \int_S \lambda(s)\,ds) \sim \pi$?

Answer: yes. In fact, we can construct an infinite number of such stochastic processes that have almost surely $C^\infty$ paths.
Coupling the Integral with Function Values

How do we link $\int_S \lambda(s)\,ds$ and the $\lambda(s_i)$ without a functional prior? In the most general setting, there is no need to: the value of the integral of a Lebesgue-measurable function does not depend on the values of the function at a finite number of points. In the particular case where we would like to express some smoothness prior knowledge, this can be achieved by postulating an appropriate copula between $\int_S \lambda(s)\,ds$ and the $\lambda(s_i)$.
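For illustration only, here is a minimal sketch (not the construction used in this work) of how two such quantities can be made dependent through a Gaussian copula without ever defining $\lambda$ on the whole domain; the Gamma and log-normal marginals and the correlation value are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative marginals (assumptions, for the sketch only):
#   integral of lambda over S  ~ Gamma(shape=3, scale=2)
#   lambda(s_i)                ~ LogNormal(mu=0, sigma=0.5)
rho = 0.6                                   # copula correlation encoding the coupling
cov = np.array([[1.0, rho], [rho, 1.0]])

# Gaussian copula: correlated normals -> uniforms -> desired marginals.
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10_000)
u = stats.norm.cdf(z)
integral_samples = stats.gamma.ppf(u[:, 0], a=3.0, scale=2.0)
lambda_si_samples = stats.lognorm.ppf(u[:, 1], s=0.5)

# The two quantities are now dependent although only a finite-dimensional
# object (a 2-d joint) has been specified -- no functional prior on lambda.
print(np.corrcoef(integral_samples, lambda_si_samples)[0, 1])
```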
Scaling Up Inference

We introduce $k \ll n$ support (inducing) points $\{\tilde{s}_j\}_{j \in [1,k]}$ that we will select to yield an optimal coverage of the domain in a sense we will discuss later. We construct $p(\lambda(s_1), \dots, \lambda(s_n), \lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), \int_S \lambda(s)\,ds)$ using Gaussian processes. We derive $p(\lambda(s_1), \dots, \lambda(s_n), \lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), \int_S \lambda(s)\,ds, D)$. We marginalize analytically to obtain $p(\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), D)$. Exact MCMC is then performed on $p(\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), D)$, from which we deduce predictive values $\mathbb{E}(\lambda(s_i) \mid D)$ and $\mathrm{Var}(\lambda(s_i) \mid D)$ at data points using the laws of total expectation and total variance. Time complexity and memory requirement are both linear in n.
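The last step can be made concrete with a small sketch: assuming MCMC samples of log λ at the inducing points are available and log λ follows a centred GP with a squared-exponential kernel (an illustrative choice, not prescribed here), the predictive mean and variance of log λ(s_i) follow from the laws of total expectation and total variance.

```python
import numpy as np

def rbf_kernel(x, y, variance=1.0, lengthscale=1.0):
    """Squared-exponential covariance (illustrative choice of kernel)."""
    d = x[:, None] - y[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def predictive_log_intensity(data_points, inducing_points, mcmc_samples,
                             variance=1.0, lengthscale=1.0):
    """Combine MCMC samples of log lambda at the inducing points into
    predictive mean/variance of log lambda at the data points, using the
    laws of total expectation and total variance."""
    K_uu = rbf_kernel(inducing_points, inducing_points, variance, lengthscale)
    K_su = rbf_kernel(data_points, inducing_points, variance, lengthscale)
    L = np.linalg.cholesky(K_uu + 1e-9 * np.eye(len(inducing_points)))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_su.T)).T      # K_su K_uu^{-1}

    cond_means = mcmc_samples @ A.T                              # one row per MCMC sample
    cond_var = variance - np.sum(A * K_su, axis=1)               # same for every sample

    post_mean = cond_means.mean(axis=0)                          # E[ E[. | inducing] | D ]
    post_var = cond_var + cond_means.var(axis=0)                 # E[Var | D] + Var[E | D]
    return post_mean, post_var

# Toy usage with hypothetical data, inducing points, and stand-in MCMC output.
s = np.linspace(0.0, 10.0, 50)            # data point locations
s_tilde = np.array([1.0, 5.0, 9.0])       # inducing point locations
samples = np.random.randn(200, 3)         # stand-in for MCMC samples of log lambda(s_tilde)
mean, var = predictive_log_intensity(s, s_tilde, samples)
```

The per-data-point cost is O(k) once the k x k Cholesky factor is available, which is what keeps the overall procedure linear in n.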
Our Finite-Dimensional Prior

Let $\lambda^*$ be such that $\log \lambda^*$ is a centred Gaussian process. Our prior satisfies the following conditions:

1. $(\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)) \sim (\lambda^*(\tilde{s}_1), \dots, \lambda^*(\tilde{s}_k))$
2. $\forall i \neq j,\ \lambda(s_i) \perp \lambda(s_j) \mid \{\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)\}$
3. $\forall i,\ \lambda(s_i) \perp \int_S \lambda(s)\,ds \mid \{\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)\}$
4. $\forall i,\ \lambda(s_i) \mid \{\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)\} \sim \lambda^*(s_i) \mid \{\lambda^*(\tilde{s}_1), \dots, \lambda^*(\tilde{s}_k)\}$
5. $\int_S \lambda(s)\,ds \mid \{\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)\} \sim \mathrm{Gamma}(\alpha, \beta)$
6. $\{\mathbb{E}, \mathrm{Var}\}\left(\int_S \lambda(s)\,ds \mid \{\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k)\}\right) = \{\mathbb{E}, \mathrm{Var}\}\left(\int_S \lambda^*(s)\,ds \mid \{\lambda^*(\tilde{s}_1), \dots, \lambda^*(\tilde{s}_k)\}\right)$

Prior smoothness assumptions are expressed through condition 1 and the covariance function of $\log \lambda^*$. The link between $\int_S \lambda(s)\,ds$ and the $\lambda(s_i)$ is provided by conditions 1, 2, 3, and 6. Condition 5 is primarily chosen for conjugacy. $p(\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), D)$ is then easily derived using moment generating functions (MGFs), and the MCMC steps are straightforward.
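To see how these conditions make the joint tractable, here is a minimal sketch of the unnormalised $\log p(\lambda(\tilde{s}_1), \dots, \lambda(\tilde{s}_k), D)$: conditions 2 and 3 factorise the conditional expectation of the likelihood, condition 5 gives $\mathbb{E}[\exp(-\int_S \lambda(s)\,ds) \mid \lambda(\tilde{s})]$ via the Gamma MGF, and condition 4 gives each $\mathbb{E}[\lambda(s_i) \mid \lambda(\tilde{s})]$ as a log-normal mean. The shape/rate Gamma parameterization and the precomputed conditional moments are assumptions of this sketch; it illustrates one way the MGF enters, not necessarily the exact derivation used in the paper.

```python
import numpy as np

def log_joint_inducing(log_lam_tilde, chol_K_uu, cond_means, cond_vars, alpha, beta):
    """Unnormalised log p(lambda(s~_1), ..., lambda(s~_k), D) under conditions 1-5.

    log_lam_tilde : (k,) log intensity values at the inducing points.
    chol_K_uu     : (k, k) Cholesky factor of the prior covariance of log lambda(s~).
    cond_means    : (n,) GP conditional means of log lambda(s_i) given the inducing values.
    cond_vars     : (n,) GP conditional variances of log lambda(s_i) given the inducing values.
    alpha, beta   : shape and rate of the conditional Gamma on the integral (condition 5),
                    matched in practice to the GP moments via condition 6.
    """
    k = len(log_lam_tilde)
    # Condition 1: centred multivariate Gaussian prior on the log of the inducing values.
    w = np.linalg.solve(chol_K_uu, log_lam_tilde)
    log_prior = -0.5 * w @ w - np.sum(np.log(np.diag(chol_K_uu))) - 0.5 * k * np.log(2 * np.pi)
    # Condition 5: E[exp(-integral) | inducing values] is the Gamma MGF evaluated at -1.
    log_integral_term = alpha * (np.log(beta) - np.log(beta + 1.0))
    # Conditions 2-4: the likelihood factorises, and each E[lambda(s_i) | inducing values]
    # is a log-normal mean exp(m_i + v_i / 2).
    log_data_term = np.sum(cond_means + 0.5 * cond_vars)
    return log_prior + log_integral_term + log_data_term

# Toy call with hypothetical values for k = 2 inducing points and n = 3 data points.
L = np.linalg.cholesky(np.array([[1.0, 0.3], [0.3, 1.0]]))
print(log_joint_inducing(np.array([0.1, -0.2]), L,
                         np.array([0.0, 0.1, -0.1]), np.array([0.4, 0.3, 0.5]),
                         alpha=2.0, beta=0.5))
```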
Inducing Points Selection: Intuition

Intuition: by the laws of total expectation and total variance,

$\mathbb{E}(\log \lambda(s_i) \mid D) = \mathbb{E}\left( \mathbb{E}\left(\log \lambda(s_i) \mid \{\log \lambda(\tilde{s}_j)\}_{j=1}^k, D\right) \mid D \right),$

$\mathrm{Var}(\log \lambda(s_i) \mid D) = \mathbb{E}\left( \mathrm{Var}\left(\log \lambda(s_i) \mid \{\log \lambda(\tilde{s}_j)\}_{j=1}^k, D\right) \mid D \right) + \mathrm{Var}\left( \mathbb{E}\left(\log \lambda(s_i) \mid \{\log \lambda(\tilde{s}_j)\}_{j=1}^k, D\right) \mid D \right).$

We select the inducing points $\tilde{D} = \{\tilde{s}_j\}_{j \in [1,k]}$ sequentially so as to maximize the expected total variance reduction

$U(\tilde{D}) = \mathbb{E}_\theta\left( \sum_{i=1}^n \mathrm{Var}(\log \lambda(s_i) \mid \theta) - \mathrm{Var}\left(\log \lambda(s_i) \mid \{\lambda(\tilde{s}_j)\}, \theta\right) \right),$

where the expectation is taken with respect to the prior over the hyper-parameters.
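Under a centred GP on log λ, both variances in this utility are available in closed form for a given θ. Below is a minimal sketch of the inner sum for one hyper-parameter setting; the squared-exponential kernel and its parameterization are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def sq_exp_kernel(x, y, variance=1.0, lengthscale=1.0):
    """Illustrative squared-exponential covariance for log lambda."""
    d = x[:, None] - y[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def variance_reduction(data_points, inducing_points, variance=1.0, lengthscale=1.0):
    """Sum over data points of
    Var(log lambda(s_i) | theta) - Var(log lambda(s_i) | {log lambda(s~_j)}, theta),
    i.e. the summand of U(D~) for one hyper-parameter setting theta."""
    K_uu = sq_exp_kernel(inducing_points, inducing_points, variance, lengthscale)
    K_su = sq_exp_kernel(data_points, inducing_points, variance, lengthscale)
    L = np.linalg.cholesky(K_uu + 1e-9 * np.eye(len(inducing_points)))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_su.T)).T          # K_su K_uu^{-1}
    # Conditioning on the inducing values lowers the prior variance k(s_i, s_i)
    # by k_su K_uu^{-1} k_us; the utility sums exactly this explained part.
    return float(np.sum(A * K_su))

# Toy check: adding inducing points can only increase the utility.
s = np.linspace(0.0, 10.0, 100)
print(variance_reduction(s, np.array([5.0])),
      variance_reduction(s, np.array([2.0, 5.0, 8.0])))
```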
Inducing Points Selection: Algorithm

Inputs: $0 < \alpha \ll 1$, N, $p(\theta)$
Output: $u_f$, $\tilde{D}$

k = 0; $u_0 = 0$; $\tilde{D} = \emptyset$; e = 1;
Sample $(\theta_i)_{i=1}^N$ from $p(\theta)$;
while $e > \alpha$ do
    k = k + 1;
    $\tilde{s}_k = \operatorname{argmax}_{s \in S} \tilde{U}(\tilde{D} \cup \{s\})$;
    $u_k = \tilde{U}(\tilde{D} \cup \{\tilde{s}_k\})$;
    $\tilde{D} = \tilde{D} \cup \{\tilde{s}_k\}$;
    $e = (u_k - u_{k-1}) / u_k$;
end while

Algorithm: selection of inducing points, where $\tilde{U}$ denotes the Monte Carlo estimate of U over the N hyper-parameter samples.
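A compact Python sketch of this greedy loop, approximating $\tilde{U}$ by a Monte Carlo average of the closed-form variance reduction over the hyper-parameter samples. The finite candidate grid (standing in for the argmax over all of S), the kernel, and the hyper-parameter prior are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(x, y, variance, lengthscale):
    """Illustrative squared-exponential covariance for log lambda."""
    d = np.asarray(x)[:, None] - np.asarray(y)[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def variance_reduction(data_points, inducing_points, variance, lengthscale):
    """Closed-form total variance reduction at the data points for one theta."""
    K_uu = sq_exp_kernel(inducing_points, inducing_points, variance, lengthscale)
    K_su = sq_exp_kernel(data_points, inducing_points, variance, lengthscale)
    L = np.linalg.cholesky(K_uu + 1e-9 * np.eye(len(inducing_points)))
    A = np.linalg.solve(L.T, np.linalg.solve(L, K_su.T)).T
    return float(np.sum(A * K_su))

def select_inducing_points(data_points, candidates, theta_samples, alpha=1e-2):
    """Greedy selection of inducing points maximising the Monte Carlo utility U~."""
    def U_tilde(D_tilde):
        return np.mean([variance_reduction(data_points, np.asarray(D_tilde), v, l)
                        for v, l in theta_samples])
    remaining = list(candidates)
    D_tilde, u_prev = [], 0.0
    while remaining:
        # argmax over the candidate grid of U~(D_tilde union {s}).
        scores = [U_tilde(D_tilde + [s]) for s in remaining]
        best = int(np.argmax(scores))
        u_new = scores[best]
        D_tilde.append(remaining.pop(best))
        if u_new > 0 and (u_new - u_prev) / u_new <= alpha:   # relative improvement e
            break
        u_prev = u_new
    return np.array(D_tilde), u_new

# Hypothetical usage: events on [0, 10], a candidate grid, and N = 10 hyper-parameter
# samples (variance, lengthscale) drawn from an assumed prior p(theta).
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 10.0, size=200)
candidates = np.linspace(0.0, 10.0, 21)
thetas = [(1.0, rng.uniform(0.5, 3.0)) for _ in range(10)]
points, utility = select_inducing_points(data, candidates, thetas)
print(points)
```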
Inducing Points Selection: Convergence

Figure: Convergence of the utility during inducing points selection.
Benchmarking on 1D synthetic dataset (1)

Figure: Posterior mean intensity comparison on a synthetic function (our method vs. the true intensity, SGCP, RMP full, RMP 1, and DPMB).
Benchmarking on 1D synthetic dataset (2) MAE RMSE Time (s) ESS SGCP 0.31 0.37 257.72 ± 16.29 6 RMP 1 0.32 0.38 110.19 ± 7.37 23 RMP full 0.25 0.31 139.64 ± 5.24 6 DPMB 0.23 0.32 23.27 ± 0.94 47 Us 0.19 0.27 4.35 ± 0.12 38 Table : Accuracy, speed and effective sample size comparison on the synthetic dataset. Time and ESS are reported per 100 MCMC samples.
A Standard 1D Real-Life Dataset

Figure: Posterior mean intensity ± 2σ on the coal-mine dataset. Red arrows indicate inducing points, labelled in the order they were selected; blue dots are disaster occurrences.
A Standard 2D Real-Life Dataset

Figure: Contour plot of the posterior mean intensity on the bramble cane dataset. Blue dots are spatial locations, red dots are inducing points.
A Big 1D Real-Life Dataset

Figure: Posterior mean intensity and selected inducing points on a Twitter dataset (approx. 190k tweets).
Summary

In summary:
- There is no need to postulate a functional prior to perform tractable, exact Bayesian inference on Poisson processes using GPs.
- Flexible finite-dimensional priors can be constructed that yield accurate and computationally efficient inference methods.

Thank you!