ICML: Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts


ICML 2015: Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes. Machine Learning Research Group and Oxford-Man Institute, University of Oxford. July 8, 2015.

Point Processes

Roughly speaking, a point process is a collection of points on a domain whose number and positions are random; the Poisson process is the canonical example. The likelihood of a draw D = {s_i}_{i∈[1,n]} from a Poisson process with intensity function λ : S → R_+, S ⊆ R^d, is

L(λ; D) = exp(−∫_S λ(s) ds) ∏_{i=1}^n λ(s_i).
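
As a concrete illustration (not from the slides), the log-likelihood can be evaluated numerically once the integral is approximated, e.g. by trapezoidal quadrature in one dimension; the intensity and event locations below are hypothetical.

```python
# Minimal sketch (assumptions: 1-D domain, trapezoidal quadrature, made-up
# intensity and events): log L(lambda; D) = -int_S lambda ds + sum_i log lambda(s_i).
import numpy as np

def poisson_process_loglik(intensity, events, lo, hi, n_quad=1000):
    grid = np.linspace(lo, hi, n_quad)
    vals = intensity(grid)
    integral = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))  # trapezoid rule
    return -integral + np.sum(np.log(intensity(events)))

events = np.array([1.3, 2.7, 4.1, 4.2, 7.9])                # hypothetical draw D
loglik = poisson_process_loglik(lambda s: 2.0 + np.sin(s), events, 0.0, 10.0)
```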

Existing Bayesian Approaches

GP-based methods:
- Approximate the GP as piecewise constant on a grid. Easy to implement, with O(n log n) complexity for uniform grids, but there is a trade-off between accuracy, speed and numerical stability.
- Introduce latent/thinning points to get rid of the integral in the likelihood. Inference is exact, but time complexity is cubic (and memory requirement quadratic) in the total number of points.

Alternative Bayesian approach:
- Infer the normalized intensity function λ(s)/∫_S λ(s)ds (which can be regarded as a pdf) using Dirichlet process mixtures of Betas. Inference is exact with linear complexity, but the method is harder to implement, does not generalize to d > 2, and makes it harder to express prior knowledge such as periodicity and smoothness.

Our Contribution

We show that fully tractable, exact MCMC inference can be achieved without introducing thinning points. Our approach has linear time complexity and linear memory requirement. Our method is easy to implement and is empirically shown to be more accurate and orders of magnitude faster than competing models.

A New Perspective

Paradigm shift: there is no need to put a functional prior on the intensity function! The likelihood depends solely on (λ(s_1), ..., λ(s_n), ∫_S λ(s)ds), that is, on n + 1 variables (∫_S λ(s)ds is simply another variable). When a functional prior is put on λ, the only piece of information that contributes to the learning procedure is the implied distribution over these n + 1 variables. So why not put a finite-dimensional prior on them directly?

Link with Cox processes: a Cox process is a point process that, conditionally on some stochastic process, is a Poisson process.

Question: for a given (n + 1)-dimensional probability distribution π with support in R_+^{n+1}, is there always a positive-valued stochastic process λ indexed on S such that (λ(s_1), ..., λ(s_n), ∫_S λ(s)ds) ∼ π?

Answer: yes! In fact we can construct an infinite number of such stochastic processes that have almost surely C^∞ paths.

Coupling the Integral with Function Values

How do we link ∫_S λ(s)ds and the λ(s_i) without a functional prior? In the most general setting, there is no need to: the value of the integral of a Lebesgue-measurable function does not depend on the values of the function at a finite number of points. In the particular case where we would like to express some smoothness prior knowledge, this can be achieved by postulating an appropriate copula between ∫_S λ(s)ds and the λ(s_i).
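
To make the copula idea concrete, here is a minimal sketch of my own (not the paper's construction): a Gaussian copula with correlation ρ coupling an integral with a Gamma marginal to a function value with a log-normal marginal; all parameter values are made up.

```python
# Hedged illustration: Gaussian copula linking the integral (Gamma marginal)
# to one function value (log-normal marginal). Parameters are arbitrary.
import numpy as np
from scipy.stats import norm, gamma, lognorm

rng = np.random.default_rng(0)
rho, alpha, beta, sigma = 0.7, 2.0, 0.5, 1.0

z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=5000)
u = norm.cdf(z)                                                    # correlated uniforms
integral_samples = gamma.ppf(u[:, 0], a=alpha, scale=1.0 / beta)   # Gamma(alpha, beta)
value_samples = lognorm.ppf(u[:, 1], s=sigma)                      # log-normal
```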

Scaling Up Inference

We introduce k ≪ n support (inducing) points {s'_j}_{j∈[1,k]} that we select to yield an optimal coverage of the domain, in a sense discussed later. The pipeline is:
- Construct p(λ(s_1), ..., λ(s_n), λ(s'_1), ..., λ(s'_k), ∫_S λ(s)ds) using Gaussian processes.
- Derive p(λ(s_1), ..., λ(s_n), λ(s'_1), ..., λ(s'_k), ∫_S λ(s)ds, D).
- Marginalize analytically to obtain p(λ(s'_1), ..., λ(s'_k), D).
- Perform exact MCMC on p(λ(s'_1), ..., λ(s'_k), D), from which we deduce predictive values E(λ(s_i) | D) and Var(λ(s_i) | D) at the data points using the laws of total expectation and variance (see the sketch below).
Time complexity and memory requirement are both linear in n.
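
As a minimal sketch (my own, assuming the per-MCMC-sample conditional means and variances of λ(s_i) given the inducing values have already been computed from the GP conditional), the predictive moments follow directly from the total-law decompositions:

```python
# Hedged sketch: combine per-sample conditional moments of lambda(s_i) given
# the inducing values into predictive moments via the laws of total
# expectation and total variance. `cond_means`, `cond_vars` are assumed inputs.
import numpy as np

def predictive_moments(cond_means, cond_vars):
    post_mean = np.mean(cond_means)                       # E[ E[. | inducing] ]
    post_var = np.mean(cond_vars) + np.var(cond_means)    # E[Var] + Var[E]
    return post_mean, post_var
```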

Our Finite-Dimensional Prior

Let λ* be a log-centred Gaussian process. Our prior satisfies the following conditions:
1. (λ(s'_1), ..., λ(s'_k)) ∼ (λ*(s'_1), ..., λ*(s'_k))
2. ∀ i ≠ j, λ(s_i) ⊥ λ(s_j) | {λ(s'_1), ..., λ(s'_k)}
3. ∀ i, λ(s_i) ⊥ ∫_S λ(s)ds | {λ(s'_1), ..., λ(s'_k)}
4. ∀ i, λ(s_i) | {λ(s'_1), ..., λ(s'_k)} ∼ λ*(s_i) | {λ*(s'_1), ..., λ*(s'_k)}
5. ∫_S λ(s)ds | {λ(s'_1), ..., λ(s'_k)} ∼ Gamma(α, β)
6. {E, Var}(∫_S λ(s)ds | {λ(s'_1), ..., λ(s'_k)}) = {E, Var}(∫_S λ*(s)ds | {λ*(s'_1), ..., λ*(s'_k)})

Prior smoothness assumptions are expressed through condition 1 and the covariance function of log λ*. The link between ∫_S λ(s)ds and the λ(s_i) is provided by conditions 1, 2, 3 and 6. Condition 5 is primarily chosen for conjugacy. p(λ(s'_1), ..., λ(s'_k), D) is then easily derived using MGFs, and MCMC steps are straightforward.
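
Condition 6 pins down (α, β) by matching the Gamma's mean and variance to the moments implied by the reference log-GP; a minimal moment-matching sketch of my own (assuming a shape-rate parameterisation, whereas the slides obtain the moments via MGFs) is:

```python
# Hedged sketch: moment-match Gamma(alpha, beta) (shape-rate) to a given
# conditional mean m and variance v of the integral.
# Gamma(alpha, beta): mean = alpha / beta, variance = alpha / beta**2.
def gamma_from_moments(m, v):
    beta = m / v        # rate
    alpha = m * beta    # shape, equals m**2 / v
    return alpha, beta
```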

Inducing Points Selection: Intuition

Intuition (laws of total expectation and variance):

E(log λ(s_i) | D) = E( E(log λ(s_i) | {log λ(s'_j)}_{j=1}^k, D) | D ),
Var(log λ(s_i) | D) = E( Var(log λ(s_i) | {log λ(s'_j)}_{j=1}^k, D) | D ) + Var( E(log λ(s_i) | {log λ(s'_j)}_{j=1}^k, D) | D ).

We select the inducing points D' = {s'_j}_{j∈[1,k]} sequentially so as to maximize the expected total variance reduction

U(D') = E_θ( Σ_{i=1}^n [ Var(log λ(s_i) | θ) − Var(log λ(s_i) | {λ(s'_j)}, θ) ] ),

where the expectation is taken with respect to the prior over the hyper-parameters θ.
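
Under a GP prior on log λ, the variance reduction at s_i from conditioning on a candidate inducing set is the usual GP posterior-variance formula. Here is a minimal Monte-Carlo sketch of my own (an RBF kernel and generic hyper-parameter draws stand in for the paper's choices):

```python
# Hedged sketch of U(D'): average (over hyper-parameter draws) of the total
# GP variance reduction at the data points from conditioning on the inducing
# set. The RBF kernel and hyper-prior samples are stand-in assumptions.
import numpy as np

def rbf(a, b, lengthscale, variance):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def utility(data_pts, inducing_pts, theta_samples, jitter=1e-8):
    total = 0.0
    for lengthscale, variance in theta_samples:      # draws from p(theta)
        K_nm = rbf(data_pts, inducing_pts, lengthscale, variance)
        K_mm = rbf(inducing_pts, inducing_pts, lengthscale, variance)
        K_mm += jitter * np.eye(len(inducing_pts))
        # prior variance minus conditional variance, summed over data points
        total += np.einsum('ij,ji->i', K_nm, np.linalg.solve(K_mm, K_nm.T)).sum()
    return total / len(theta_samples)
```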

Inducing Points Selection: Algorithm

Inputs: 0 < α ≤ 1, N, p(θ)
Output: u_f, D'

k = 0; u_0 = 0; D' = ∅; e = 1
Sample (θ_i)_{i=1}^N from p(θ)
while e > α do
    k = k + 1
    s'_k = argmax_{s ∈ S} Ũ(D' ∪ {s})
    u_k = Ũ(D' ∪ {s'_k})
    D' = D' ∪ {s'_k}
    e = (u_k − u_{k−1}) / u_k
end while

Algorithm: selection of inducing points. Ũ denotes the Monte Carlo estimate of U based on the sampled θ_i.
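
Continuing the hypothetical `utility` sketch above, a greedy loop of my own making (assuming the maximisation over S is replaced by a search over a finite candidate grid) could look like:

```python
# Hedged sketch of the greedy selection loop; `utility` is the Monte-Carlo
# estimate sketched earlier, `candidates` is an assumed finite grid over S.
def select_inducing_points(data_pts, candidates, theta_samples, alpha=0.05):
    chosen, u_prev = [], 0.0
    for _ in range(len(candidates)):                 # hard cap on iterations
        scores = [utility(data_pts, np.array(chosen + [c]), theta_samples)
                  for c in candidates]
        best = int(np.argmax(scores))
        chosen.append(candidates[best])
        u_curr = scores[best]
        if (u_curr - u_prev) / u_curr <= alpha:      # relative improvement test
            break
        u_prev = u_curr
    return np.array(chosen), u_curr
```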

Inducing Points Selection: Convergence

Figure: convergence of the utility during inducing points selection.

Benchmarking on 1D synthetic dataset (1)

Figure: posterior mean intensity comparison on a synthetic intensity function. Curves shown for Us, SGCP, RMP full, RMP 1 and DPMB against the real intensity, with the selected inducing points marked.

Benchmarking on 1D synthetic dataset (2)

Method      MAE    RMSE   Time (s)          ESS
SGCP        0.31   0.37   257.72 ± 16.29      6
RMP 1       0.32   0.38   110.19 ±  7.37     23
RMP full    0.25   0.31   139.64 ±  5.24      6
DPMB        0.23   0.32    23.27 ±  0.94     47
Us          0.19   0.27     4.35 ±  0.12     38

Table: accuracy, speed and effective sample size comparison on the synthetic dataset. Time and ESS are reported per 100 MCMC samples.

A Standard 1D Real-Life Dataset

Figure: posterior mean intensity ± 2σ on the coal-mine dataset. Red arrows indicate inducing points, labelled in the order they were selected; blue dots are disaster occurrences.

A Standard 2D Real-Life Dataset

Figure: contour plot of the posterior mean intensity on the bramble cane dataset. Blue dots are spatial locations; red dots are inducing points.

A Big 1D Real-Life Dataset

Figure: posterior mean intensity and selected inducing points on a Twitter dataset (approx. 190k tweets).

Summary

In summary:
- There is no need to postulate a functional prior to perform tractable, exact Bayesian inference on Poisson processes using GPs.
- Flexible finite-dimensional priors can be constructed that yield accurate and computationally efficient inference methods.

Thank you!