
Zig-Zag Monte Carlo
Joris Bierkens, Delft University of Technology
February 7, 2017

Acknowledgements
Collaborators: Andrew Duncan, Paul Fearnhead, Antonietta Mira, Gareth Roberts.
Financial support.

Outline
1. Motivation: Markov Chain Monte Carlo
2. One-dimensional Zig-Zag process
3. Multi-dimensional ZZP
4. Subsampling
5. Doubly intractable likelihood

Bayesian inference
In Bayesian inference we typically deal with a posterior density
π(x) = π(x; y) ∝ L(y | x) π_0(x), x ∈ ℝ^d,
where L(y | x) is the likelihood of the data y given the parameter x ∈ ℝ^d, and π_0 is a prior density for x.
Quantities of interest are e.g.
- the posterior mean ∫ x π(x) dx,
- the posterior variance ∫ x^2 π(x) dx − (∫ x π(x) dx)^2,
- the tail probability ∫ 1_{x ≥ c} π(x) dx.
All of these involve integrals of the form ∫ h(x) π(x) dx.

Evaluating ∫ h(x)π(x) dx
Possible approaches:
1. Explicit (analytic) integration. Rarely possible.
2. Numerical integration. Curse of dimensionality.
3. Monte Carlo. Draw independent samples (X_1, X_2, ...) from π and use the law of large numbers:
∫ h(x) π(x) dx = lim_{K→∞} (1/K) Σ_{k=1}^K h(X_k).
Requires independent samples from π (see the code sketch after this list).
4. Markov Chain Monte Carlo. Construct an ergodic Markov chain (X_1, X_2, ...) with invariant distribution π(x) dx and use Birkhoff's ergodic theorem to obtain the same limit.
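Approach 3 in miniature, as a hedged sketch (the target, observable, and all names below are illustrative, not from the slides): estimate ∫ h(x)π(x) dx by averaging i.i.d. draws, here with π = N(0, 1) and h(x) = x^2, for which the true value is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Plain Monte Carlo: estimate E_pi[h(X)] for pi = N(0, 1), h(x) = x^2.
# By the law of large numbers the estimate tends to the true value 1.
K = 100_000
x = rng.standard_normal(K)
print(np.mean(x**2))   # ~ 1.00, up to O(1/sqrt(K)) error
```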

One-dimensional Zig-Zag process
Dynamics: continuous time; current state (X(t), Θ(t)) ∈ ℝ × {−1, +1}. Move X(t) in direction Θ(t) = ±1 until a switch occurs. The switching intensity is λ(X(t), Θ(t)).
[Figure: a sample piecewise-linear zig-zag trajectory t ↦ X(t) on [0, 100].]

Relation between switching rate and potential
Generator: Lf(x, θ) = θ df/dx(x, θ) + λ(x, θ)(f(x, −θ) − f(x, θ)), x ∈ ℝ, θ ∈ {−1, +1}.
Potential: U(x) = −log π(x).
π is invariant if and only if λ(x, +1) − λ(x, −1) = U′(x) for all x. Equivalently,
λ(x, θ) = γ(x) + max(0, θ U′(x)), with γ(x) ≥ 0.
Example: Gaussian distribution N(0, σ^2)
- Density π(x) ∝ exp(−x^2/(2σ^2))
- Potential U(x) = x^2/(2σ^2)
- Derivative U′(x) = x/σ^2
- Switching rates λ(x, θ) = (θx/σ^2)^+ + γ(x)
(An exact simulation sketch for this example follows below.)
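For the Gaussian example the rate along the flow x + θs is ((θx + s)/σ^2)^+, whose time integral inverts in closed form, so switching times can be drawn exactly, with no discretization error. A minimal sketch assuming σ = 1 and γ ≡ 0 (the function and variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)

def zigzag_gaussian(t_max, x0=0.0, theta0=1):
    """1D Zig-Zag for N(0, 1): U(x) = x^2/2, lambda(x, theta) = (theta*x)^+.
    Along the flow x + theta*s the rate is (theta*x + s)^+; solving
    int_0^tau (a + s)^+ ds = E with a = theta*x and E ~ Exp(1) gives
    tau = -a + sqrt(max(a, 0)^2 + 2E).  Returns the event skeleton."""
    t, x, theta = 0.0, x0, theta0
    skeleton = [(t, x, theta)]
    while t < t_max:
        a = theta * x
        e = rng.exponential()
        tau = -a + np.sqrt(max(a, 0.0) ** 2 + 2.0 * e)
        t += tau
        x += theta * tau          # deterministic motion between events
        theta = -theta            # the direction flips at each event
        skeleton.append((t, x, theta))
    return skeleton

skeleton = zigzag_gaussian(1000.0)
```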

Proof of invariance of π ∝ exp(−U)
Recall Lf(x, θ) = θ ∂f/∂x(x, θ) + λ(x, θ)(f(x, −θ) − f(x, θ)) and λ(x, +1) − λ(x, −1) = U′(x).
Markov semigroup: P(t)f(x, θ) = E_{x,θ} f(X(t), Θ(t)).
π stationary means that
Σ_{θ=±1} ∫ P(t)f(x, θ) π(x) dx = Σ_{θ=±1} ∫ f(x, θ) π(x) dx for all f ∈ D(L), t ≥ 0.
Differentiating gives the equivalent condition
Σ_{θ=±1} ∫ Lf(x, θ) π(x) dx = 0 for all f ∈ D(L).
For the switching part of L:
Σ_{θ=±1} ∫ λ(x, θ)(f(x, −θ) − f(x, θ)) π(x) dx
= ∫ {λ(x, +1)(f(x, −1) − f(x, +1)) + λ(x, −1)(f(x, +1) − f(x, −1))} π(x) dx
= ∫ (f(x, −1) − f(x, +1))(λ(x, +1) − λ(x, −1)) π(x) dx
= ∫ (f(x, −1) − f(x, +1)) U′(x) π(x) dx
= −∫ (f(x, −1) − f(x, +1)) π′(x) dx   (since π′ = −U′ π)
= ∫ (∂f/∂x(x, −1) − ∂f/∂x(x, +1)) π(x) dx   (integration by parts)
= −Σ_{θ=±1} ∫ θ ∂f/∂x(x, θ) π(x) dx,
which cancels the transport part of L, so the condition holds.

Use in Monte Carlo
(X(t), Θ(t))_{t≥0} has invariant distribution proportional to π(x). If ergodic,
lim_{T→∞} (1/T) ∫_0^T h(X(s)) ds = ∫ h(x) π(x) dx.
How to use this in computations: either
- numerically integrate (1/T) ∫_0^T h(X_s) ds for some finite T > 0, or
- define (X_1, X_2, ...) by setting X_k = X(kΔ) for some Δ > 0 and use these as in traditional MCMC.
Both options are sketched in code below.
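Both recipes are easy to carry out from the event skeleton, because the trajectory is linear between events. A sketch continuing the Gaussian example above (same imports and rng; the quadrature resolution n_grid is an arbitrary choice of mine):

```python
def time_average(skeleton, h, n_grid=8):
    """Approximate (1/T) int_0^T h(X_s) ds by trapezoidal quadrature on
    each linear segment of the trajectory (exact for affine h)."""
    total, t_end = 0.0, skeleton[-1][0]
    for (t0, x0, th0), (t1, _, _) in zip(skeleton, skeleton[1:]):
        s = np.linspace(0.0, t1 - t0, n_grid)
        v = h(x0 + th0 * s)
        total += np.sum(0.5 * (v[1:] + v[:-1]) * np.diff(s))
    return total / t_end

def discretize(skeleton, delta):
    """Alternative: read off X_k = X(k * delta).  Linear interpolation is
    exact here because the trajectory is piecewise linear between events."""
    ts = np.array([p[0] for p in skeleton])
    xs = np.array([p[1] for p in skeleton])
    grid = np.arange(0.0, ts[-1], delta)
    return np.interp(grid, ts, xs)

print(time_average(skeleton, lambda x: x**2))   # ~ 1 = variance of N(0, 1)
```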

CLT for the 1D Zig-Zag process [B., Duncan, Limit theorems for the Zig-Zag process, 2016]
X(t) satisfies a Central Limit Theorem (CLT) for observable h if
(1/√T) ∫_0^T [h(X_s) − E_π h(X)] ds ⇒ N(0, σ_h^2).
Example: unimodal potential/density function.
[Figure: zig-zag trajectory decomposed into excursions, with crossing times T_i^+ and excursion durations S_i marked on the t-axis.]
Set Y_i = ∫_{T_{i−1}^+}^{T_i^+} h(X_s) ds. The CLT for the ZZP then follows essentially from the CLT for Σ_{i=1}^{N(t)} Y_i.

CLT for the 1D Zig-Zag process [B., Duncan, Limit theorems for the Zig-Zag process, 2016]
General formula for the asymptotic variance:
σ_h^2 = 2 ∫ (λ(x, +1) + λ(x, −1)) φ′(x)^2 π(x) dx, where L_Langevin φ = h̄ := h − π(h).
Langevin diffusion: σ_h^2 = 2 ∫ φ′(x)^2 π(x) dx.
Cool results:
- Computational efficiency of the ZZP is better than IID sampling for the Gaussian (oscillatory ACF).
- Student-t distribution with ν degrees of freedom: the Langevin diffusion satisfies a CLT for ν > 2, while the Zig-Zag process satisfies a CLT for ν > 1.

Multi-dimensional Zig-Zag process

Multi-dimensional Zig-Zag process
Target π(x) ∝ exp(−U(x)) on ℝ^d. Set of directions θ ∈ {−1, +1}^d. Switching rates
λ_i(x, θ) = (θ_i ∂_i U(x))^+, for i = 1, ..., d.
Cool observation: for a factorized target distribution π(x) = Π_{i=1}^d π_i(x_i) with π_i(y) ∝ exp(−U_i(y)), the switching rates are λ_i(x, θ) = (θ_i U_i′(x_i))^+.
Every component of the Zig-Zag process mixes at O(1). Compare to RWM O(d), MALA O(d^{1/3}), HMC O(d^{1/4}).
(A sketch for the factorized Gaussian case follows below.)
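For a standard Gaussian in ℝ^d the per-coordinate rates invert in closed form exactly as in one dimension, so the d-dimensional process can be simulated by racing the d candidate event times. A minimal sketch for this special case only (same imports and rng as above; by superposition of Poisson processes, the earliest candidate event is the next switch):

```python
def zigzag_gaussian_d(t_max, d):
    """Multi-dimensional Zig-Zag for pi = N(0, I_d), where
    lambda_i(x, theta) = (theta_i * x_i)^+.  Each coordinate's first event
    time is drawn exactly as in the 1D sketch; the earliest one wins and
    only that coordinate's direction flips."""
    t, x, theta = 0.0, np.zeros(d), np.ones(d)
    skeleton = [(t, x.copy(), theta.copy())]
    while t < t_max:
        a = theta * x
        e = rng.exponential(size=d)
        taus = -a + np.sqrt(np.maximum(a, 0.0) ** 2 + 2.0 * e)
        i = int(np.argmin(taus))      # the first coordinate to switch
        t += taus[i]
        x += theta * taus[i]
        theta[i] = -theta[i]
        skeleton.append((t, x.copy(), theta.copy()))
    return skeleton
```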

Sampling the switching time
Rate λ(x) = max(0, dU/dx) along the current direction; Λ(x) ≥ λ(x) a tractable upper bound.
- Draw T with P(T ≥ t) = exp(−∫_0^t Λ(X(s)) ds).
- Accept T with probability λ(X(T)) / Λ(X(T)).
[Figure: dU/dx, the rate λ(x) = max(0, dU/dx), and the bound Λ(x).]
(Sketched in code below.)
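When the integrated rate cannot be inverted in closed form, this recipe is Poisson thinning. A sketch assuming the simplest setting of a constant bound Λ, which is only valid when U′ is bounded; choosing grad_u and lam_bar correctly is the caller's responsibility (same imports and rng as above):

```python
def zigzag_thinning(t_max, grad_u, lam_bar, x0=0.0, theta0=1):
    """1D Zig-Zag by thinning: propose events at constant rate lam_bar and
    accept each with probability lambda(x, theta) / lam_bar, where
    lambda(x, theta) = (theta * grad_u(x))^+.  Requires lam_bar to bound
    the true rate everywhere the trajectory can reach."""
    t, x, theta = 0.0, x0, theta0
    skeleton = [(t, x, theta)]
    while t < t_max:
        tau = rng.exponential(1.0 / lam_bar)
        t += tau
        x += theta * tau                         # move during the wait
        lam = max(0.0, theta * grad_u(x))
        if rng.random() < lam / lam_bar:         # accepted => real switch
            theta = -theta
            skeleton.append((t, x, theta))
    return skeleton

# Example: U(x) = log cosh(x), so U'(x) = tanh(x) and |U'| <= 1,
# hence lam_bar = 1 is a valid global bound.
skel = zigzag_thinning(1000.0, grad_u=np.tanh, lam_bar=1.0)
```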

Subsampling
Write U = (1/2)(U_1 + U_2).
[Figure: dU_1/dx and dU_2/dx, the corresponding rates λ_1(x), λ_2(x), and a bound Λ(x) ≥ max(λ_1(x), λ_2(x)).]
- Draw T with P(T ≥ t) = exp(−∫_0^t Λ(X(s)) ds).
- Draw I from {1, 2} uniformly.
- Accept T with probability λ_I(X(T)) / Λ(X(T)).

Subsampling
Intractable likelihood, big data: U(x) = (1/n) Σ_{i=1}^n U_i(x). If π(x) ∝ Π_{i=1}^n f(y_i | x) π_0(x), take
U_i(x) = −log π_0(x) − n log f(y_i | x).
Theorem. With subsampling, the Zig-Zag process has exp(−U) as invariant density.
Proof: the effective switching rate is
λ(x, θ) = (1/n) Σ_{i=1}^n λ_i(x, θ) = (1/n) Σ_{i=1}^n (θ U_i′(x))^+,
so
λ(x, +1) − λ(x, −1) = (1/n) Σ_{i=1}^n {(U_i′(x))^+ − (−U_i′(x))^+} = (1/n) Σ_{i=1}^n U_i′(x) = U′(x).
(A code sketch follows below.)
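Operationally, subsampling only changes what is evaluated at the thinning step: a single uniformly chosen λ_I replaces λ, and the bound must now dominate every λ_i. A sketch in the same style as the thinning code above (grad_u_i(x, i), standing for U_i′(x), is my interface, not the slides'):

```python
def zigzag_subsampling(t_max, grad_u_i, n, lam_bar, x0=0.0, theta0=1):
    """Zig-Zag with subsampling: at each proposed event draw an index I
    uniformly from {0, ..., n-1} and accept the switch with probability
    (theta * grad_u_i(x, I))^+ / lam_bar.  By the theorem above, this
    leaves exp(-U) invariant, with U = (1/n) * sum_i U_i."""
    t, x, theta = 0.0, x0, theta0
    skeleton = [(t, x, theta)]
    while t < t_max:
        tau = rng.exponential(1.0 / lam_bar)
        t += tau
        x += theta * tau
        i = rng.integers(n)                        # subsample of size one
        lam_i = max(0.0, theta * grad_u_i(x, i))
        if rng.random() < lam_i / lam_bar:
            theta = -theta
            skeleton.append((t, x, theta))
    return skeleton
```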

Subsampling: scaling
- Without subsampling: O(n) computations per O(1) update.
- With naive subsampling: O(1) computations per O(1/n) update.
- Subsampling with control variates: O(1) computations per O(1) update: super-efficient.
The control variates approach depends on posterior contraction and requires finding a point close to the mode: O(n) start-up cost.

Control variates
U(x) = (1/n) Σ_{i=1}^n U_i(x). Let x* denote (a point close to) the mode of the posterior distribution.
- Naive subsampling: λ_i(x, θ) = (θ U_i′(x))^+.
- Control variates: λ_i(x, θ) = (θ {U_i′(x) + U′(x*) − U_i′(x*)})^+.
If x is close to the mode then U_i′(x) − U_i′(x*) is small (under assumptions on U), so each λ_i(x, θ) is close to the ideal switching rate (θU′(x))^+.
(See the sketch below.)
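In code, control variates only change the gradient estimate handed to the thinning step; the reference derivatives at x* are precomputed once, which is the O(n) start-up cost mentioned earlier. A sketch with my names, compatible with the zigzag_subsampling sketch above (which already applies the positive part):

```python
def make_cv_grad(grad_u_i, n, x_star):
    """Precompute U_i'(x*) for all i and U'(x*) = their mean, then return
    the control-variate estimator  U_i'(x) - U_i'(x*) + U'(x*)."""
    g_star = np.array([grad_u_i(x_star, i) for i in range(n)])  # O(n), once
    g_bar = g_star.mean()                                       # U'(x*)
    def cv_grad(x, i):
        return grad_u_i(x, i) - g_star[i] + g_bar
    return cv_grad

# Usage: hand the wrapped estimator to the subsampling sketch above, e.g.
# zigzag_subsampling(1000.0, make_cv_grad(grad_u_i, n, x_star), n, lam_bar)
```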

[Figure slides: experimental results for 100 observations (three slides) and for 10,000 observations (three slides).]

Scaling in number of observations
[Figure: log₂(ESS / epochs) against log₂(number of observations), for Zig-Zag, Zig-Zag with subsampling, Zig-Zag with control variates, and Zig-Zag with a poor computational bound.]

Scaling in number of observations
[Figure: log₂(ESS / second) against log₂(number of observations), for the same four variants.]

Doubly intractable likelihood
In many applications the distribution of interest π has the following form:
π(x; y) = exp(Σ_{i=1}^d x_i s_i(y)) / (Z(y) M(x)) · π_0(x), x ∈ ℝ^d,
where y ∈ {0, 1}^n is a fixed observed realization of the forward model,
p(y | x) = exp(Σ_{i=1}^d x_i s_i(y)) / M(x),
the s_i, i = 1, ..., d, are statistics which characterize the distribution of the forward model, with weights x_1, ..., x_d, and Z(y) is the usual normalization constant.
Computational problem: computing M(x) is O(2^n):
M(x) = Σ_{y ∈ {0,1}^n} exp(Σ_{i=1}^d x_i s_i(y)).

Examples of doubly intractable likelihood
p(y | x) = exp(Σ_{i=1}^d x_i s_i(y)) / M(x), x ∈ ℝ^d, y ∈ {0, 1}^n.
Ising model (physics, image analysis):
- s_1(y) = y^T W y, where W is an interaction matrix;
- s_2(y) = h^T y, where h represents an external magnetic field;
- x_1, x_2 serve as inverse temperatures.
Exponential Random Graph Model:
- random graphs over k vertices, with n := (1/2) k(k−1) possible edges;
- y_1, ..., y_n indicate the presence of an edge;
- s_1(y): number of edges in the random graph;
- s_2(y): e.g. number of triangles in the random graph.

The Zig-Zag process applied to doubly intractable likelihood
For simplicity, say x ∈ ℝ and ignore the prior distribution. Then
π(x; y) = exp(x s(y)) / (Z(y) M(x)), with M(x) = Σ_{z ∈ {0,1}^n} exp(x s(z)), x ∈ ℝ, y ∈ {0, 1}^n,
so that U(x) = −log π(x; y) = −x s(y) + log M(x). For the derivative of U we find
U′(x) = −s(y) + (d/dx) log M(x) = −s(y) + Σ_{z ∈ {0,1}^n} exp(x s(z)) s(z) / M(x) = −s(y) + E_x[s(Y)],
where Y is a realization of the forward model with parameter x.

The Zig-Zag process applied to doubly intractable likelihood
U′(x) = −s(y) + E_x[s(Y)]. Switching rate complexity: O(2^n). For x ∈ ℝ, θ ∈ {−1, +1},
λ(x, θ) = max(θ U′(x), 0) = max(−θ s(y) + θ E_x[s(Y)], 0).
Idea: use an unbiased estimate of E_x[s(Y)]. Crude algorithm for determining the next switch:
1. Determine an upper bound Λ(x) for λ(x, θ).
2. Generate a switching time according to P(T ≥ t) = exp(−∫_0^t Λ(X(r)) dr).
3. Obtain an unbiased estimate Ĝ of (d/dx) U(x).
4. Accept the switch with probability max(0, θĜ)/Λ(X(T)); otherwise repeat.
(A code sketch follows.)
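A sketch of the crude algorithm with a constant bound Λ, reusing rng from the sketches above. Here estimate_grad is a hypothetical stand-in for an unbiased estimator of U′ (e.g. via perfect sampling or Glynn-Rhee debiasing, discussed on the next slide); note that step 4 implicitly needs max(0, θĜ) ≤ Λ, which an unbounded estimator can violate.

```python
def next_switch(x, theta, lam_bound, estimate_grad):
    """Steps 1-4 above with a constant bound lam_bound: propose candidate
    event times at rate lam_bound, form an unbiased estimate g_hat of
    U'(x) at the proposed point, and accept with probability
    max(0, theta * g_hat) / lam_bound; otherwise repeat."""
    t = 0.0
    while True:
        t += rng.exponential(1.0 / lam_bound)
        g_hat = estimate_grad(x + theta * t)   # unbiased estimate of U'
        if rng.random() < max(0.0, theta * g_hat) / lam_bound:
            return t                           # waiting time to next switch
```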

Unbiased estimation of E_x[s(Y)]
Two possible approaches:
1. Perfect sampling / coupling from the past (Propp, Wilson, 1996): use Glauber dynamics in an ingenious way to obtain a sample Y which is distributed exactly according to the forward distribution p(· | x). Disadvantages: not applicable to all discrete models; exponentially slow convergence in cold temperature regimes.
2. Unbiased MCMC sampling (Glynn, Rhee, 2014): introduce an ℕ-valued random variable N and define Δ_i := s(Y_i) − s(Ỹ_i), where (Y_i) and (Ỹ_i) are two realizations of Glauber dynamics, correlated in a specific way. Unbiased estimate:
Ĝ = Σ_{i=0}^N Δ_i / P(N ≥ i).
Disadvantages: no global upper bound on the estimate; the variance may be extremely large.

Conclusions: Zig-Zag process
- We can use piecewise deterministic Markov processes for sampling.
- An unbiased estimate of the log-density gradient results in the correct invariant distribution.
- Significantly better scaling than IID sampling for big data.
- Doubly intractable likelihood: work in progress.

References
B., Roberts, A piecewise deterministic scaling limit of Lifted Metropolis-Hastings in the Curie-Weiss model, to appear in Annals of Applied Probability, 2015, https://arxiv.org/abs/1509.00302
B., Fearnhead, Roberts, The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data, 2016, https://arxiv.org/abs/1607.03188
B., Duncan, Limit theorems for the Zig-Zag process, 2016, https://arxiv.org/abs/1607.08845
B., Fearnhead, Pollock, Roberts, Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo, https://arxiv.org/abs/1611.07873
Thank you!