Notes on pseudo-marginal methods, variational Bayes and ABC


Christian Andersson Naesseth
October 3, 2016

The Pseudo-Marginal Framework

Assume we are interested in sampling from the posterior distribution defined by

    p(θ | y) = p(y | θ) p(θ) / p(y),    (1)

for some parameters θ and observations y. Furthermore, assume that we cannot evaluate p(y | θ) directly, but that we have access to a function ĝ(u; y, θ) (which we can evaluate point-wise) that takes some random numbers (a "seed", if you will) u ∼ r(u | θ) and outputs an unbiased (and non-negative) estimate of p(y | θ). This means

    E_{r(u | θ)}[ĝ(u; y, θ)] = ∫ ĝ(u; y, θ) r(u | θ) du = p(y | θ),

which is the setting of pseudo-marginal methods [Andrieu and Roberts, 2009]. Now we define a joint target distribution over θ and u as follows

    π(θ, u) = ĝ(u; y, θ) r(u | θ) p(θ) / p(y),    (2)

which we can see is proper (integrates to one) and has the posterior as its marginal distribution, i.e.

    π(θ) = ∫ π(θ, u) du = p(y | θ) p(θ) / p(y) = p(θ | y).

This means that if we design (almost) any approximate inference method and apply it to (2), the marginal of that approximation will also be an approximation to the true target distribution. For example, if we design a Markov chain (θ_i, u_i) that has (2) as its stationary distribution, we get an approximation to the posterior (1) by throwing away u_i and keeping θ_i (see the sketch at the end of this section). Another example would be to fit a variational approximation q(θ | λ) q(u | η, θ) and then maximize the evidence lower bound (ELBO)

    ∫ q(θ | λ) q(u | η, θ) log [ ĝ(u; y, θ) r(u | θ) p(θ) / ( q(θ | λ) q(u | η, θ) ) ] dθ du

with respect to the variational parameters λ and η. The easiest thing to do, since we do not necessarily know how to evaluate r(u | θ) point-wise, is to set q(u | η, θ) = r(u | θ), in which case the ELBO simplifies to

    ∫ q(θ | λ) r(u | θ) log [ ĝ(u; y, θ) p(θ) / q(θ | λ) ] dθ du.

This is what is called Variational Bayes with Intractable Likelihood (VBIL) [Tran et al., 2015], which leverages pseudo-marginal ideas in variational inference. Note that this is really just standard variational Bayes on the extended space of θ and u with a specific choice of variational approximation. After optimizing with respect to λ we get an approximation to the true posterior (1), i.e. q(θ | λ*). For Metropolis-Hastings we would need a proposal, e.g. q(θ' | θ) r(u' | θ'), and the acceptance probability becomes the minimum of 1 and

    [ ĝ(u'; y, θ') r(u' | θ') p(θ') q(θ | θ') r(u | θ) ] / [ ĝ(u; y, θ) r(u | θ) p(θ) q(θ' | θ) r(u' | θ') ]
        = [ ĝ(u'; y, θ') p(θ') q(θ | θ') ] / [ ĝ(u; y, θ) p(θ) q(θ' | θ) ].

We could also consider doing EP, INLA, a Laplace approximation, sequential Monte Carlo, or any other type of approximate Bayesian inference that you can come up with, using (2) as our target distribution. The marginal distribution over θ would be an approximation to the original posterior.
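To make the Markov chain example concrete, here is a minimal sketch of pseudo-marginal Metropolis-Hastings on the extended space (θ, u), using the acceptance ratio above with a symmetric random-walk proposal. The functions g_hat, log_prior and sample_r are illustrative placeholders for the unbiased estimator, the log prior and the simulator of u ∼ r(u | θ), not part of any library.

```python
import numpy as np

def pm_mh(g_hat, log_prior, sample_r, theta0, n_iters=5000, step=0.5):
    """Pseudo-marginal Metropolis-Hastings targeting pi(theta, u) in (2).

    g_hat(u, theta)  -- non-negative unbiased estimate of p(y | theta)
    log_prior(theta) -- log p(theta)
    sample_r(theta)  -- draws the auxiliary randomness u ~ r(u | theta)
    """
    theta = theta0
    u = sample_r(theta)
    g = g_hat(u, theta)
    samples = []
    for _ in range(n_iters):
        theta_prop = theta + step * np.random.randn()  # symmetric random-walk proposal
        u_prop = sample_r(theta_prop)                  # fresh "seed" u' ~ r(u' | theta')
        g_prop = g_hat(u_prop, theta_prop)
        # r(u | theta) cancels in the ratio, leaving estimates and priors (proposal is symmetric)
        log_alpha = (np.log(g_prop) + log_prior(theta_prop)
                     - np.log(g) - log_prior(theta))
        if np.log(np.random.rand()) < log_alpha:
            theta, u, g = theta_prop, u_prop, g_prop   # accept; keep the current estimate
        samples.append(theta)
    return np.array(samples)  # the theta-marginal approximates the posterior (1)
```

The important pseudo-marginal detail is that the current estimate g is stored and reused until the next acceptance; refreshing u (and hence g) at every iteration without accounting for it in the acceptance ratio would change the stationary distribution.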

Approximate Bayesian Computation

In approximate Bayesian computation we typically assume that we can simulate from the likelihood for a given parameter value, but that we cannot evaluate its density point-wise. We handle this by introducing a systematic bias into the model, replacing p(y | θ) in (1) by

    p_ε(y | θ) = ∫ K_ε(y, x) p(x | θ) dx,

where K_ε is a kernel, e.g. K_ε(y, x) ∝ exp{ -(y - x)² / (2ε²) }, and p(x | θ) is our data simulator. This means we are now actually interested in approximating the likelihood-free (or ABC) posterior

    p_LF(θ | y) ∝ p_ε(y | θ) p(θ) = p(θ) ∫ K_ε(y, x) p(x | θ) dx.    (3)

This introduces a systematic error: no matter how well we approximate (3), it will never be exactly equal to the true posterior (1) for any non-zero ε. It might seem like we have not really gained anything, because evaluating p_ε(y | θ) point-wise still requires an intractable integral. However, we can easily generate an unbiased estimate of it by simulating x ∼ p(x | θ) and returning K_ε(y, x). This extends straightforwardly to the many-sample case

    (1/N) Σ_{i=1}^N K_ε(y, x_i),    x_i ∼ p(x | θ),

which fits very nicely within the pseudo-marginal framework! That is, we can identify

    ĝ(u; y, θ) = (1/N) Σ_{i=1}^N K_ε(y, x_i),    u := x_{1:N},    r(u | θ) = Π_{i=1}^N p(x_i | θ).

However, note that ĝ is now an unbiased estimate of p_ε(y | θ) and not of p(y | θ). We can still use the pseudo-marginal framework, with the understanding that our approximation will now have the likelihood-free posterior p_LF(θ | y) as its marginal. For completeness, the new target distribution is

    π(θ, u) = π(θ, x_{1:N}) ∝ p(θ) [ (1/N) Σ_{i=1}^N K_ε(y, x_i) ] Π_{i=1}^N p(x_i | θ).    (4)
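As a complement, here is a minimal sketch of the ABC likelihood estimate ĝ above with a Gaussian kernel K_ε. The toy simulator (x equal to θ plus standard normal noise), the bandwidth and all names are illustrative; in practice y and the x_i would typically be low-dimensional summary statistics of the data.

```python
import numpy as np

def abc_likelihood_estimate(y, x, eps):
    """ABC estimate (1/N) sum_i K_eps(y, x_i) of p_eps(y | theta),
    where x = x_{1:N} are simulator outputs x_i ~ p(x | theta) and
    K_eps is a normalized Gaussian kernel with bandwidth eps."""
    log_k = -0.5 * ((y - x) / eps) ** 2 - np.log(eps * np.sqrt(2.0 * np.pi))
    return np.mean(np.exp(log_k))

# Toy illustration with simulator x = theta + standard normal noise.
theta, y, eps, N = 0.5, 1.2, 0.1, 200
x = theta + np.random.randn(N)             # x_i ~ p(x | theta)
print(abc_likelihood_estimate(y, x, eps))  # unbiased estimate of p_eps(y | theta)
```

Plugging this estimate into the pseudo-marginal Metropolis-Hastings sketch from the previous section (with u = x_{1:N}) gives an ABC-MCMC algorithm whose θ-marginal targets the likelihood-free posterior (3).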

Connection with Automatic Variational ABC

We can now show the correctness of the approach taken by Moreno et al. [2016]. It can be interpreted as variational Bayes (i.e. VBIL) with the reparameterization trick applied to both q(θ | λ) and r(u | θ), for the target distribution in (4). Note that if we do not want to use q(u | θ, η) = r(u | θ), and Moreno et al. [2016] do not in some of their experiments, we can work under the assumption that x = f(ω, θ) with ω ∼ m(ω), i.e. a reparameterization of p(x | θ). We then maximize the following ELBO

    ∫ q(θ | λ) [ Π_{i=1}^N q(ω_i | θ, η) ] log { p(θ) [ (1/N) Σ_{i=1}^N K_ε(y, f(ω_i, θ)) ] Π_{i=1}^N m(ω_i) / ( q(θ | λ) Π_{i=1}^N q(ω_i | θ, η) ) } dθ dω_{1:N},

and apply the reparameterization trick to both q(θ | λ) and q(ω_i | θ, η). We need to assume that we can evaluate (and differentiate) m(ω) point-wise. Again, this can be interpreted and motivated as standard variational Bayes on the extended space of θ and ω_{1:N}. The pseudo-marginal framework now says that q(θ | λ) is an approximation to the likelihood-free posterior; see Tran et al. [2015] for details and the assumptions required for this to hold. Note that this is not exactly how the approach is motivated in the paper, but it is how I would justify what they do.
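Below is a minimal sketch of a single reparameterized Monte Carlo estimate of this ELBO, under purely illustrative assumptions: Gaussian choices for q(θ | λ), q(ω | η) and m(ω), a standard normal prior, the toy simulator f(ω, θ) = θ + ω and a Gaussian kernel. In an actual implementation the same computation would be written in an automatic differentiation framework so that gradients with respect to λ and η flow through the reparameterized draws.

```python
import numpy as np

def elbo_estimate(lam, eta, y, eps=0.1, N=50, rng=np.random):
    """One reparameterized Monte Carlo sample of the extended-space ELBO.

    Illustrative choices:
      q(theta | lam)   = N(mu_lam, sig_lam^2),  theta = mu_lam + sig_lam * zeta
      q(omega_i | eta) = N(mu_eta, sig_eta^2),  omega_i = mu_eta + sig_eta * xi_i
      m(omega) = N(0, 1),  p(theta) = N(0, 1),  f(omega, theta) = theta + omega
    """
    mu_lam, sig_lam = lam
    mu_eta, sig_eta = eta

    def log_normal(z, mu, sig):
        return -0.5 * ((z - mu) / sig) ** 2 - np.log(sig * np.sqrt(2.0 * np.pi))

    theta = mu_lam + sig_lam * rng.randn()        # reparameterized theta
    omega = mu_eta + sig_eta * rng.randn(N)       # reparameterized omega_{1:N}
    x = theta + omega                             # f(omega_i, theta)

    log_kernel = log_normal(y, x, eps)            # log K_eps(y, f(omega_i, theta))
    log_target = (log_normal(theta, 0.0, 1.0)                    # log p(theta)
                  + np.logaddexp.reduce(log_kernel) - np.log(N)  # log (1/N) sum_i K_eps
                  + np.sum(log_normal(omega, 0.0, 1.0)))         # log prod_i m(omega_i)
    log_q = (log_normal(theta, mu_lam, sig_lam)                  # log q(theta | lam)
             + np.sum(log_normal(omega, mu_eta, sig_eta)))       # log prod_i q(omega_i | eta)
    return log_target - log_q

print(np.mean([elbo_estimate(lam=(0.0, 1.0), eta=(0.0, 1.0), y=1.2) for _ in range(1000)]))
```

Averaging many such draws estimates the ELBO; differentiating the same computation with respect to (λ, η) by automatic differentiation gives reparameterization-trick gradient estimates of the kind described above.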

Connection with Hamiltonian ABC

Assume again that we can reparameterize the simulator as x = f(ω, θ) with ω ∼ m(ω), which is equivalent to x ∼ p(x | θ). This means we can still use the pseudo-marginal framework, but now on the extended space (θ, ω_{1:N}) with the target distribution

    π(θ, ω_{1:N}) = p(θ) [ (1/N) Σ_{i=1}^N K_ε(y, f(ω_i, θ)) ] Π_{i=1}^N m(ω_i) / p_ε(y),    (5)

which retains the likelihood-free posterior (3) as its marginal over θ. In the Hamiltonian ABC paper [Meeds et al., 2015] they then alternate between sampling θ | ω_{1:N} and ω_{1:N} | θ, using HMC and another Markov chain method (flipping), respectively. This means we need to take the gradient with respect to θ,

    ∇_θ log p(θ) + ∇_θ log ( Σ_{i=1}^N K_ε(y, f(ω_i, θ)) ).    (6)

If we could calculate this gradient exactly, which is assumed in Moreno et al. [2016], we could simply use standard HMC for updating θ | ω_{1:N}. In fact, conditional on ω_{1:N} there is no extra randomness arising from the likelihood simulator! So approximating the second term in (6) by finite differences certainly introduces an error, but not a stochastic error of the kind that would require the stochastic gradient Monte Carlo framework. The noise, I assume, comes in through the use of SPSA to approximate the gradient, which gives an almost unbiased estimate of the second term above. Together with the stochastic gradient Monte Carlo methods described in the paper, this should result in something that more (or less?) samples from the likelihood-free posterior (3). Another important point is that (6) might seem to involve an unbiased approximation of the ABC likelihood that we have to evaluate through a logarithm. However, because we are using the pseudo-marginal approach, this is simply part of our target distribution!
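To illustrate the point that, given ω_{1:N}, the gradient in (6) is a deterministic function of θ, here is a minimal sketch for an illustrative toy model with p(θ) = N(0, 1), simulator f(ω, θ) = θ + ω, ω ∼ N(0, 1), and a Gaussian kernel K_ε; the gradient is computed analytically by the chain rule through f.

```python
import numpy as np

def grad_log_target(theta, omega, y, eps):
    """Gradient (6): d/dtheta [ log p(theta) + log sum_i K_eps(y, f(omega_i, theta)) ]
    for p(theta) = N(0, 1), f(omega, theta) = theta + omega (so df/dtheta = 1)
    and a Gaussian kernel K_eps. Deterministic once omega_{1:N} is fixed."""
    x = theta + omega                                 # f(omega_i, theta)
    log_k = -0.5 * ((y - x) / eps) ** 2               # log K_eps(y, x_i) up to a constant
    w = np.exp(log_k - np.logaddexp.reduce(log_k))    # normalized kernel weights
    grad_prior = -theta                               # d/dtheta log N(theta; 0, 1)
    grad_kernel = np.sum(w * (y - x) / eps ** 2)      # chain rule through f
    return grad_prior + grad_kernel

omega = np.random.randn(100)                          # fixed seeds omega_{1:N}
print(grad_log_target(theta=0.3, omega=omega, y=1.2, eps=0.5))
print(grad_log_target(theta=0.3, omega=omega, y=1.2, eps=0.5))  # identical: no extra noise
```

Calling the function twice with the same ω_{1:N} returns the same value, which is the sense in which standard HMC would suffice here if the simulator gradient were available.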

References

C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 37(2), April 2009.

E. Meeds, R. Leenders, and M. Welling. Hamiltonian ABC. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.

A. Moreno, T. Adel, E. Meeds, J. M. Rehg, and M. Welling. Automatic Variational ABC. arXiv:1606.08549, June 2016.

M.-N. Tran, D. J. Nott, and R. Kohn. Variational Bayes with Intractable Likelihood. arXiv:1503.08621, March 2015.