Approximate Bayesian Computation and Particle Filters


Approximate Bayesian Computation and Particle Filters
Dennis Prangle, Reading University, 5th February 2014

Introduction This talk is mostly a literature review, with a few comments on my own ongoing research. See Jasra, "Approximate Bayesian Computation for a Class of Time Series Models", arXiv (2014) for a thorough overview.

Background: ABC and state space models

Likelihood methods Statistical inference is often based on the likelihood: the probability (density) of the data y_obs given parameters θ. Maximum likelihood: choose the θ which maximises this. Bayes: use the likelihood and prior to form the posterior parameter distribution. Both rely on many evaluations of the likelihood for different θ.

Likelihood-free methods For complex models, likelihood evaluation can be too expensive or impossible (examples later), but it is often quick to simulate data given parameters. This motivates likelihood-free methods based on simulation. General idea: simulate data y_sim from many θs, and select the θs giving y_sim ≈ y_obs. These methods typically give approximate results.

Approximate Bayesian Computation (ABC) Puts the likelihood-free idea into a roughly Bayesian framework. A simple algorithm is to repeat these steps N times: draw θ from the prior; draw y_sim | θ from the model; accept θ if d(y_sim, y_obs) ≤ ɛ. The output θs are a sample from an approximation to the posterior. The approximation quality improves as ɛ → 0, but the sample size decreases, so the choice of ɛ is a trade-off.
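
The simple algorithm above can be written out as a rejection sampler. A minimal sketch, using a toy Normal-mean model with the sample mean as summary statistic; the toy model and all helper names are illustrative, not from the talk:

```python
import random
import statistics

def rejection_abc(y_obs, prior_sample, simulate, distance, eps, n_draws):
    """ABC rejection sampler: draw theta from the prior, simulate data,
    and keep theta whenever the simulated data land within eps of y_obs."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()
        y_sim = simulate(theta)
        if distance(y_sim, y_obs) <= eps:
            accepted.append(theta)
    return accepted  # a sample from an approximation to the posterior

# Toy example: infer the mean of a Normal(theta, 1) from 50 observations,
# comparing sample means (a sufficient summary statistic here).
random.seed(1)
y_obs = [random.gauss(2.0, 1.0) for _ in range(50)]
post = rejection_abc(
    y_obs,
    prior_sample=lambda: random.uniform(-5, 5),  # flat prior on theta
    simulate=lambda th: [random.gauss(th, 1.0) for _ in range(50)],
    distance=lambda a, b: abs(statistics.mean(a) - statistics.mean(b)),
    eps=0.1,
    n_draws=20000,
)
```

Shrinking eps sharpens the approximation but leaves fewer accepted draws, which is the trade-off described above.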

Curse of dimensionality / summary statistics The quality of results worsens for high dimensional data. Can replace the data y with low dimensional summary statistics S(y), i.e. accept if d(S(y_sim), S(y_obs)) ≤ ɛ. There is some research on how to choose S well, e.g. Fearnhead and Prangle (2012), but this adds further layers of approximation and tuning. For state space models, better approaches are possible.

State space models Assume there is a latent Markov chain X_1, X_2, ..., X_T. The observations are Y_1, Y_2, ..., Y_T, where Y_i is conditionally independent of everything else given X_i; informally, Y_i depends only on X_i. Only the Y_i s are observed; the X_i s are known as hidden/latent states. The model is sometimes referred to as a latent or hidden Markov model.

State space models Often there is an evolution density: π(X_{t+1} = x_{t+1} | X_t = x_t, θ) = g(x_{t+1} | x_t, θ), and an observation density: π(Y_t = y_t | X_t = x_t, θ) = h(y_t | x_t, θ). (Dependence on t is possible as well.) The standard particle filter requires a tractable observation density.
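
A minimal sketch of such a model, assuming (for illustration only) a scalar AR(1) latent chain with Gaussian observation noise:

```python
import math
import random

# Illustrative densities: g(x_{t+1} | x_t, theta) is Normal(phi * x_t, sigma_x^2)
# and h(y_t | x_t, theta) is Normal(x_t, sigma_y^2).

def simulate_ssm(T, phi=0.9, sigma_x=0.5, sigma_y=1.0, x0=0.0):
    """Simulate latent states and observations from the toy state space model."""
    xs, ys = [], []
    x = x0
    for _ in range(T):
        x = phi * x + random.gauss(0.0, sigma_x)   # evolution density g
        ys.append(x + random.gauss(0.0, sigma_y))  # observation density h
        xs.append(x)
    return xs, ys

def obs_density(y, x, sigma_y=1.0):
    """The tractable observation density h(y | x) a standard particle filter needs."""
    return math.exp(-0.5 * ((y - x) / sigma_y) ** 2) / (sigma_y * math.sqrt(2.0 * math.pi))
```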

State space model inference goals Parameter inference: θ | y_1, y_2, ..., y_t (learn parameters). Filtering: x_t | y_1, y_2, ..., y_t (learn current state). Smoothing: x_1, x_2, ..., x_t | y_1, y_2, ..., y_t (learn historic states). Prediction: x_{t+1} | y_1, y_2, ..., y_t (learn future state).

Main example: alpha-stable model The α-stable distribution is a model for heavy-tailed data, often used in finance. The key parameter is α, with valid range 0 < α ≤ 2. α = 2 is the normal distribution; smaller α gives heavier tails, e.g. α = 1 gives the Cauchy distribution. For most α values the density does not have a closed form. The distribution also has location and scale parameters.
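
Although the density lacks a closed form, simulation is straightforward, which is what makes this a natural likelihood-free example. A sketch of the symmetric (β = 0) case using the Chambers-Mallows-Stuck method:

```python
import math
import random

def symmetric_stable(alpha):
    """One draw from a symmetric alpha-stable distribution (beta = 0) via the
    Chambers-Mallows-Stuck method; no density evaluation is required."""
    v = random.uniform(-math.pi / 2.0, math.pi / 2.0)
    w = random.expovariate(1.0)
    if abs(alpha - 1.0) < 1e-12:
        return math.tan(v)  # alpha = 1: standard Cauchy
    return (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
            * (math.cos(v - alpha * v) / w) ** ((1.0 - alpha) / alpha))

# alpha = 2 recovers a Gaussian (variance 2 in this standard parameterisation);
# smaller alpha gives heavier tails.
```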

Main example: alpha-stable model Latent states and observations are both scalars. Some Markov model governs the latent states, and observation Y_i is X_i plus an α-stable draw. Application: daily log returns of a stock. X_i is the state of the economy; Y_i is X_i plus short-term variation, which is sometimes very large! The simplest case is X_i constant, giving iid data.

Main example: plot [Figure: simulated observations y plotted against Time]

Other examples (X_t, Y_t) together form a Markov process, where X_t is unobserved and Y_t is observed exactly. This can be put into state space form, but the observation density is often not tractable. Several applications, e.g. chemical reactions, infectious diseases, population dynamics.

ABC filtering

ABC particle filter Input: parameters θ, bandwidth ɛ > 0, number of particles N, data y_1^obs, ..., y_T^obs.
1. Initialise: let t = 1.
2. Sample x^(i) values from the prior for 1 ≤ i ≤ N.
3. Simulate observations y^(i) from π(y | x^(i), θ).
4. Accept/reject: let w^(i) = 1 if ||y_t^obs − y^(i)||_2 ≤ ɛ, and 0 otherwise.
5. Increment t, resample and propagate the particles. Return to step 3.
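
The steps above can be sketched in code on a toy random-walk model (the model, tolerance and helper names are all illustrative). The recorded acceptance frequencies are the ingredient used later for likelihood estimation:

```python
import random

def abc_particle_filter(y_obs, theta, eps, N, prior_sample, propagate, simulate_obs):
    """ABC particle filter sketch: simulate a pseudo-observation per particle,
    give weight 1 if it lands within eps of the real observation and 0
    otherwise, then resample among accepted particles and propagate."""
    particles = [prior_sample() for _ in range(N)]
    acc_rates = []  # acceptance frequencies; their product enters the likelihood estimate
    for y_t in y_obs:
        accepted = [x for x in particles
                    if abs(simulate_obs(x, theta) - y_t) <= eps]
        if not accepted:
            raise RuntimeError("degenerate: zero acceptances at this iteration")
        acc_rates.append(len(accepted) / N)
        particles = [propagate(random.choice(accepted), theta) for _ in range(N)]
    return particles, acc_rates

# Toy run: latent Gaussian random walk observed with Gaussian noise.
random.seed(0)
theta = 0.3  # latent step standard deviation (illustrative)
x, ys = 0.0, []
for _ in range(60):
    x += random.gauss(0.0, theta)
    ys.append(x + random.gauss(0.0, 1.0))

parts, rates = abc_particle_filter(
    ys, theta, eps=1.5, N=500,
    prior_sample=lambda: random.gauss(0.0, 1.0),
    propagate=lambda xi, th: xi + random.gauss(0.0, th),
    simulate_obs=lambda xi, th: xi + random.gauss(0.0, 1.0),
)
```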

Variations Alternative distance metric Smooth weights Summary statistics/data transformations More later

Filtering/smoothing output Consider computing some function of the states (e.g. the mean). Can compare the algorithm's output estimate with the true value. Filtering problem: the error is O(ɛ), not affected by the amount of data. Smoothing problem: the error is O(Tɛ), and strong assumptions are required. The error can be controlled by increasing computational effort.

Likelihood estimate Recall the likelihood is the density of the observations given θ. Can get an estimate from the particle filter. In ABC this is the product of acceptance frequencies (and a constant), i.e. C(ɛ) × (N_acc^(1)/N) × (N_acc^(2)/N) × ... This converges to the true value for N → ∞ and ɛ → 0 appropriately (Lebesgue differentiation theorem).

Degeneracy A problem with ABC particle filtering is degeneracy: zero acceptances at some iteration mean the algorithm cannot continue. Smooth weights are not much help. A possible solution comes later.

Parameter inference

Strategy Get likelihood estimates at various θs using the ABC PF. Use these to approximate the maximum likelihood estimator, or to construct the posterior distribution of θ.

Bayesian approach Use an MCMC algorithm on θ. For each θ proposed, make an associated likelihood estimate L̂(θ). The stationary distribution is proportional to prior(θ) E[L̂(θ) | θ]. This is the posterior distribution if L̂(θ) is unbiased; that is not true for ABC, so we get an approximate posterior.
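
A sketch of this pseudo-marginal style of MCMC, with a toy Normal-mean model and an artificially noisy (but unbiased) likelihood estimate standing in for the ABC PF estimate; all the specific choices here are illustrative:

```python
import math
import random

def mcmc_with_estimated_likelihood(log_lhat, log_prior, theta0, n_iters, step):
    """Metropolis-Hastings with a noisy nonnegative likelihood estimate:
    the estimate at the current theta is stored and reused, so the chain's
    stationary distribution is proportional to prior(theta) E[Lhat | theta]."""
    theta = theta0
    cur = log_lhat(theta) + log_prior(theta)
    chain = []
    for _ in range(n_iters):
        prop = theta + random.gauss(0.0, step)
        cand = log_lhat(prop) + log_prior(prop)  # fresh estimate at the proposal
        if math.log(random.random()) < cand - cur:
            theta, cur = prop, cand  # the estimate travels with the state
        chain.append(theta)
    return chain

# Toy target: theta given y_i ~ Normal(theta, 1), flat prior, with unbiased
# log-normal multiplicative noise on the likelihood estimate.
random.seed(2)
y = [random.gauss(1.0, 1.0) for _ in range(40)]
s = 0.5  # noise level of the log-likelihood estimate (illustrative)

def log_lhat(theta):
    exact = sum(-0.5 * (yi - theta) ** 2 for yi in y)  # up to a constant
    return exact + s * random.gauss(0.0, 1.0) - 0.5 * s * s

chain = mcmc_with_estimated_likelihood(log_lhat, lambda th: 0.0, 0.0, 5000, 0.3)
```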

Maximum likelihood approach Stochastic gradient ascent algorithms: given θ, estimate the gradient of the log likelihood. This can be done from ABC output (various papers). Move along the direction of greatest increase; there are several papers on implementation details. This maximises E[L̂(θ) | θ].
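
One simple way to realise this idea, sketched below, is a Kiefer-Wolfowitz-style scheme: estimate the gradient by central finite differences of independent noisy log-likelihood evaluations and climb it with decreasing step sizes. The schedule and toy objective are illustrative; the papers mentioned use more refined gradient estimators:

```python
import random

def stochastic_gradient_ascent(log_lhat, theta0, n_iters, a=0.5, c=0.1):
    """Climb a noisy log-likelihood: finite-difference gradient estimate from
    two independent evaluations, with step sizes a/k and widths c/k^(1/4)."""
    theta = theta0
    for k in range(1, n_iters + 1):
        ck = c / k ** 0.25
        grad = (log_lhat(theta + ck) - log_lhat(theta - ck)) / (2.0 * ck)
        theta += (a / k) * grad  # move along the estimated direction of increase
    return theta

# Toy noisy log-likelihood with its maximum at theta = 1 (illustrative).
random.seed(3)
noisy_loglik = lambda th: -(th - 1.0) ** 2 + 0.1 * random.gauss(0.0, 1.0)
theta_hat = stochastic_gradient_ascent(noisy_loglik, theta0=-2.0, n_iters=2000)
```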

Comparison of methods The MLE approach is faster and can be updated quickly as new data arrive. The Bayesian approach gives uncertainty quantification. Both currently require considerable tuning and assumptions.

Approximation quality Both approaches effectively use an approximate likelihood L_ɛ(θ) = E[L̂(θ) | θ], with MLE θ̂_ɛ. This is typically not equal to the correct MLE θ̂, but as ɛ → 0, θ̂_ɛ → θ̂.

Consistency Consider fixed ɛ > 0 and T → ∞, i.e. a large amount of data. Under some conditions the true MLE converges to the true parameter values ("consistency"), and θ̂_ɛ converges to a nearby value.

Noisy ABC: consistency Noisy ABC adds iid noise to the observations before performing ABC inference, drawn from a particular noise distribution (see next slide). This achieves consistency: lim_{T→∞} θ̂_ɛ is correct even for ɛ > 0! With enough data this approximate method learns the true parameters, but adding noise increases the variance.

Noisy ABC: theory The ABC likelihood can be shown to be the likelihood of a perturbed model: the perturbation is adding measurement error, with error distribution uniform of radius ɛ. Noisy ABC adds iid noise from the same distribution, so that the observations are now model + noise, and the perturbed model is in fact correct. This gives consistency.
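
The preprocessing step this describes is simple to write down for scalar observations, assuming the uniform-kernel acceptance rule:

```python
import random

def add_abc_noise(y_obs, eps):
    """Noisy ABC preprocessing: perturb each observation with independent
    Uniform(-eps, eps) noise, the same distribution as the implicit ABC
    measurement error, so the perturbed model is exactly correct."""
    return [y + random.uniform(-eps, eps) for y in y_obs]
```

ABC inference is then run on the perturbed observations in place of the originals.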

Avoiding degeneracy

Alive particle filter: motivation Consider an iteration of the ABC PF with data y_i: N simulations are performed, and there is degeneracy if all are rejected. The alive PF uses an adaptive N: it keeps simulating until M acceptances (M prespecified). This avoids degeneracy, but has a random run-time.
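
A sketch of one such iteration with adaptive simulation count (helper names are illustrative; the stopping rule follows the description above, and (M − 1)/(n − 1) is the standard unbiased success-probability estimator under this kind of negative-binomial stopping, assuming M ≥ 2):

```python
import random

def alive_pf_step(y_t, particles, theta, eps, M, simulate_obs, max_sims=10**6):
    """One alive-PF iteration: resample particles and simulate until M
    pseudo-observations fall within eps of y_t; n is the random number of
    simulations used, giving the acceptance-rate estimate (M - 1)/(n - 1)."""
    accepted, n = [], 0
    while len(accepted) < M:
        n += 1
        if n > max_sims:
            raise RuntimeError("simulation budget exceeded")
        x = random.choice(particles)
        if abs(simulate_obs(x, theta) - y_t) <= eps:
            accepted.append(x)
    return accepted, (M - 1) / (n - 1)

# Toy usage: a particle cloud near the truth, Gaussian observation noise.
random.seed(5)
cloud = [random.gauss(0.0, 1.0) for _ in range(200)]
acc, rate_hat = alive_pf_step(0.0, cloud, None, eps=1.0, M=50,
                              simulate_obs=lambda x, th: x + random.gauss(0.0, 1.0))
```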

Alive particle filter: technical details Care is needed in the alive PF to get good likelihood estimate properties, but the details are not needed in this talk. See Jasra et al (2013).

Efficiency comparison Work in progress! Comparing the number of simulations required. For the standard ABC PF, this is the number needed to avoid degeneracy with a given probability. The standard ABC PF cost is at best O(T log T); the alive PF cost is at best O(T).

Efficiency comparison: poor conditions Both PFs have a much higher cost when the tails of the observations are heavy in comparison with the tails of the proposals.

Efficiency comparison: improving performance Transforming the data to avoid heavy tails is helpful. Modify the alive PF to quit early once we know the likelihood estimate is low. Paper later in the year, hopefully!

Conclusion

Summary 1 ABC allows inference based on model simulations only; results are approximate. Summary statistics are often necessary for high dimensional data, but add further approximation. For simple state space models the ABC particle filter can avoid summary statistics, analysing each data point in sequence.

Summary 2 Filtering, smoothing and parameter inference discussed. Theory on convergence as ɛ → 0. Noisy ABC allows consistent estimates. The alive PF avoids degeneracy problems.

Future directions Tuning, in particular of ɛ. Improving noisy ABC (e.g. multiple noise realisations). Faster algorithms. More general theory: lots of assumptions are currently needed. Higher dimensional data. Choosing between models. Alternative likelihood-free approaches (e.g. via expectation propagation).