Lecture 8: Bayesian Estimation of Parameters in State Space Models

March 30, 2016

Contents

1. Bayesian estimation of parameters in state space models
2. Computational methods for parameter estimation
3. Practical parameter estimation in state space models
4. Summary

Batch Bayesian estimation of parameters

State space model with unknown parameters $\theta \in \mathbb{R}^d$:
$$\theta \sim p(\theta), \qquad x_0 \sim p(x_0 \mid \theta), \qquad x_k \sim p(x_k \mid x_{k-1}, \theta), \qquad y_k \sim p(y_k \mid x_k, \theta).$$
The full posterior, in principle, can be computed as
$$p(x_{0:T}, \theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid x_{0:T}, \theta)\, p(x_{0:T} \mid \theta)\, p(\theta)}{p(y_{1:T})}.$$
The marginal posterior of the parameters is then
$$p(\theta \mid y_{1:T}) = \int p(x_{0:T}, \theta \mid y_{1:T})\, \mathrm{d}x_{0:T}.$$

Batch Bayesian estimation of parameters (cont.)

Advantages:
- A simple static Bayesian model.
- We can take any numerical method (e.g., MCMC) to attack the model.

Disadvantages:
- We are not utilizing the Markov structure of the model.
- The dimensionality is huge, which is computationally very challenging.
- It is hard to utilize the already developed approximations for filters and smoothers.
- It requires computation of a high-dimensional integral over the state trajectories.

For computational reasons, we will select another, filtering- and smoothing-based route.

Filtering-based Bayesian estimation of parameters [1/3]

Directly approximate the marginal posterior distribution:
$$p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta).$$
The key is the prediction error decomposition:
$$p(y_{1:T} \mid \theta) = \prod_{k=1}^{T} p(y_k \mid y_{1:k-1}, \theta).$$
Luckily, the Bayesian filtering equations allow us to compute $p(y_k \mid y_{1:k-1}, \theta)$ efficiently.

Filtering-based Bayesian estimation of parameters [2/3]

Recall that the prediction step of the Bayesian filtering equations computes $p(x_k \mid y_{1:k-1}, \theta)$. Using the conditional independence of the measurements we get:
$$p(y_k, x_k \mid y_{1:k-1}, \theta) = p(y_k \mid x_k, y_{1:k-1}, \theta)\, p(x_k \mid y_{1:k-1}, \theta) = p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta).$$
Integration over $x_k$ thus gives
$$p(y_k \mid y_{1:k-1}, \theta) = \int p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)\, \mathrm{d}x_k.$$

Filtering-based Bayesian estimation of parameters [3/3]

Recursion for the marginal likelihood of the parameters: the marginal likelihood is given by
$$p(y_{1:T} \mid \theta) = \prod_{k=1}^{T} p(y_k \mid y_{1:k-1}, \theta),$$
where the terms can be solved via the recursion
$$p(x_k \mid y_{1:k-1}, \theta) = \int p(x_k \mid x_{k-1}, \theta)\, p(x_{k-1} \mid y_{1:k-1}, \theta)\, \mathrm{d}x_{k-1}$$
$$p(y_k \mid y_{1:k-1}, \theta) = \int p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)\, \mathrm{d}x_k$$
$$p(x_k \mid y_{1:k}, \theta) = \frac{p(y_k \mid x_k, \theta)\, p(x_k \mid y_{1:k-1}, \theta)}{p(y_k \mid y_{1:k-1}, \theta)}.$$
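
To make the recursion concrete, here is a minimal sketch for a model whose state space is finite, so both integrals reduce to sums (the classical hidden Markov model forward algorithm). The two-state transition matrix, Gaussian emission parameters, and data below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood(y, Pi, mu, sigma, p0):
    """Forward recursion for log p(y_{1:T} | theta) in a 2-state HMM.

    Pi[i, j] = p(x_k = j | x_{k-1} = i); emission j is N(mu[j], sigma[j]^2).
    """
    alpha = p0                      # p(x_0 | theta)
    log_lik = 0.0
    for yk in y:
        pred = alpha @ Pi           # prediction: p(x_k | y_{1:k-1}, theta)
        like = norm.pdf(yk, mu, sigma)
        pyk = np.sum(like * pred)   # p(y_k | y_{1:k-1}, theta)
        log_lik += np.log(pyk)
        alpha = like * pred / pyk   # update: p(x_k | y_{1:k}, theta)
    return log_lik

Pi = np.array([[0.9, 0.1], [0.2, 0.8]])
y = np.array([0.1, -0.3, 1.2, 0.8, -0.1])
print(log_marginal_likelihood(y, Pi, np.array([0.0, 1.0]),
                              np.array([0.5, 0.5]), np.array([0.5, 0.5])))
```

For continuous states the same recursion is carried out by a Kalman filter (exactly, for linear Gaussian models) or by approximate filters, as developed below.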

Energy function

Once we have the likelihood $p(y_{1:T} \mid \theta)$ we can compute the posterior via
$$p(\theta \mid y_{1:T}) = \frac{p(y_{1:T} \mid \theta)\, p(\theta)}{\int p(y_{1:T} \mid \theta)\, p(\theta)\, \mathrm{d}\theta}.$$
The normalization constant in the denominator is irrelevant, and it is often more convenient to work with the unnormalized distribution
$$p(\theta \mid y_{1:T}) \propto p(y_{1:T} \mid \theta)\, p(\theta).$$
For numerical reasons it is better to work with the logarithm of the above unnormalized distribution. The negative logarithm is the energy function:
$$\varphi_T(\theta) = -\log p(y_{1:T} \mid \theta) - \log p(\theta).$$

Energy function (cont.)

The posterior distribution can be recovered via
$$p(\theta \mid y_{1:T}) \propto \exp(-\varphi_T(\theta)).$$
$\varphi_T(\theta)$ is called the energy function because, in physics, the above corresponds to the probability density of a system with energy $\varphi_T(\theta)$. The energy function can be evaluated recursively as follows:
- Start from $\varphi_0(\theta) = -\log p(\theta)$.
- At each step $k = 1, 2, \ldots, T$ compute
$$\varphi_k(\theta) = \varphi_{k-1}(\theta) - \log p(y_k \mid y_{1:k-1}, \theta).$$
For linear models, we can evaluate the energy function exactly with the help of the Kalman filter. In non-linear models we can use Gaussian filters or particle filters to approximate the energy function.

Maximum a posteriori approximations

The maximum a posteriori (MAP) estimate is
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid y_{1:T}),$$
which can be equivalently computed as
$$\hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta}\, \varphi_T(\theta).$$
The maximum likelihood (ML) estimate of the parameter is a MAP estimate with a formally uniform prior $p(\theta) \propto 1$. The minimum (or maximum) can be found by using various gradient-free or gradient-based optimization methods. Gradients can be computed by recursive equations called sensitivity equations, or sometimes by using Fisher's identity.
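
As a usage sketch, any general-purpose optimizer can be applied to the energy function. Below, scipy.optimize.minimize finds the MAP estimate for a deliberately trivial toy model (Gaussian observations with unknown mean and an N(0, 1) prior); the data are invented, and in a real state space model energy(theta) would internally run a filter as described later.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = np.array([0.9, 1.4, 0.7, 1.1])   # invented observations

def energy(theta):
    # phi_T(theta) = -log p(y_{1:T} | theta) - log p(theta)
    theta = np.asarray(theta).item()
    log_lik = np.sum(norm.logpdf(y, loc=theta, scale=1.0))
    log_prior = norm.logpdf(theta, loc=0.0, scale=1.0)
    return -(log_lik + log_prior)

res = minimize(energy, x0=np.array([0.0]), method="BFGS")
theta_map = res.x[0]   # matches the closed form sum(y) / (len(y) + 1) here
print(theta_map)
```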

Laplace approximations

The MAP estimate corresponds to a Dirac delta function approximation to the posterior distribution,
$$p(\theta \mid y_{1:T}) \approx \delta(\theta - \hat{\theta}_{\mathrm{MAP}}),$$
which ignores the spread of the distribution completely. The idea of the Laplace approximation is to form a Gaussian approximation to the posterior distribution:
$$p(\theta \mid y_{1:T}) \approx \mathrm{N}(\theta \mid \hat{\theta}_{\mathrm{MAP}}, [H(\hat{\theta}_{\mathrm{MAP}})]^{-1}),$$
where $H(\hat{\theta}_{\mathrm{MAP}})$ is the Hessian matrix of the energy function evaluated at the MAP estimate.
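
Continuing the same toy model, a minimal Laplace approximation sketch: the Hessian of the energy at the MAP point is computed by a central finite difference (the parameter is scalar here, so the Hessian is a single second derivative).

```python
import numpy as np
from scipy.stats import norm

y = np.array([0.9, 1.4, 0.7, 1.1])   # same invented data as above

def energy(theta):
    return -(np.sum(norm.logpdf(y, loc=theta, scale=1.0))
             + norm.logpdf(theta, loc=0.0, scale=1.0))

theta_map = np.sum(y) / (len(y) + 1)   # closed-form MAP for this toy model

# Second derivative of the energy by central finite differences.
h = 1e-4
hess = (energy(theta_map + h) - 2 * energy(theta_map) + energy(theta_map - h)) / h**2
laplace_var = 1.0 / hess               # here hess = len(y) + 1 exactly
print(theta_map, laplace_var)
```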

Markov chain Monte Carlo (MCMC)

- Markov chain Monte Carlo (MCMC) methods are algorithms for drawing samples from $p(\theta \mid y_{1:T})$.
- They are based on simulating a Markov chain which has the distribution $p(\theta \mid y_{1:T})$ as its stationary distribution.
- The Metropolis-Hastings (MH) algorithm uses a proposal density $q(\theta^{(i)} \mid \theta^{(i-1)})$ for suggesting new samples $\theta^{(i)}$ given the previous ones $\theta^{(i-1)}$.
- Gibbs sampling samples components of the parameters one at a time from their conditional distributions given the other parameters.
- Adaptive MCMC methods are based on adapting the proposal density $q(\theta^{(i)} \mid \theta^{(i-1)})$ based on the past samples.
- The Hamiltonian Monte Carlo (HMC), or hybrid Monte Carlo, method simulates a physical system to construct an efficient proposal distribution.

Metropolis-Hastings

1. Draw the starting point $\theta^{(0)}$ from an arbitrary initial distribution.
2. For $i = 1, 2, \ldots, N$ do:
   1. Sample a candidate point $\theta^\ast \sim q(\theta^\ast \mid \theta^{(i-1)})$.
   2. Evaluate the acceptance probability
   $$\alpha_i = \min\left\{1,\; \exp\!\big(\varphi_T(\theta^{(i-1)}) - \varphi_T(\theta^\ast)\big)\, \frac{q(\theta^{(i-1)} \mid \theta^\ast)}{q(\theta^\ast \mid \theta^{(i-1)})}\right\}.$$
   3. Generate a uniform random variable $u \sim \mathrm{U}(0, 1)$ and set
   $$\theta^{(i)} = \begin{cases} \theta^\ast, & \text{if } u \le \alpha_i \\ \theta^{(i-1)}, & \text{otherwise.} \end{cases}$$
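
A minimal random-walk Metropolis-Hastings sketch for the same toy posterior as in the MAP example: since the Gaussian random-walk proposal is symmetric, the ratio $q(\theta^{(i-1)} \mid \theta^\ast) / q(\theta^\ast \mid \theta^{(i-1)})$ cancels and only the energy difference remains. The step size, chain length, and burn-in below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.array([0.9, 1.4, 0.7, 1.1])   # same invented data as above

def energy(theta):
    return -(np.sum(norm.logpdf(y, loc=theta, scale=1.0))
             + norm.logpdf(theta, loc=0.0, scale=1.0))

theta = 0.0                  # theta^(0) from an arbitrary starting point
e_cur = energy(theta)
samples = []
for i in range(5000):
    theta_star = theta + 0.5 * rng.standard_normal()   # candidate from q(. | theta)
    e_star = energy(theta_star)
    # log alpha_i = min{0, phi_T(theta) - phi_T(theta*)}; the q-ratio cancels.
    if np.log(rng.uniform()) <= min(0.0, e_cur - e_star):
        theta, e_cur = theta_star, e_star              # accept
    samples.append(theta)                              # else keep the previous sample

print(np.mean(samples[1000:]))   # ~ posterior mean sum(y)/(len(y)+1) = 0.82
```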

Expectation-maximization (EM) algorithm [1/5]

Expectation-maximization (EM) is an algorithm for computing ML and MAP estimates of parameters when direct optimization is not feasible. Let $q(x_{0:T})$ be an arbitrary probability density over the states; then we have the inequality
$$\log p(y_{1:T} \mid \theta) \ge \mathcal{F}[q(x_{0:T}), \theta],$$
where the functional $\mathcal{F}$ is defined as
$$\mathcal{F}[q(x_{0:T}), \theta] = \int q(x_{0:T}) \log \frac{p(x_{0:T}, y_{1:T} \mid \theta)}{q(x_{0:T})}\, \mathrm{d}x_{0:T}.$$
The idea of EM: we can maximize the likelihood by iteratively maximizing the lower bound $\mathcal{F}[q(x_{0:T}), \theta]$.

Expectation-maximization (EM) algorithm [2/5]

Abstract EM: the maximization of the lower bound can be done by coordinate ascent as follows:
1. Start from initial guesses $q^{(0)}, \theta^{(0)}$.
2. For $n = 0, 1, 2, \ldots$ do the following steps:
   1. E-step: find $q^{(n+1)} = \arg\max_q \mathcal{F}[q, \theta^{(n)}]$.
   2. M-step: find $\theta^{(n+1)} = \arg\max_\theta \mathcal{F}[q^{(n+1)}, \theta]$.
To implement the EM algorithm we need to be able to do the maximizations in practice. Fortunately, it can be shown that
$$q^{(n+1)}(x_{0:T}) = p(x_{0:T} \mid y_{1:T}, \theta^{(n)}).$$

Expectation-maximization (EM) algorithm [3/5]

We now get
$$\mathcal{F}[q^{(n+1)}(x_{0:T}), \theta] = \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T}, y_{1:T} \mid \theta)\, \mathrm{d}x_{0:T} - \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T} \mid y_{1:T}, \theta^{(n)})\, \mathrm{d}x_{0:T}.$$
Because the latter term does not depend on $\theta$, maximizing $\mathcal{F}[q^{(n+1)}, \theta]$ is equivalent to maximizing
$$\mathcal{Q}(\theta, \theta^{(n)}) = \int p(x_{0:T} \mid y_{1:T}, \theta^{(n)}) \log p(x_{0:T}, y_{1:T} \mid \theta)\, \mathrm{d}x_{0:T}.$$

Expectation-maximization (EM) algorithm [4/5]

The EM algorithm consists of the following steps:
1. Start from an initial guess $\theta^{(0)}$.
2. For $n = 0, 1, 2, \ldots$ do the following steps:
   1. E-step: compute $\mathcal{Q}(\theta, \theta^{(n)})$.
   2. M-step: compute $\theta^{(n+1)} = \arg\max_\theta \mathcal{Q}(\theta, \theta^{(n)})$.
In state space models we have
$$\log p(x_{0:T}, y_{1:T} \mid \theta) = \log p(x_0 \mid \theta) + \sum_{k=1}^{T} \log p(x_k \mid x_{k-1}, \theta) + \sum_{k=1}^{T} \log p(y_k \mid x_k, \theta).$$

Expectation-maximization (EM) algorithm [5/5]

Thus on the E-step we compute
$$\mathcal{Q}(\theta, \theta^{(n)}) = \int p(x_0 \mid y_{1:T}, \theta^{(n)}) \log p(x_0 \mid \theta)\, \mathrm{d}x_0 + \sum_{k=1}^{T} \iint p(x_k, x_{k-1} \mid y_{1:T}, \theta^{(n)}) \log p(x_k \mid x_{k-1}, \theta)\, \mathrm{d}x_k\, \mathrm{d}x_{k-1} + \sum_{k=1}^{T} \int p(x_k \mid y_{1:T}, \theta^{(n)}) \log p(y_k \mid x_k, \theta)\, \mathrm{d}x_k.$$
In linear models, these terms can be computed from the RTS smoother results. In non-linear/non-Gaussian models we can approximate them using Gaussian RTS smoothers or particle smoothers. On the M-step we maximize $\mathcal{Q}(\theta, \theta^{(n)})$ with respect to $\theta$.

State augmentation

Consider a model of the form
$$x_k = f(x_{k-1}, \theta) + q_{k-1}$$
$$y_k = h(x_k, \theta) + r_k.$$
We can now rewrite the model as
$$\theta_k = \theta_{k-1}$$
$$x_k = f(x_{k-1}, \theta_{k-1}) + q_{k-1}$$
$$y_k = h(x_k, \theta_k) + r_k.$$
Redefining the state as $\tilde{x}_k = (x_k, \theta_k)$ leads to the augmented model without unknown parameters:
$$\tilde{x}_k = \tilde{f}(\tilde{x}_{k-1}) + \tilde{q}_{k-1}$$
$$y_k = \tilde{h}(\tilde{x}_k) + r_k.$$
This is called the state augmentation approach. The disadvantage is the severe non-linearity and singularity of the augmented model.
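
A minimal sketch of this rewriting: the parameter is appended to the state vector and given identity dynamics, so a single non-linear filter (e.g., an EKF or UKF) can track both. The particular $f$ and $h$ below are invented examples; note that $\theta_k$ receives no process noise, which is exactly the singularity mentioned above.

```python
import numpy as np

def f(x, theta):
    return theta * np.sin(x)   # hypothetical dynamic model

def h(x, theta):
    return 0.5 * x**2          # hypothetical measurement model

def f_aug(x_aug):
    """Augmented dynamics f~: the parameter has identity dynamics theta_k = theta_{k-1}."""
    x, theta = x_aug
    return np.array([f(x, theta), theta])

def h_aug(x_aug):
    """Augmented measurement model h~."""
    x, theta = x_aug
    return h(x, theta)

# f_aug and h_aug define a model without unknown parameters and can be passed
# to any non-linear Gaussian filter over the augmented state (x_k, theta_k).
print(f_aug(np.array([1.0, 0.8])), h_aug(np.array([1.0, 0.8])))
```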

Energy function for linear Gaussian models [1/3]

Consider the following linear Gaussian model with unknown parameters $\theta$:
$$x_k = A(\theta)\, x_{k-1} + q_{k-1}$$
$$y_k = H(\theta)\, x_k + r_k.$$
Recall that the Kalman filter gives us the Gaussian predictive distribution
$$p(x_k \mid y_{1:k-1}, \theta) = \mathrm{N}(x_k \mid m_k^-(\theta), P_k^-(\theta)).$$
Thus we get
$$p(y_k \mid y_{1:k-1}, \theta) = \int \mathrm{N}(y_k \mid H(\theta)\, x_k, R(\theta))\, \mathrm{N}(x_k \mid m_k^-(\theta), P_k^-(\theta))\, \mathrm{d}x_k = \mathrm{N}\big(y_k \mid H(\theta)\, m_k^-(\theta),\; H(\theta)\, P_k^-(\theta)\, H^{\mathsf{T}}(\theta) + R(\theta)\big).$$

Energy function for linear Gaussian models [2/3]

The recursion for the energy function is given as
$$\varphi_k(\theta) = \varphi_{k-1}(\theta) + \tfrac{1}{2} \log |2\pi\, S_k(\theta)| + \tfrac{1}{2}\, v_k^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta)\, v_k(\theta),$$
where the terms $v_k(\theta)$ and $S_k(\theta)$ are given by the Kalman filter with the parameters fixed to $\theta$. Prediction:
$$m_k^-(\theta) = A(\theta)\, m_{k-1}(\theta)$$
$$P_k^-(\theta) = A(\theta)\, P_{k-1}(\theta)\, A^{\mathsf{T}}(\theta) + Q(\theta).$$
(continues...)

Energy function for linear Gaussian models [3/3]

(...continues) Update:
$$v_k(\theta) = y_k - H(\theta)\, m_k^-(\theta)$$
$$S_k(\theta) = H(\theta)\, P_k^-(\theta)\, H^{\mathsf{T}}(\theta) + R(\theta)$$
$$K_k(\theta) = P_k^-(\theta)\, H^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta)$$
$$m_k(\theta) = m_k^-(\theta) + K_k(\theta)\, v_k(\theta)$$
$$P_k(\theta) = P_k^-(\theta) - K_k(\theta)\, S_k(\theta)\, K_k^{\mathsf{T}}(\theta).$$
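
A minimal sketch of this recursion for a scalar model, assuming a hypothetical parametrization $\theta = (A, Q)$ with $H$, $R$, $m_0$, $P_0$ fixed; the data are invented. The prior term $-\log p(\theta)$ of $\varphi_0$ is omitted, so the function returns the negative log-likelihood part of the energy.

```python
import numpy as np

def energy(theta, y, H=1.0, R=0.5, m0=0.0, P0=1.0):
    """Negative log-likelihood part of phi_T for a scalar linear Gaussian model."""
    A, Q = theta                       # hypothetical parametrization theta = (A, Q)
    phi, m, P = 0.0, m0, P0
    for yk in y:
        mp = A * m                     # prediction step
        Pp = A * P * A + Q
        v = yk - H * mp                # innovation v_k(theta)
        S = H * Pp * H + R             # innovation covariance S_k(theta)
        K = Pp * H / S                 # Kalman gain K_k(theta)
        m, P = mp + K * v, Pp - K * S * K
        phi += 0.5 * np.log(2 * np.pi * S) + 0.5 * v * v / S
    return phi                         # add -log p(theta) for the full energy

y = np.array([0.3, -0.1, 0.4, 0.2, 0.6])   # invented data
print(energy((0.9, 0.1), y))
```

Such a function can be plugged directly into the MAP optimization or Metropolis-Hastings sketches shown earlier.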

EM algorithm for linear Gaussian models

The expression for $\mathcal{Q}$ for linear Gaussian models can be written as
$$\begin{aligned} \mathcal{Q}(\theta, \theta^{(n)}) = {} & -\tfrac{1}{2} \log |2\pi\, P_0(\theta)| - \tfrac{T}{2} \log |2\pi\, Q(\theta)| - \tfrac{T}{2} \log |2\pi\, R(\theta)| \\ & -\tfrac{1}{2} \operatorname{tr}\!\left\{ P_0^{-1}(\theta) \left[ P_0^{\mathrm{s}} + (m_0^{\mathrm{s}} - m_0(\theta))\,(m_0^{\mathrm{s}} - m_0(\theta))^{\mathsf{T}} \right] \right\} \\ & -\tfrac{T}{2} \operatorname{tr}\!\left\{ Q^{-1}(\theta) \left[ \Sigma - C\, A^{\mathsf{T}}(\theta) - A(\theta)\, C^{\mathsf{T}} + A(\theta)\, \Phi\, A^{\mathsf{T}}(\theta) \right] \right\} \\ & -\tfrac{T}{2} \operatorname{tr}\!\left\{ R^{-1}(\theta) \left[ D - B\, H^{\mathsf{T}}(\theta) - H(\theta)\, B^{\mathsf{T}} + H(\theta)\, \Sigma\, H^{\mathsf{T}}(\theta) \right] \right\}, \end{aligned}$$
where
$$\Sigma = \frac{1}{T} \sum_{k=1}^{T} \left[ P_k^{\mathrm{s}} + m_k^{\mathrm{s}}\, [m_k^{\mathrm{s}}]^{\mathsf{T}} \right], \qquad \Phi = \frac{1}{T} \sum_{k=1}^{T} \left[ P_{k-1}^{\mathrm{s}} + m_{k-1}^{\mathrm{s}}\, [m_{k-1}^{\mathrm{s}}]^{\mathsf{T}} \right],$$
$$B = \frac{1}{T} \sum_{k=1}^{T} y_k\, [m_k^{\mathrm{s}}]^{\mathsf{T}}, \qquad C = \frac{1}{T} \sum_{k=1}^{T} \left[ P_k^{\mathrm{s}}\, G_{k-1}^{\mathsf{T}} + m_k^{\mathrm{s}}\, [m_{k-1}^{\mathrm{s}}]^{\mathsf{T}} \right], \qquad D = \frac{1}{T} \sum_{k=1}^{T} y_k\, y_k^{\mathsf{T}}.$$

EM algorithm for linear Gaussian models (cont.)

- If $\theta \in \{A, H, Q, R, P_0, m_0\}$, we can maximize $\mathcal{Q}$ analytically by setting the derivatives to zero.
- This leads to an iterative algorithm: run the RTS smoother, recompute the estimates, run the RTS smoother again, recompute the estimates, and so on (a scalar sketch follows this list).
- The parameters to be estimated should be identifiable for the ML/MAP estimate to make sense: for example, we cannot hope to blindly estimate all the model matrices.
- EM is only an algorithm for computing ML (or MAP) estimates. Direct energy function optimization often converges faster than EM and should be preferred in that sense. If an RTS smoother implementation is available, EM is sometimes easier to implement.
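
A scalar sketch of this iteration, estimating only $A$ in $x_k = A\, x_{k-1} + q_{k-1}$, $y_k = x_k + r_k$ with known $Q$ and $R$: the E-step runs a Kalman filter and RTS smoother, and the M-step uses the closed-form maximizer $A = C\, \Phi^{-1}$ built from the statistics above. The simulated data and initial guess are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data from x_k = a x_{k-1} + q_{k-1}, y_k = x_k + r_k (invented truth).
a_true, Q, R, T = 0.8, 0.1, 0.5, 200
x = np.zeros(T + 1)
y = np.zeros(T + 1)
for k in range(1, T + 1):
    x[k] = a_true * x[k - 1] + np.sqrt(Q) * rng.standard_normal()
    y[k] = x[k] + np.sqrt(R) * rng.standard_normal()

a = 0.2                                        # initial guess theta^(0)
for n in range(20):
    # E-step: Kalman filter with the current parameter value...
    m = np.zeros(T + 1); P = np.zeros(T + 1); P[0] = 1.0
    mp = np.zeros(T + 1); Pp = np.zeros(T + 1)
    for k in range(1, T + 1):
        mp[k] = a * m[k - 1]
        Pp[k] = a * P[k - 1] * a + Q
        K = Pp[k] / (Pp[k] + R)
        m[k] = mp[k] + K * (y[k] - mp[k])
        P[k] = (1 - K) * Pp[k]
    # ...followed by the RTS smoother.
    ms = m.copy(); Ps = P.copy(); G = np.zeros(T + 1)
    for k in range(T - 1, -1, -1):
        G[k] = P[k] * a / Pp[k + 1]
        ms[k] = m[k] + G[k] * (ms[k + 1] - mp[k + 1])
        Ps[k] = P[k] + G[k] * (Ps[k + 1] - Pp[k + 1]) * G[k]
    # Sufficient statistics and the closed-form M-step for A alone.
    C = np.sum(Ps[1:] * G[:-1] + ms[1:] * ms[:-1]) / T
    Phi = np.sum(Ps[:-1] + ms[:-1] ** 2) / T
    a = C / Phi
print(a)   # should approach a_true
```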

Gaussian filtering based energy function approximation

Let's consider parameter estimation in non-linear models of the form
$$x_k = f(x_{k-1}, \theta) + q_{k-1}$$
$$y_k = h(x_k, \theta) + r_k.$$
We can now approximate the energy function by replacing the Kalman filter with a Gaussian filter. The approximate energy function recursion becomes
$$\varphi_k(\theta) \approx \varphi_{k-1}(\theta) + \tfrac{1}{2} \log |2\pi\, S_k(\theta)| + \tfrac{1}{2}\, v_k^{\mathsf{T}}(\theta)\, S_k^{-1}(\theta)\, v_k(\theta),$$
where the terms $v_k(\theta)$ and $S_k(\theta)$ are given by a Gaussian filter with the parameters fixed to $\theta$.

Gaussian smoothing based EM algorithm

The approximation to the $\mathcal{Q}$ function can now be written as
$$\begin{aligned} \mathcal{Q}(\theta, \theta^{(n)}) \approx {} & -\tfrac{1}{2} \log |2\pi\, P_0(\theta)| - \tfrac{T}{2} \log |2\pi\, Q(\theta)| - \tfrac{T}{2} \log |2\pi\, R(\theta)| \\ & -\tfrac{1}{2} \operatorname{tr}\!\left\{ P_0^{-1}(\theta) \left[ P_0^{\mathrm{s}} + (m_0^{\mathrm{s}} - m_0(\theta))\,(m_0^{\mathrm{s}} - m_0(\theta))^{\mathsf{T}} \right] \right\} \\ & -\tfrac{1}{2} \sum_{k=1}^{T} \operatorname{tr}\!\left\{ Q^{-1}(\theta)\, \mathrm{E}\!\left[ (x_k - f(x_{k-1}, \theta))\,(x_k - f(x_{k-1}, \theta))^{\mathsf{T}} \mid y_{1:T} \right] \right\} \\ & -\tfrac{1}{2} \sum_{k=1}^{T} \operatorname{tr}\!\left\{ R^{-1}(\theta)\, \mathrm{E}\!\left[ (y_k - h(x_k, \theta))\,(y_k - h(x_k, \theta))^{\mathsf{T}} \mid y_{1:T} \right] \right\}, \end{aligned}$$
where the expectations can be computed using the Gaussian RTS smoother results.

Particle filtering approximation of energy function [1/3]

In the particle filtering approach we can consider generic models of the form
$$\theta \sim p(\theta), \qquad x_0 \sim p(x_0 \mid \theta), \qquad x_k \sim p(x_k \mid x_{k-1}, \theta), \qquad y_k \sim p(y_k \mid x_k, \theta).$$
Using the particle filter results, we can form an importance sampling approximation as follows:
$$p(y_k \mid y_{1:k-1}, \theta) \approx \sum_i w_{k-1}^{(i)}\, v_k^{(i)},$$
where
$$v_k^{(i)} = \frac{p(y_k \mid x_k^{(i)}, \theta)\, p(x_k^{(i)} \mid x_{k-1}^{(i)}, \theta)}{\pi(x_k^{(i)} \mid x_{k-1}^{(i)}, y_{1:k})}$$
and $w_{k-1}^{(i)}$ are the previous particle filter weights.

Particle filtering approximation of energy function [2/3]

SIR based energy function approximation:
1. Draw samples $x_k^{(i)}$ from the importance distributions $\pi(x_k \mid x_{k-1}^{(i)}, y_{1:k})$, $i = 1, \ldots, N$.
2. Compute the weights
$$v_k^{(i)} = \frac{p(y_k \mid x_k^{(i)}, \theta)\, p(x_k^{(i)} \mid x_{k-1}^{(i)}, \theta)}{\pi(x_k^{(i)} \mid x_{k-1}^{(i)}, y_{1:k})}$$
and compute the estimate of $p(y_k \mid y_{1:k-1}, \theta)$ as
$$\hat{p}(y_k \mid y_{1:k-1}, \theta) = \sum_i w_{k-1}^{(i)}\, v_k^{(i)}.$$

Particle filtering approximation of energy function [3/3]

SIR based energy function approximation (cont.):
3. Compute the normalized weights as
$$w_k^{(i)} \propto w_{k-1}^{(i)}\, v_k^{(i)}.$$
4. If the effective number of particles is too low, perform resampling.
The approximation of the marginal likelihood of the parameters is
$$p(y_{1:T} \mid \theta) \approx \prod_{k=1}^{T} \hat{p}(y_k \mid y_{1:k-1}, \theta),$$
and the corresponding energy function approximation is
$$\varphi_T(\theta) \approx -\log p(\theta) - \sum_{k=1}^{T} \log \hat{p}(y_k \mid y_{1:k-1}, \theta).$$
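
A minimal bootstrap particle filter sketch of this likelihood estimate for an invented scalar non-linear model: the importance distribution is the dynamic model itself, so $v_k^{(i)}$ reduces to the measurement likelihood, and with resampling at every step the previous weights are uniform, giving $\hat{p}(y_k \mid y_{1:k-1}, \theta) = \frac{1}{N} \sum_i v_k^{(i)}$.

```python
import numpy as np
from scipy.stats import norm

def pf_loglik(theta, y, N=500, seed=0):
    """Bootstrap PF estimate of log p(y_{1:T} | theta) for an invented model."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(N)                   # draws from p(x_0 | theta)
    log_lik = 0.0
    for yk in y:
        # Propagate through the dynamic model p(x_k | x_{k-1}, theta).
        x = theta * np.sin(x) + 0.3 * rng.standard_normal(N)
        v = norm.pdf(yk, loc=x, scale=0.5)       # v_k^(i) = p(y_k | x_k^(i), theta)
        log_lik += np.log(np.mean(v))            # hat p(y_k | y_{1:k-1}, theta)
        x = rng.choice(x, size=N, p=v / np.sum(v))   # resample at every step
    return log_lik

y = np.array([0.1, 0.5, -0.2, 0.3])   # invented data
print(pf_loglik(0.8, y))
```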

Particle Markov chain Monte Carlo (PMCMC)

- The particle filter based energy function approximation can now be used in a Metropolis-Hastings based MCMC algorithm (see the sketch below).
- With finite $N$, the likelihood is only an approximation, and thus we would expect the algorithm to be an approximation only. Surprisingly, it turns out that this algorithm is an exact MCMC algorithm even with finite $N$.
- The resulting algorithm is called the particle Markov chain Monte Carlo (PMCMC) method.
- Computing ML and MAP estimates via the particle filter approximation is problematic, because resampling causes discontinuities in the likelihood approximation.
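
A minimal particle marginal Metropolis-Hastings (PMMH) sketch built on pf_loglik from the previous sketch: a random-walk proposal on $\theta$, a flat prior (so the acceptance ratio is just the likelihood ratio), and a fresh particle filter seed for every likelihood evaluation. The key detail is that the stored noisy estimate for the current point is reused rather than recomputed; the proposal scale and chain length are arbitrary.

```python
import numpy as np

# Assumes pf_loglik from the previous sketch is in scope.
rng = np.random.default_rng(1)
y = np.array([0.1, 0.5, -0.2, 0.3])   # invented data

theta = 0.5
ll = pf_loglik(theta, y, seed=int(rng.integers(1 << 31)))
chain = []
for i in range(2000):
    theta_star = theta + 0.2 * rng.standard_normal()
    ll_star = pf_loglik(theta_star, y, seed=int(rng.integers(1 << 31)))
    # Flat prior, so log alpha is just the estimated log-likelihood ratio.
    if np.log(rng.uniform()) <= min(0.0, ll_star - ll):
        theta, ll = theta_star, ll_star   # accept; keep the noisy estimate for reuse
    chain.append(theta)

print(np.mean(chain[500:]))
```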

Particle smoothing based EM algorithm

Recall that on the E-step of the EM algorithm we need to compute
$$\mathcal{Q}(\theta, \theta^{(n)}) = \mathcal{I}_1(\theta, \theta^{(n)}) + \mathcal{I}_2(\theta, \theta^{(n)}) + \mathcal{I}_3(\theta, \theta^{(n)}),$$
where
$$\mathcal{I}_1(\theta, \theta^{(n)}) = \int p(x_0 \mid y_{1:T}, \theta^{(n)}) \log p(x_0 \mid \theta)\, \mathrm{d}x_0$$
$$\mathcal{I}_2(\theta, \theta^{(n)}) = \sum_{k=1}^{T} \iint p(x_k, x_{k-1} \mid y_{1:T}, \theta^{(n)}) \log p(x_k \mid x_{k-1}, \theta)\, \mathrm{d}x_k\, \mathrm{d}x_{k-1}$$
$$\mathcal{I}_3(\theta, \theta^{(n)}) = \sum_{k=1}^{T} \int p(x_k \mid y_{1:T}, \theta^{(n)}) \log p(y_k \mid x_k, \theta)\, \mathrm{d}x_k.$$
It is also possible to use particle smoothers to approximate the required expectations.

Particle smoothing based EM algorithm (cont.)

For example, by using the backward simulation smoother, we can approximate the expectations as
$$\mathcal{I}_1(\theta, \theta^{(n)}) \approx \frac{1}{S} \sum_{i=1}^{S} \log p(\tilde{x}_0^{(i)} \mid \theta)$$
$$\mathcal{I}_2(\theta, \theta^{(n)}) \approx \frac{1}{S} \sum_{k=0}^{T-1} \sum_{i=1}^{S} \log p(\tilde{x}_{k+1}^{(i)} \mid \tilde{x}_k^{(i)}, \theta)$$
$$\mathcal{I}_3(\theta, \theta^{(n)}) \approx \frac{1}{S} \sum_{k=1}^{T} \sum_{i=1}^{S} \log p(y_k \mid \tilde{x}_k^{(i)}, \theta).$$
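
A minimal sketch of these Monte Carlo sums, assuming $S$ backward-simulated trajectories are already available as an array xs of shape (S, T+1); both the trajectories and the scalar model densities (a linear Gaussian stand-in with hypothetical parameters) are placeholders for the output of a real backward simulation smoother.

```python
import numpy as np
from scipy.stats import norm

S, T = 50, 30
rng = np.random.default_rng(2)
xs = rng.standard_normal((S, T + 1))   # placeholder backward-simulated trajectories
y = rng.standard_normal(T + 1)         # placeholder data; y[0] is unused

def Q_approx(a, xs, y):
    """Monte Carlo approximation of Q for a scalar linear Gaussian stand-in model."""
    I1 = np.mean(norm.logpdf(xs[:, 0], 0.0, 1.0))
    I2 = np.mean(np.sum(norm.logpdf(xs[:, 1:], a * xs[:, :-1], np.sqrt(0.1)), axis=1))
    I3 = np.mean(np.sum(norm.logpdf(y[1:], xs[:, 1:], np.sqrt(0.5)), axis=1))
    return I1 + I2 + I3

print(Q_approx(0.8, xs, y))
```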

Summary

- The marginal posterior distribution of the parameters can be computed from the results of a Bayesian filter.
- Given the marginal posterior, we can, e.g., use optimization methods to compute MAP estimates or sample from the posterior using MCMC methods.
- The expectation-maximization (EM) algorithm can also be used for iterative computation of ML or MAP estimates using Bayesian smoother results.
- The parameter posterior for linear Gaussian models can be evaluated with the Kalman filter.
- The expectations required for implementing the EM algorithm for linear Gaussian models can be evaluated with the RTS smoother.
- For non-linear/non-Gaussian models, the parameter posterior and the EM algorithm can be approximated with Gaussian filters/smoothers and particle filters/smoothers.