Sequential Bayesian Inference for Dynamic State Space Model Parameters

Arnab Bhattacharya and Simon Wilson

1. INTRODUCTION

Dynamic state-space models [24], consisting of a latent Markov process X_0, X_1, ... and noisy observations Y_1, Y_2, ... that are conditionally independent, are used in a wide variety of applications, e.g. wireless networks [8], object tracking [21] and econometrics [7]. The model is specified by an initial distribution p(x_0), a transition kernel p(x_t | x_{t−1}, θ) and an observation distribution p(y_t | x_t, θ). These distributions are defined in terms of a set of K static (i.e. non-time-varying) parameters θ = (θ_1, ..., θ_K). The joint model to time T is:

    p(y_{1:T}, x_{0:T}, θ) = [ ∏_{t=1}^{T} p(y_t | x_t, θ) p(x_t | x_{t−1}, θ) ] p(x_0) p(θ),    (1.1)

where y_{1:T} = (y_1, ..., y_T), etc. These models are also known as hidden Markov models [20].

In this paper we focus on sequential Bayesian estimation of the static (i.e. not time-dependent) parameters θ of these models; at time T, we observe y_T and wish to compute the posterior distribution p(θ | y_{1:T}). Further, this is to be done in a setting where on-line estimation is required, so that there is a trade-off between computation speed and accuracy.
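
To fix ideas, the three model components p(x_0), p(x_t | x_{t−1}, θ) and p(y_t | x_t, θ) can be encoded directly as simulation functions. The following R sketch does this for an illustrative linear-Gaussian special case; the function names, the parameter list and the numerical values are placeholders, not anything prescribed by the paper.

```r
## Illustrative encoding of a dynamic state-space model as three sampling functions.
model <- list(
  rinit        = function(theta)         rnorm(1, 0, 1),                                # draw from p(x_0)
  rtransition  = function(x_prev, theta) rnorm(1, theta$phi * x_prev, theta$sigma_sys), # p(x_t | x_{t-1}, theta)
  robservation = function(x, theta)      rnorm(1, x, theta$sigma_obs)                   # p(y_t | x_t, theta)
)

## Simulate a short series under a fixed static parameter theta.
theta <- list(phi = 0.9, sigma_sys = 1, sigma_obs = 0.5)
x <- model$rinit(theta)
y <- numeric(10)
for (t in 1:10) {
  x    <- model$rtransition(x, theta)   # latent Markov step
  y[t] <- model$robservation(x, theta)  # conditionally independent observation
}
```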

The constraint on computation time means that, for some sufficiently large T, it becomes infeasible to simply recompute p(θ | y_{1:T}) from scratch by Monte Carlo (e.g. MCMC), by a functional approximation method (e.g. the integrated nested Laplace approximation), by some off-line application of the particle filter [6] adapted to also infer these static parameters [1], or by maximum-likelihood-based filtering methods such as [10]. We propose a method that accomplishes this for a fairly broad set of dynamic state space models.

As regards the static parameter estimation problem, almost no closed-form solutions are available, even in linear Gaussian models [2]. [24] show that conjugate sequential updates of the state and observation variances, as well as of x_0, are available in some specific cases. Noteworthy work on on-line inference for static parameters, applicable to general state space models, is found in [14, 23] and [3]. [12] is a good overview of parameter estimation, covering both off-line approaches and the use of sequential Monte Carlo for on-line parameter estimation.

The rest of the chapter is organized as follows. Section 2. outlines the principle of the method. Section 3. describes one of the main issues to be resolved in order to implement the method: approximations to the one-step-ahead filtering and prediction densities. Section 4. illustrates the method and assesses its performance against alternative approaches. Section 5. contains some concluding remarks.

2. PRINCIPLE

The principle of the proposed method is based on two fundamental theoretical ideas. The first idea is that many dynamic state space models have a relatively small number of static parameters, so that in principle p(θ | y_{1:T}) can be computed and stored on a discrete grid of practical size. In a good number of situations, the parameters are themselves time-varying processes, and there are hyper-parameters that are static but unknown.

This has been noted as a property of many latent models [22]. In particular, the transition kernel of some dynamic state space models is itself defined in terms of a set of static parameters (θ_1), e.g. p(x_t | x_{t−1}, λ_t, θ_1), and time-varying parameters (λ_t) that also evolve as a Markov process depending on some hyper-parameters (θ_2), e.g. p(λ_{0:T} | θ_2) = p(λ_0 | θ_2) ∏_{t=1}^{T} p(λ_t | λ_{t−1}, θ_2); dynamic linear models with a trend [24] are an example. Without loss of generality, such cases are incorporated into our problem by taking (x_t, λ_t) to be the latent process and by denoting the complete set of static parameters and hyper-parameters by θ.

The second significant point is that useful identities exist for parameter estimation in latent models. One such identity, known as the basic marginal likelihood identity (BMI), is reported in [4], where it is used for the calculation of the marginal likelihood. In this paper, the following approach is taken:

    p(θ | y_{1:T}) ∝ p(y_{1:T}, θ) = p(y_{1:T}, x_{0:T}, θ) / p(x_{0:T} | y_{1:T}, θ), evaluated at x_{0:T} = x*(θ),    (2.1)

valid for any x_{0:T} for which p(x_{0:T} | y_{1:T}, θ) > 0. Under the assumption that p(x_{0:T} | θ) is Gaussian, this identity forms the basis of the integrated nested Laplace approximation (INLA) of [22]. There, a Gaussian approximation is made for the denominator term and is evaluated on a discrete grid of values of θ; the method also includes a way to derive such a grid intelligently. The value x_{0:T} = x*(θ) is allowed to be a function of θ, and typically x*(θ) = arg max_{x_{0:T}} p(x_{0:T} | y_{1:T}, θ) is used, which is the mean of the Gaussian approximation to p(x_{0:T} | y_{1:T}, θ). Another useful identity is:

    p(θ | y_{1:T}) ∝ p(θ | y_{1:T−1}) p(y_T | y_{1:T−1}, θ).    (2.2)
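
Identity 2.2 is simply Bayes' theorem combined with the factorisation of the likelihood over time; spelling it out once makes the sequential structure explicit:

    p(θ | y_{1:T}) ∝ p(θ) p(y_{1:T} | θ)
                   = p(θ) p(y_{1:T−1} | θ) p(y_T | y_{1:T−1}, θ)
                   ∝ p(θ | y_{1:T−1}) p(y_T | y_{1:T−1}, θ),

so each new observation updates the current parameter posterior by a single multiplication with the one-step-ahead predictive likelihood.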

In our case, we estimate p(y_T | y_{1:T−1}, θ) by the following:

    p(y_T | y_{1:T−1}, θ) = p(y_T | x_T, θ) p(x_T | y_{1:T−1}, θ) / p(x_T | y_{1:T}, θ), evaluated at x_T = x*(θ);    (2.3)

as with Equation 2.1, we choose x*(θ) = arg max_{x_T} p(x_T | y_{1:T}, θ). This identity is clearly useful for sequential estimation and does not suffer from the dimension-increasing problem of Equation 2.1. Starting from Equation 2.2, if prediction and filtering approximations p(x_T | y_{1:T−1}, θ) and p(x_T | y_{1:T}, θ) are available, any approximation of p(θ | y_{1:T−1}) at time T − 1 can be updated:

    p(θ | y_{1:T}) ∝ p(θ | y_{1:T−1}) p(y_T | y_{1:T−1}, θ)
                   = p(θ | y_{1:T−1}) p(y_T | x_T, θ) p(x_T | y_{1:T−1}, θ) / p(x_T | y_{1:T}, θ), evaluated at x_T = x*(θ),    (2.4)

where x*(θ) = arg max_{x_T} p(x_T | y_{1:T}, θ). For θ of low dimension, computing Equation 2.4 on a discrete grid offers the potential for fast sequential estimation.

This suggests the following sequential estimation algorithm, to be used when approximate prediction and filtering distributions are available; we name it SINLA, for Sequential INLA. Initially, p(θ | y_{1:T}) is approximated by INLA, because it is accurate and produces a discrete grid Θ_T over which p(θ | y_{1:T}) is computed. At some time T_INLA this will prove too slow to compute, and from then on the sequential update of Equation 2.4 is used. The main issue that remains to be addressed in order to implement this algorithm is the form of the approximations p(x_T | y_{1:T−1}, θ) and p(x_T | y_{1:T}, θ); it is addressed in the next section.
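
The R sketch below applies the update of Equation 2.4 repeatedly on a fixed parameter grid, for an illustrative scalar linear-Gaussian model (not the example of Section 4.): x_t = φ x_{t−1} + η_t with η_t ~ N(0, q), and y_t = x_t + v_t with v_t ~ N(0, r), where θ = φ is discretised on a grid and q, r are treated as known. A flat prior over the grid stands in for the INLA initialisation, the Kalman filter supplies the exact prediction and filtering densities, and x*(θ) is the filtering mode (equal to the mean in the Gaussian case); for non-Gaussian models the approximations of Section 3. would replace these exact quantities.

```r
## SINLA-style grid update (Equation 2.4) for an illustrative AR(1)-plus-noise model.
set.seed(2)
q <- 1; r <- 1; phi_true <- 0.7; T_len <- 200
x <- 0; y <- numeric(T_len)
for (t in 1:T_len) {
  x    <- phi_true * x + rnorm(1, sd = sqrt(q))   # latent AR(1) state
  y[t] <- x + rnorm(1, sd = sqrt(r))              # noisy observation
}

grid     <- seq(-0.99, 0.99, length.out = 200)    # discrete grid over theta = phi
log_post <- rep(0, length(grid))                  # log p(theta | y_{1:t}), flat start
m <- rep(0, length(grid)); P <- rep(1, length(grid))  # per-theta filtering mean / variance

for (t in 1:T_len) {
  for (k in seq_along(grid)) {
    phi    <- grid[k]
    m_pred <- phi * m[k];  P_pred <- phi^2 * P[k] + q                    # p(x_t | y_{1:t-1}, theta)
    K      <- P_pred / (P_pred + r)
    m_filt <- m_pred + K * (y[t] - m_pred);  P_filt <- (1 - K) * P_pred  # p(x_t | y_{1:t}, theta)
    x_star <- m_filt                                                     # filtering mode x*(theta)
    log_pred <- dnorm(y[t], x_star, sqrt(r), log = TRUE) +               # Equation 2.3 at x*(theta)
                dnorm(x_star, m_pred, sqrt(P_pred), log = TRUE) -
                dnorm(x_star, m_filt, sqrt(P_filt), log = TRUE)
    log_post[k] <- log_post[k] + log_pred                                # Equation 2.4, unnormalised
    m[k] <- m_filt;  P[k] <- P_filt
  }
  log_post <- log_post - max(log_post)            # rescale for numerical stability
}
grid[which.max(log_post)]                         # posterior mode of phi, should be close to phi_true
```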

3. PREDICTION AND FILTERING DENSITY APPROXIMATIONS

For the Kalman filter (where p(y_t | x_t, θ), p(x_0) and p(x_t | x_{t−1}, θ) are linear and Gaussian), the prediction and filtering distributions are Gaussian, Equation 2.2 can be computed exactly, and the INLA approximation is also exact. The means and variances of these Gaussians are sequentially updated [18]. All that we need to store are the means and variances of the prediction and filtering distributions for each θ in the grid; from these, p(θ | y_{1:T}) can be computed.

An equivalent definition of Equation 1.1, and one that is useful in describing some aspects of the approximations that we propose, is the general state-space representation:

    y_t = f(x_t, u_t, v_t, θ);    (3.1)
    x_t = g(x_{t−1}, w_t, θ),    (3.2)

where v_t and w_t are observation and system errors, and u_t are (possibly non-existent) exogenous variables. The likelihood p(y_t | x_t, θ) is specified by f and v_t, while the transition density p(x_t | x_{t−1}, θ) is specified by g and w_t. Two of the several algorithms found in the literature are described here.

3.1 Basic Approximations

When either the linearity or the Gaussianity assumption does not hold, two extensions of the Kalman filter can still be computed quickly.

Extended KF

The extended Kalman filter was one of the first generalisations of the Kalman filter to non-linear models [17]. It linearizes a non-linear model to create a Kalman-filter-style (i.e. Gaussian) approximation to the filtering and prediction densities [9]. Hence the prediction and filtering distribution approximations are Gaussian and make use of the fast sequential updating of their means and variances.
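
As a point of reference, the standard extended Kalman filter recursion in the additive-noise special case of Equations 3.1-3.2, i.e. x_t = g(x_{t−1}, θ) + w_t and y_t = f(x_t, θ) + v_t with w_t ~ N(0, Q) and v_t ~ N(0, R), is the textbook form [9]. Writing G_t and F_t for the Jacobians of g and f evaluated at the current estimates,

    Prediction:  m_{t|t−1} = g(m_{t−1}),    P_{t|t−1} = G_t P_{t−1} G_t' + Q,
    Update:      S_t = F_t P_{t|t−1} F_t' + R,    K_t = P_{t|t−1} F_t' S_t^{-1},
                 m_t = m_{t|t−1} + K_t (y_t − f(m_{t|t−1})),    P_t = (I − K_t F_t) P_{t|t−1},

and the one-step-ahead predictive density needed in Equation 2.3 is then approximated by N(y_t; f(m_{t|t−1}), S_t). In the general (non-additive) case the Jacobians with respect to v_t and w_t enter these expressions as well.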

Unscented KF

The unscented Kalman filter also produces Gaussian approximations to the filtering and prediction densities, but avoids linearising by approximately propagating the means and covariances through the non-linear function [11]. It tends to be more accurate than the extended Kalman filter, more so for strongly non-linear models. The non-linearity in the model is propagated deterministically through a small set of points, known as sigma points. A weight is associated with each point, and the means and variances of the Gaussian approximations to p(x_t | y_{1:t−1}, θ) and p(x_t | y_{1:t}, θ) are estimated as weighted means and variances of these points. The method is computationally fast, as it only requires updating these points, and then the means and variances of the approximation, at each observation.
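
The core of the unscented Kalman filter is the unscented transform of [11]. The R sketch below implements it in its original single-parameter (κ) form for an arbitrary map g; the cubic example map and the choice κ = 3 − n are illustrative, not taken from the paper.

```r
## Unscented transform: propagate N(m, P) through a non-linear map g() via sigma points.
unscented_transform <- function(m, P, g, kappa = 3 - length(m)) {
  n  <- length(m)
  A  <- t(chol((n + kappa) * P))                 # matrix square root: A %*% t(A) = (n + kappa) P
  X  <- cbind(m, m + A, m - A)                   # 2n + 1 sigma points (one per column)
  w  <- c(kappa / (n + kappa), rep(1 / (2 * (n + kappa)), 2 * n))  # sigma-point weights
  Y  <- matrix(apply(X, 2, g), ncol = 2 * n + 1) # propagate each sigma point through g
  mu <- drop(Y %*% w)                            # weighted mean approximates E[g(x)]
  D  <- Y - mu
  list(mean = mu, cov = D %*% diag(w) %*% t(D))  # weighted spread approximates Cov[g(x)]
}

## Example: propagate x ~ N(0.5, 0.2) through the cubic map g(x) = x^3.
unscented_transform(0.5, matrix(0.2), function(x) x^3)
```

In the UKF, this transform is applied once with g taken as the state transition (giving the prediction density) and once with g taken as the observation function (giving the quantities needed for the measurement update).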

4. EXAMPLE: LINEAR DYNAMIC MODEL

Our method is implemented on an example and compared to INLA (an off-line method) and a particle filter developed for on-line inference of static parameters. Average performance is measured across many replications of simulated data. To keep the computation-time comparison fair, all methods were implemented in R [19]. In all replications, T_INLA was set to 20. Model performance was compared using two different measures, described below.

Mahalanobis distance: used to judge the accuracy of the parameter estimates in a multivariate parameter space [15].

Computation time: the time to compute the posterior approximation is also recorded.

The statistical model in this example is assumed to be of the form

    y_t = x_{t−1} 1 + ε_t,    (4.1)
    x_t = φ x_{t−1} + η_t,    (4.2)

where η_t ~ N(0, σ²_sys), 1 is a vector of 1's and ε_t ~ MVN(0, Σ). The covariance matrix Σ is assumed to depend on a single unknown parameter: Σ = σ²_obs Σ̃. The entries of Σ̃ are of the form

    Σ̃_ij = 1 if i = j,    Σ̃_ij = exp(−r d(i, j)) if i ≠ j,

where r > 0 and d(i, j) is some measure of distance between nodes i and j; Σ̃ defines the well-known Gaussian spatial process [16]. Data have been generated by fixing φ = 0.7, σ²_obs = 1 and σ²_sys = 1, and y_t is assumed to be of dimension 3. Further, the form of Σ̃ is known and fixed. Instead of variance parameters, precision parameters are used, denoted τ_Obs and τ_Sys respectively.

The algorithm is run with T_INLA = 20. The Kalman filter is used for optimal filtering at each step, since the system and observation equations are linear. The simulation has been replicated 10 times for SINLA and 5 times each for INLA and BSMC. For SINLA (and also INLA), the AR parameter φ has a normal prior with mean 0.1 and s.d. 1, truncated at −0.99 and 0.99; both τ_Sys and τ_Obs have a gamma prior with parameters 3 and 3. Stronger priors were provided for BSMC: the prior for φ is normal with mean 0.5 and s.d. 1, again truncated at −0.99 and 0.99, whereas for the gamma priors both parameters are set to 1.
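
For concreteness, the following R sketch simulates data from Equations 4.1-4.2 as written above, with φ = 0.7, σ²_obs = σ²_sys = 1 and dim(y_t) = 3. The distance d(i, j) = |i − j| and the decay r = 1 are illustrative choices only; the paper states that the form of Σ̃ is fixed but does not report r or d(i, j) here.

```r
## Simulate from the linear dynamic model of Equations 4.1-4.2 (illustrative settings).
set.seed(1)
T_len    <- 500                               # number of observations
p        <- 3                                 # dimension of y_t
phi      <- 0.7                               # AR parameter
sig2_obs <- 1                                 # observation variance sigma^2_obs
sig2_sys <- 1                                 # system variance sigma^2_sys
r        <- 1                                 # assumed decay parameter

d         <- abs(outer(1:p, 1:p, "-"))        # assumed distance d(i, j) = |i - j|
Sigma_til <- ifelse(d == 0, 1, exp(-r * d))   # correlation matrix Sigma-tilde
Sigma     <- sig2_obs * Sigma_til             # observation covariance Sigma
U         <- chol(Sigma)                      # Cholesky factor for sampling MVN(0, Sigma)

x <- numeric(T_len + 1)                       # latent AR(1) state; x[1] is x_0 = 0
y <- matrix(NA_real_, T_len, p)
for (t in 1:T_len) {
  y[t, ]   <- x[t] * rep(1, p) + drop(rnorm(p) %*% U)      # Eq. 4.1: y_t built from x_{t-1}
  x[t + 1] <- phi * x[t] + rnorm(1, sd = sqrt(sig2_sys))   # Eq. 4.2: AR(1) state update
}
```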

The approximate mode of each parameter, along with approximate 95% probability bounds, is shown in Figure 1. The figure shows that our method works well initially, but that its performance degrades over time. The starting grid, as computed using INLA, does not remain sufficient to cover the support of the marginals as T increases: the mode of the posterior moves towards the tail of the set of grid points, indicating that the grid needs to shift over time. For T > T_INLA, fast methods of updating the grid must therefore be developed.

Table 1 compares the accuracy and computation time of our algorithm with two other methods, namely the on-line Bayesian sequential Monte Carlo (BSMC) of [14] and off-line Bayesian inference with INLA. The comparison is done for T = 500 and T = 1000, based on point estimates, Mahalanobis distance (MD) and computation time for each method.

Table 1: Point estimates of the parameters, measures of accuracy (MD) and computation time for the three algorithms SINLA, BSMC and INLA.

                      T = 500                               T = 1000
    Method   Estimates           MD      Time (s)   Estimates           MD      Time (s)
    SINLA    0.68, 0.98, 1.01    1.81      73.63    0.67, 0.96, 1.01    1.94      136.71
    BSMC     0.72, 1.66, 0.86     --     2230.11    0.69, 1.68, 0.81     --      4437.83
    INLA     0.70, 0.99, 1.00    0.71    3358.67    0.70, 1.00, 1.00    0.79    23631.54

INLA, as one would expect, is computationally the most expensive algorithm while producing very accurate output. The particle filter has been implemented using the R package pomp [13]. The computation time of the SMC is much higher than that of the new algorithm, and is partly dependent on the number of particles used; for this example, 10000 particles were used, with the intention of achieving greater accuracy. Nevertheless, degeneracy sets in, causing the output to be extremely inaccurate. No Mahalanobis distance values are reported for BSMC as they are extremely high, caused by a very small estimated covariance matrix combined with relatively inaccurate point estimates.

The point estimates also show the relative inaccuracy of BSMC compared to the other two methods.

5. CONCLUDING REMARKS

A method for fast sequential parameter estimation in dynamic state space models has been proposed and compared to two alternatives: the integrated nested Laplace approximation and a particle filter. In the examples that we consider, INLA proved the most accurate but also the slowest. Our method achieved much better accuracy than the particle filter and also proved to be much faster than both alternatives. It is also worth noting that our method does not suffer from the degeneracy issues that affect SMC [5].

The principal disadvantages of the approach, in its current form, are the following. The constant grid computed using INLA does not cover the true support of the posterior over time; a grid-shifting algorithm, i.e. one that dynamically adds or drops grid points over time, needs to be developed. Another crucial disadvantage is that the method is restricted to models with a relatively small number of fixed parameters. Finally, asymptotic properties related to the convergence of the filter need to be established. While the accuracy of the posterior given y_{1:t} appears to be directly related to the dimension of the latent process, as shown by [22] for INLA, that analysis needs to be extended to the sequential setting. We expect the consistency properties of our filter to depend entirely on those of the state filtering mechanism, but this remains to be investigated in future work.

References

[1] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342, 2010.

[2] Christophe Andrieu, Arnaud Doucet, and Vladislav B. Tadic. On-line parameter estimation in general state-space models. In Proceedings of the 44th IEEE Conference on Decision and Control, pages 332-337, 2005.

[3] Carlos M. Carvalho, Michael Johannes, Hedibert F. Lopes, and Nicholas Polson. Particle learning and smoothing. Statistical Science, 25(1):88-106, 2010.

[4] Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313-1321, 1995.

[5] A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2009.

[6] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, Radar and Signal Processing, 140(2):107-113, 1993.

[7] James D. Hamilton. State-space models. In R. F. Engle and D. McFadden, editors, Handbook of Econometrics, volume 4, chapter 50, pages 3039-3080. Elsevier, 1986.

[8] S. Haykin, K. Huber, and Zhe Chen. Bayesian sequential state estimation for MIMO wireless communications. Proceedings of the IEEE, 92(3):439-454, 2004.

[9] Simon Haykin. Kalman Filtering and Neural Networks. Wiley-Interscience, 2001.

[10] Edward L. Ionides, Anindya Bhadra, Yves Atchadé, and Aaron King. Iterated filtering. Annals of Statistics, 39(3):1776-1802, 2011.

[11] S. J. Julier and J. K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: The 11th International Symposium on Aerospace/Defense Sensing, Simulation and Controls, Orlando, Florida, pages 182-193, 1997.

[12] N. Kantas, Sumeetpal S. Singh, and J. M. Maciejowski. An overview of sequential Monte Carlo methods for parameter estimation in general state-space models. In Proceedings of the IFAC System Identification (SysId) Meeting, 2009.

[13] Aaron A. King, Edward L. Ionides, Carles Martinez Bretó, Steve Ellner, Bruce Kendall, Helen Wearing, Matthew J. Ferrari, Michael Lavine, and Daniel C. Reuman. pomp: Statistical inference for partially observed Markov processes (R package), 2010.

[14] J. Liu and M. West. Combined parameter and state estimation in simulation-based filtering. In A. Doucet, N. de Freitas, and N. J. Gordon, editors, Sequential Monte Carlo Methods in Practice. Springer-Verlag, New York, 2000.

[15] P. C. Mahalanobis. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2(1):49-55, 1936.

[16] B. Matérn. Spatial Variation. Springer-Verlag, 1986.

[17] B. A. McElhoe. An assessment of the navigation and course corrections for a manned flyby of Mars or Venus. IEEE Transactions on Aerospace and Electronic Systems, 2:613-623, 1966.

[18] Richard J. Meinhold and Nozer D. Singpurwalla. Understanding the Kalman filter. The American Statistician, 37(2):123-127, 1983.

[19] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008.

[20] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, February 1989.

[21] Branko Ristic, Sanjeev Arulampalam, and Neil Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, 2004.

[22] H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 71:319-392, 2009.

[23] Geir Storvik. Particle filters for state-space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50(2):281-289, February 2002.

[24] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer Series in Statistics. Springer, second edition, 1997.

Figure 1: Plots (a), (b) and (c) are trace plots showing trajectories of the averaged approximate mode and approximate 95% probability bounds of the posteriors of (a) the AR parameter φ, (b) the state precision parameter τ_Sys and (c) the observation precision parameter τ_Obs. The light grey line displays the true parameter value.