Recursive Kernel Density Estimation of the Likelihood for Generalized State-Space Models


A. E. Brockwell

February 28, 2005

Abstract

In time series analysis, the family of generalized state-space models is extremely rich. However, their likelihood functions are intractable, except in certain special cases, and this limits the options in analyses. In practice, a study typically (1) uses some kind of approximation to the likelihood function, for instance, one obtained analytically or by making use of the particle filter or related methods, (2) adopts a standard Markov chain Monte Carlo approach to parameter estimation, or (3) sacrifices goodness-of-fit for numerical convenience by choosing an approximating model for which the likelihood can be computed. Each of these approaches has advantages and disadvantages, but since none of them yields a consistent estimate of the likelihood, model selection remains an outstanding problem for the general family. This paper addresses this problem by introducing a recursive estimator of the log-likelihood for the generalized state-space model, which is obtained as a kernel density estimator driven by the iterations of a Markov chain. The estimator is very simple to compute, and is shown to converge almost surely to the exact log-likelihood as the number of iterations of the Markov chain approaches infinity.

Keywords: generalized state-space model, non-Gaussian, nonlinear, likelihood, recursive kernel density estimator, Markov chain, dynamic model

1 Introduction

The family of generalized state-space models (also sometimes referred to as nonlinear dynamic models, and partly discussed, for instance, in Shumway and Stoffer, 2000; Brockwell and Davis, 2002; West and Harrison, 1997) is arguably the richest family of time series models

considered in the literature. A generalized state-space model consists of two components: a latent process, called the state process, usually assumed to be Markovian, and an observation process, whose elements have conditional distributions given values of corresponding elements of the latent process. Special cases of the model include classical ARIMA models (Box and Jenkins, 1970) (multivariate and non-Gaussian varieties included), financial models such as GARCH models (Engle, 1982; Bollerslev, 1986) and stochastic volatility models (Ghysels et al., 1996; Taylor, 1994), a range of nonlinear models used in engineering (a number of interesting examples can be found in Doucet et al., 2001), models for censored time series, time series of counts (such as the neuron-spiking time series considered in Brockwell et al., 2004), and many others. Even for a number of widely studied special cases, analysis in the literature is carried out either by using approximating models with tractable likelihood functions, or by making use of likelihood approximations. This has at least two drawbacks. One is that the resulting parameter estimates (and forecasts based on those parameters) are likely to be biased. The other is that without the likelihood, model comparison and selection is difficult. Of course, in certain cases, the approximations used may be good enough that the exact likelihood is not necessary, but without a means of computing the exact likelihood (or otherwise bounding the approximation error), it is not possible to know the degree to which the approximation error matters. Some of these problems can be avoided by adopting a Bayesian approach to inference. By doing so, it is possible, at least, to estimate posterior distributions of parameters for more complicated special cases, but model comparison remains a serious problem, since effective methods for likelihood calculation have not been developed in this context.
(The key problem here is that the likelihood is in fact an integral over all possible values of the latent state process.) Thus, for the sake of (1) allowing likelihood-based analysis of a richer class of time series models than previously considered in the literature, (2) performing model comparison and selection for such models, and (3) assessing the quality of approximations to the likelihood in various special cases, it is desirable to be able to compute the likelihood for the general family of models. In this paper, we combine the techniques of Markov chain simulation and recursive kernel density estimation to obtain an estimator of the log-likelihood for the model. We then show that the estimator converges almost surely to the true log-likelihood, as the number of iterations of a particular Markov chain increases to infinity. (As is usually the case, there is a penalty to pay for the increase in generality of the family of models. In the few special cases where the likelihood can be obtained exactly, the proposed estimator is more computationally demanding than the standard expressions.) The development of the estimator and proof of the consistency result is far less straightforward than it may sound. Standard Markov chain Monte Carlo techniques such as those developed by Carlin et al. (1992) are not useful for this purpose, because they do not yield draws from the distributions required to compute the likelihood. Another nice (gradient-based) approach to parameter estimation for this class of models was developed recently by Andrieu and Doucet (2003), but their scheme requires the model to be stationary, and does not provide a consistent estimator of the likelihood for a finite-length time series. Therefore a new simulation scheme is developed, which technically is not a Markov chain Monte Carlo scheme, since the limiting distribution is unknown (even to within a constant of proportionality), even though its marginals are known. Furthermore, although consistency results have been established for recursive kernel density estimators based on iid samples (Wolverton and Wagner, 1969; Yamato, 1971; Wegman and Davies, 1979), and based on samples which are identically distributed but dependent (Masry and Györfi, 1987), such results have not appeared for recursive density estimators based on samples from an ergodic Markov chain (although it is worth noting that Yu, 1994, develops results similar to those needed here, but for non-recursive kernel density estimators). It is also worth discussing the relationship between the approach developed in this paper and particle filtering-based methods. Such methods, developed in their modern form in Kitagawa (1996) and Gordon et al. (1993), and discussed in detail in Doucet et al. (2001), are arguably some of the most important recent developments in dealing with the generalized state-space model. They rely on particle, or Monte Carlo, approximations to conditional distributions of latent states, given observations, and in theory, they could also be used along with kernel density estimation to estimate the log-likelihood for a given model. A key difference between that approach and the approach proposed in this paper is the nature of convergence of the estimator. Particle filtering schemes yield conditional distributions converging to the correct distribution as the number of particles increases to infinity. But it is not possible to increase the number of particles without re-running a particle filter, hence a process of sequentially increasing the number of particles until an estimator becomes good enough is very inefficient.
In contrast, the approach developed in this paper simply involves repeated scanning through the time series, and (almost sure) convergence occurs as the number of scans increases. The paper is organized as follows. In Section 2, we give a formal definition of the generalized state-space model, and we introduce our estimator of the log-likelihood. In Section 3 we present the relevant convergence results. In Section 4, we give a simple example of the estimator for simulated data coming from a model where the exact log-likelihood can be computed. In Section 5, we discuss additional potential applications of the results in this paper, and in the appendix, we prove the main results.

2 The Method

2.1 The Model

Formally, the generalized state-space model is defined on a probability space (\Omega, \mathcal{F}, P). The Markovian state process \{X_t \in \mathbb{R}^p, t = 1, 2, \ldots, T\} satisfies

P(X_1 \in A) = \int_A f_0(x_1) \, d\lambda(x_1), \quad \text{for all } A \in \mathcal{B}^p,
P(X_{t+1} \in A \mid X_t = x_t) = \int_A f_t(x_{t+1} \mid x_t) \, d\lambda(x_{t+1}), \quad \text{for all } A \in \mathcal{B}^p,   (1)

where \mathcal{B}^p is the Borel \sigma-field on \mathbb{R}^p and f_t(\cdot \mid \cdot) is a specified conditional probability density function (the transition density of the Markov chain) with respect to a measure \lambda on (\mathbb{R}^p, \mathcal{B}^p) (usually, but not necessarily, taken to be Lebesgue measure). The observed process \{Y_t \in \mathbb{R}^q, t = 1, 2, \ldots, T\} satisfies

P(Y_t \in A \mid \{X_t, t \in \mathbb{Z}\}, \{Y_s, s < t\}) = \int_A g_t(y_t \mid x_t) \, d\nu(y_t), \quad \text{for all } A \in \mathcal{B}^q,   (2)

where g_t(\cdot \mid \cdot) is a conditional probability density with respect to a measure \nu on (\mathbb{R}^q, \mathcal{B}^q) (also often taken to be Lebesgue measure), referred to as the observation density. For the sake of computing likelihoods and residuals, one is interested in the conditional densities \pi_t(x_t) of X_t, given observations Y_1, \ldots, Y_t, for t = 1, 2, \ldots, T, and the conditional one-step predictive densities p_t(x_t) of X_t, given observations Y_1, \ldots, Y_{t-1}, for t = 1, 2, \ldots, T. These are referred to, respectively, as the filtering densities and predictive densities, and can be obtained recursively by making use of the equalities

p_t(x_t) = \int f_{t-1}(x_t \mid x_{t-1}) \pi_{t-1}(x_{t-1}) \, d\lambda(x_{t-1})   (3)

and

\pi_t(x_t) \propto p_t(x_t) g_t(y_t \mid x_t).   (4)

(One typically starts with p_1(x_1) = f_0(x_1), then uses the two equations above to compute, in sequence, \pi_1(x_1), p_2(x_2), \pi_2(x_2), \ldots.) The predictive densities are useful, in particular, because the log-likelihood can be expressed as

l(y_1, \ldots, y_T) = \sum_{t=1}^T \log(q_t(y_t)),   (5)

where

q_t(y_t) = \int g_t(y_t \mid x_t) p_t(x_t) \, d\lambda(x_t)   (6)

is the one-step predictive density of Y_t, given Y_1 = y_1, \ldots, Y_{t-1} = y_{t-1}. Evaluation of the likelihood is generally difficult, since integrals accumulate in the recursions (3,4). An important exception is the celebrated special case of this model, the linear Gaussian state-space model, where X_{t+1} \sim N(AX_t, \Sigma) and Y_t \sim N(BX_t, \Lambda), for appropriately sized matrices A, B, \Sigma, and \Lambda. In this case, assuming that f_0(x_1) is a (multivariate) normal density, all the filtering and predictive densities turn out to be (multivariate) normal, and their means and variances can be determined using the well-known Kalman recursions (see Kalman, 1960).

2.2 The Estimator

Suppose that data \{Y_t = y_t, t = 1, 2, \ldots, T\} are observed. Roughly speaking, our estimator of the log-likelihood is obtained by generating a Markov chain \{Z_i \in \mathbb{R}^{p \times T}, i = 1, 2, \ldots\}, then generating draws \{W_i \in \mathbb{R}^{q \times T}\} conditionally on the values of \{Z_i\}, with the property that as i \to \infty, the density of the t-th component of W_i approaches q_t(\cdot), and finally, feeding the values W_i into recursive kernel density estimators. Because \{(Z_i, W_i)\} is Markovian, the update at iteration i can be made without knowledge of the values \{(Z_j, W_j), j < i-1\}; thus the estimator is truly recursive. Note that the procedure we propose to generate the Markov chain is not a Metropolis-Hastings algorithm, even though it contains superficial similarities to one. For convenience, in what follows, we will make the following assumption.

Assumption 2.1 The measure \nu(\cdot) in (2) is q-dimensional Lebesgue measure, and each density function g_t(y_t \mid \cdot) is strictly positive with a finite upper bound.

Remark: It is possible to adapt the scheme and results in this paper to handle cases where this assumption does not hold, but for the sake of clarity of exposition, we do not present those results here. The estimator is obtained as follows.
Let Z_0 = (Z_{0,1}, \ldots, Z_{0,T}) be a collection of T random vectors in \mathbb{R}^{p \times T} (representing an initial guess of the state sequence \{X_1, \ldots, X_T\}). Then construct a Markov chain

\{Z_i \in \mathbb{R}^{p \times T}, i = 1, 2, \ldots\}, \quad Z_i = (Z_{i,1}, \ldots, Z_{i,T}),

and a sequence

\{W_i \in \mathbb{R}^{q \times T}, i = 1, 2, \ldots\}, \quad W_i = (W_{i,1}, \ldots, W_{i,T}),

using the following procedure.
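An aside before stating the procedure: when the state dimension p is very small, the recursions (3)-(4) and the likelihood (5)-(6) can also be evaluated by brute-force numerical integration, which is useful for checking any estimator on toy problems. The sketch below (function and argument names are my own, not from the paper) discretizes a scalar state on a uniform grid; its cost grows exponentially in p, which is one motivation for the simulation-based estimator defined next.

```python
import numpy as np

def grid_log_likelihood(y, f0, f_trans, g_obs, grid):
    """Brute-force evaluation of the recursions (3)-(4) and the
    log-likelihood (5)-(6) for a scalar-state model, replacing each
    integral with a Riemann sum over a fixed state grid.

    f0(x)           : initial state density f_0, vectorized over the grid
    f_trans(x2, x1) : transition density f(x_{t+1} | x_t)
    g_obs(y, x)     : observation density g(y | x)
    grid            : uniform 1-d grid, assumed wide enough to cover the
                      region where the densities are non-negligible
    """
    dx = grid[1] - grid[0]
    p = f0(grid)                                   # p_1 = f_0
    loglik = 0.0
    for yt in y:
        g = g_obs(yt, grid)
        q = np.sum(g * p) * dx                     # (6): predictive density of Y_t
        loglik += np.log(q)                        # accumulate (5)
        pi = g * p / (np.sum(g * p) * dx)          # (4): normalized filtering density
        F = f_trans(grid[:, None], grid[None, :])  # F[j, k] = f(grid[j] | grid[k])
        p = F @ pi * dx                            # (3): next predictive density
    return loglik
```

For a Gaussian model this agrees with the exact predictive density to the accuracy of the grid.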

Procedure 2.2 (Markov transition from Z_{i-1} to Z_i) For t = 1, 2, \ldots, T, carry out the following steps.

1. If t = 1, draw Q from f_0(\cdot); otherwise (if t > 1) draw Q from the transition density f_{t-1}(\cdot \mid Z_{i,t-1}).

2. Draw W_{i,t} from the density g_t(\cdot \mid Q).

3. Compute \alpha = \min(1, g_t(y_t \mid Q) / g_t(y_t \mid Z_{i-1,t})). With probability \alpha, set Z_{i,t} = Q. Otherwise set Z_{i,t} = Z_{i-1,t}.

All draws of Q, W_{i,t}, and acceptance decisions (Step 3), over differing values of t and i, are required to be mutually independent, conditioned on the observations used to determine their distributions. The draws W_{i,t} obtained in Procedure 2.2, for large i, can be regarded as (dependent) draws from a distribution with density approaching q_t(\cdot). Hence it would be possible to construct estimators of the one-step predictive densities q_t(y_t), using the standard kernel density estimators

q_t^{(n)}(y_t) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_n^q} K\!\left(\frac{y_t - W_{i,t}}{h_n}\right),   (7)

where K(\cdot) is a kernel function and \{h_i, i = 1, 2, \ldots\} is a sequence of bandwidths. (See Scott, 1992, for more details on kernel density estimation.) Both the function K(\cdot) and the bandwidths h_i are required to satisfy certain standard conditions (stated in the next section). The drawback is that to compute each of these estimates, one must keep track of the entire set of draws W_{i,t}, i = 1, 2, \ldots, n. Therefore it is much more convenient to use the recursive form of the kernel density estimator. This has the same form as a standard kernel density estimator, but in (7) the bandwidths are indexed by i, rather than n. In other words, we write

\hat q_t^{(n)}(y_t) = \frac{1}{n} \sum_{i=1}^n \frac{1}{h_i^q} K\!\left(\frac{y_t - W_{i,t}}{h_i}\right).   (8)

This allows the estimator to be expressed in the recursive form

\hat q_t^{(n)}(y_t) = \frac{n-1}{n} \hat q_t^{(n-1)}(y_t) + \frac{1}{n h_n^q} K\!\left(\frac{y_t - W_{n,t}}{h_n}\right).   (9)

A natural choice for a recursive estimator of the log-likelihood of the generalized state-space model (c.f. (5)) is then

\hat l_n(y_1, \ldots, y_T) = \sum_{t=1}^T \log(\hat q_t^{(n)}(y_t)).   (10)
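Procedure 2.2 together with the recursive update (9) and the sum (10) is straightforward to implement. The following sketch assumes scalar states and observations (p = q = 1), a Gaussian kernel K, and user-supplied samplers; all function and parameter names are mine, and the bandwidth constants are illustrative, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcmc_loglik_estimate(y, draw_f0, draw_f, draw_g, g_obs, n_iter,
                         h0=0.5, alpha=0.2):
    """Sketch of Procedure 2.2 combined with the recursive estimator
    (9)-(10), for p = q = 1 and a Gaussian kernel K.

    draw_f0()   : draw from the initial state density f_0
    draw_f(x)   : draw from the transition density f(. | x)
    draw_g(x)   : draw from the observation density g(. | x)
    g_obs(y, x) : evaluate g(y | x); an unnormalized version suffices,
                  since only ratios enter the acceptance probability
    """
    T = len(y)
    z = np.empty(T)                     # arbitrary initial states Z_0,
    z[0] = draw_f0()                    # here drawn by a forward pass
    for t in range(1, T):
        z[t] = draw_f(z[t - 1])
    qhat = np.zeros(T)                  # recursive estimates of q_t(y_t)
    for n in range(1, n_iter + 1):
        h = h0 * n ** (-alpha)          # bandwidth h_n = h_0 * n^(-alpha)
        for t in range(T):
            # Step 1: propose from the state transition density; note that
            # z[t-1] has already been updated in this sweep.
            q = draw_f0() if t == 0 else draw_f(z[t - 1])
            # Step 2: draw W_{n,t} from the observation density given Q.
            w = draw_g(q)
            # Step 3: accept or reject the proposed state.
            if rng.random() < min(1.0, g_obs(y[t], q) / g_obs(y[t], z[t])):
                z[t] = q
            # Recursive kernel update (9), standard normal kernel.
            K = np.exp(-0.5 * ((y[t] - w) / h) ** 2) / np.sqrt(2.0 * np.pi)
            qhat[t] = (n - 1) / n * qhat[t] + K / (n * h)
    return float(np.sum(np.log(qhat)))  # the estimator (10)
```

Note that the running state z and the running estimates qhat are all that must be stored; this is precisely the recursive property emphasized above.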

3 Analysis

In this section, we state our two main results. We will use terminology as in Meyn and Tweedie (1993); in particular, a Markov chain \{X_t\} taking values in \mathbb{R}^d will be said to be uniformly ergodic if it has a unique invariant distribution \nu (also referred to as a limiting distribution) such that

\sup_{x \in \mathbb{R}^d} \| P^n(x, \cdot) - \nu \|_{TV} \to 0, \quad \text{as } n \to \infty,

where P^n(x, \cdot) denotes the n-step transition kernel of the chain (P(x, A) = \Pr(X_{n+1} \in A \mid X_n = x)), and \| \cdot \|_{TV} denotes the total variation norm on signed measures. (Note that uniform ergodicity implies irreducibility and aperiodicity.) The first result says that the simulation procedure yields draws from distributions converging to the filtering distributions.

Theorem 3.1 Suppose that for the model (1,2), Assumption 2.1 holds. Suppose also that an arbitrary initial set of states Z_0 is chosen, and that Procedure 2.2 is used to generate a Markov chain \{Z_i, i = 1, 2, \ldots\}. Then \{Z_i\} is a uniformly ergodic Markov chain with a limiting distribution \nu on (\mathbb{R}^{p \times T}, \mathcal{B}^{p \times T}), and the marginal distributions of \nu have densities \nu_1, \ldots, \nu_T, with respect to \lambda, given by \nu_t(x) = \pi_t(x), where \pi_t denotes the conditional density of X_t, given Y_1 = y_1, \ldots, Y_t = y_t.

Remark: Although the scheme contains marginal updates which are consistent with Metropolis-Hastings updates, the overall scheme is not a Metropolis-Hastings scheme. The limiting distribution of the chain \{Z_i\} is not known (at least to the author, at the time of writing this paper); only its marginals are known.

We next impose standard restrictions on the kernel function K(\cdot) and the bandwidths \{h_n\}.

Assumption 3.2 The kernel function K(\cdot) is bounded, integrable with \int K(u) \, du = 1, satisfies \int u K(u) \, du = 0, and has an integrable radial majorant \psi(x) = \sup_{\|y\| \ge \|x\|} |K(y)|. Furthermore, the sequence of bandwidths \{h_n\} is given by

h_n = h_0 n^{-\alpha},   (11)

for some h_0 > 0 and 0 < \alpha < 1/q.
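The bandwidth sequence (11) is easy to experiment with. As a small illustration of Assumption 3.2 in action (the constants below are my own illustrative choices), the recursive estimator applied to iid N(0,1) draws, with q = 1 and h_n = h_0 n^{-\alpha}, settles near the true density value:

```python
import numpy as np

# Feed iid N(0,1) draws into the recursive kernel estimator and watch the
# estimate at a point approach the true density.  With q = 1, any
# 0 < alpha < 1 satisfies Assumption 3.2; h0 = 1.06 and alpha = 1/5 are
# common rule-of-thumb values, used here purely for illustration.
rng = np.random.default_rng(42)
h0, alpha, y0 = 1.06, 0.2, 0.0
qhat = 0.0
for n in range(1, 50001):
    h = h0 * n ** (-alpha)                   # h_n = h_0 * n^(-alpha)
    w = rng.normal()                         # here the W's are simply iid N(0,1)
    K = np.exp(-0.5 * ((y0 - w) / h) ** 2) / np.sqrt(2.0 * np.pi)
    qhat = (n - 1) / n * qhat + K / (n * h)  # recursive update (9)

true_density = 1.0 / np.sqrt(2.0 * np.pi)    # N(0,1) density at y0 = 0
```

After 50,000 draws the estimate is within about one percent of the true value in this run.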

Remark: It is possible to allow a more general functional form for h_n, but the expression in (11) is widely used, and convenient to compute. In fact, it has been shown (see, e.g., Scott, 1992) that under certain assumptions, in the case where q = 1, choosing h_0 to be 1.06 times the standard deviation of the distribution whose density is being estimated, and \alpha = 1/5, is a good choice for the non-recursive form of the estimator.

The next result establishes consistency of the estimator as the number of iterations of the Markov chain increases. It relies on an extension of a consistency result of Masry and Györfi (1987), given in the appendix.

Theorem 3.3 Suppose that the conditions of Theorem 3.1 are satisfied. Let \hat l_n(y_1, \ldots, y_T) be the recursive estimator of the log-likelihood, given by (9,10), with kernel function and bandwidths satisfying Assumption 3.2. Then, for almost all (y_1, \ldots, y_T),

\lim_{n \to \infty} \hat l_n(y_1, \ldots, y_T) = l(y_1, \ldots, y_T),

almost surely.

Remark: In Theorem 3.3, almost sure convergence does not occur in the usual sense of convergence as the amount of data becomes infinitely large. Rather, it occurs as the number of Markov chain iterations grows. Thus, arbitrarily precise estimates can be obtained for a given model, and given observed data \{y_1, \ldots, y_T\}, simply by allowing the number of iterations of the Markov chain \{Z_i\} to increase.

4 Example

As a simple illustration of the procedure, consider a Gaussian first-order autoregressive process with additive Gaussian noise. To be more precise, let the model be given by

X_{t+1} = 0.5 X_t + Z_t, \quad t = 1, 2, \ldots, \quad \{Z_t\} \sim \text{iid } N(0,1),
Y_t = X_t + W_t, \quad \{W_t\} \sim \text{iid } N(0,1),

with X_1 \sim N(0, 4/3), where the processes \{W_t\} and \{Z_t\} are independent of each other and of X_1. We simulate observations \{y_1, \ldots, y_{100}\} from this model, and then it is a simple matter to compute the exact log-likelihood using the Kalman filter, and the estimates \hat l_n(y_1, \ldots, y_{100}) (in both cases, assuming parameters are known).
Figure 1 shows the resulting estimate \hat l_n(\ldots) as a function of n, with the true log-likelihood shown as a horizontal line.
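A scaled-down version of the experiment behind Figure 1 can be written in a few lines. The sketch below is my own code, not the paper's: it shortens the series to T = 20, uses a Gaussian kernel with illustrative bandwidth constants, simulates the model, computes the exact log-likelihood with the scalar Kalman recursions, and runs Procedure 2.2 with the recursive estimator (9)-(10).

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulate the example model (shortened to T = 20 to keep the demo fast).
T = 20
x = np.empty(T)
x[0] = rng.normal(0.0, np.sqrt(4.0 / 3.0))        # X_1 ~ N(0, 4/3)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + rng.normal()          # X_{t+1} = 0.5 X_t + Z_t
y = x + rng.normal(size=T)                        # Y_t = X_t + W_t

def kalman_ll(y):
    """Exact log-likelihood via the scalar Kalman recursions."""
    m, v, ll = 0.0, 4.0 / 3.0, 0.0                # predictive moments of X_1
    for yt in y:
        s = v + 1.0                               # predictive variance of Y_t
        ll += -0.5 * (np.log(2.0 * np.pi * s) + (yt - m) ** 2 / s)
        k = v / s                                 # Kalman gain
        mf, vf = m + k * (yt - m), (1.0 - k) * v  # filtering moments
        m, v = 0.5 * mf, 0.25 * vf + 1.0          # predict the next state
    return ll

def mc_ll(y, n_iter, h0=0.5, alpha=0.2):
    """Procedure 2.2 plus the recursive estimator (9)-(10), Gaussian kernel."""
    Tn = len(y)
    z = np.zeros(Tn)                              # arbitrary initial Z_0
    qhat = np.zeros(Tn)
    for n in range(1, n_iter + 1):
        h = h0 * n ** (-alpha)
        for t in range(Tn):
            prop = (rng.normal(0.0, np.sqrt(4.0 / 3.0)) if t == 0
                    else 0.5 * z[t - 1] + rng.normal())
            w = prop + rng.normal()               # W_{n,t} ~ g(. | proposal)
            accept = np.exp(-0.5 * ((y[t] - prop) ** 2 - (y[t] - z[t]) ** 2))
            if rng.random() < min(1.0, accept):
                z[t] = prop
            K = np.exp(-0.5 * ((y[t] - w) / h) ** 2) / np.sqrt(2.0 * np.pi)
            qhat[t] = (n - 1) / n * qhat[t] + K / (n * h)
    return float(np.sum(np.log(qhat)))

exact = kalman_ll(y)
est = mc_ll(y, 5000)
```

With a few thousand iterations the estimate typically lands within a few percent of the exact value, consistent with the behaviour reported for Figure 1.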

[Figure 1: the kernel density estimate of the log-likelihood ("Kernel Density Estimator") and the exact log-likelihood, plotted against Markov chain iteration (in units of 10^4).]

Figure 1: Comparison of the estimator \hat l_n(\ldots) as n increases, for a simulated time series of length 100. In this case, since the model is linear and Gaussian, the true log-likelihood, shown as the horizontal line, can be calculated using the Kalman filter.

In this particular example, of course, it is far quicker to use the Kalman filter to compute the log-likelihood. However, the Kalman filter is not applicable in the more general case covered in this paper. Notice also that reasonably good estimates of the log-likelihood are obtained in this example after only a few thousand iterations of the Markov chain.

5 Discussion

In this paper, we have introduced a Markov chain simulation procedure which yields draws from distributions approaching the filtering and one-step predictive distributions as the number of Markov chain iterations increases, and shown that the draws from the Markov chain can be used in recursive kernel density estimators to obtain consistent estimates of the log-likelihood. A potentially useful by-product of this procedure, not considered in this paper, is sets of residuals. Since the procedure yields consistent estimates of the one-step predictive densities q_t(\cdot), it can also be readily adapted to give estimates of the one-step predictive cumulative distribution functions Q_t(y_t) = \int_{-\infty}^{y_t} q_t(y) \, dy. In the one-dimensional case, these can be trivially converted into residuals by the transformation R_t = \Phi^{-1}(Q_t(y_t)), where \Phi^{-1}(\cdot) denotes the inverse of the standard normal cumulative distribution function. If the model were indeed correct, then these residuals would be realizations of a set of iid standard normal random variables. Thus checks for model goodness-of-fit could be carried out using standard batteries of tests for iid normal random variables. Another aspect worth further study is the asymptotic distributional properties of the estimators. While consistency is nice, it would be useful, if carrying out optimization over parameter space, for instance, to be able to estimate the error in the log-likelihood estimator, as a function of the number of Markov chain iterations.

6 Acknowledgments

The author is grateful to Arnaud Doucet, Larry Wasserman and Peter Spirtes for valuable comments and discussion related to various aspects of parts of this work.

A Proofs

First we give the proof that Procedure 2.2 generates a Markov chain with a limiting distribution whose marginal distributions are the filtering distributions.

Proof of Theorem 3.1: The proof consists of two main parts. In the first part, we show that \{Z_i\} is a uniformly ergodic Markov chain, while in the second we show that the marginal distributions of the limiting distribution match the desired filtering distributions. It will be useful to define, for m = 1, 2, \ldots, T,

Z_i^{(m)} = (Z_{i,1}, \ldots, Z_{i,m}) \in \mathbb{R}^{p \times m}.

(Note that Z_i^{(T)} = Z_i.)

Part 1. Consider the sequence \{Z_i^{(m)}\}, for some fixed m \in \{1, \ldots, T\}. It is easily verified that the sequence is a Markov chain. Let its transition kernel be denoted by

R_m^k(x, A) = \Pr(Z_{i+k}^{(m)} \in A \mid Z_i^{(m)} = x), \quad x \in \mathbb{R}^{p \times m}, \; A \in \mathcal{B}^{p \times m}.

Let \mu_X^{(m)} : \mathcal{B}^{p \times m} \to \mathbb{R} denote the distribution of the process \{X_1, \ldots, X_m\}. Then

R_m^1(x, A) \ge \int_{z \in A} \alpha_m(x, z) \, d\mu_X^{(m)}(z),   (12)

where \alpha_m(x, z) is the probability of accepting all proposals from t = 1 to t = m, given by

\alpha_m(x, z) = \prod_{t=1}^m \min\!\left(1, \frac{g_t(y_t \mid z_t)}{g_t(y_t \mid x_t)}\right).   (13)

This expression can be bounded below by a function of z,

\alpha_m(x, z) \ge \underline{\alpha}_m(z) := \prod_{t=1}^m \min\!\left(1, \frac{g_t(y_t \mid z_t)}{\sup_{x \in \mathbb{R}^p} g_t(y_t \mid x)}\right) > 0.   (14)

(The fact that \underline{\alpha}_m(z) > 0 follows immediately from Assumption 2.1.) Combining (12) and (14), we see that for every A such that \mu_X^{(m)}(A) > 0,

R_m^1(x, A) \ge \int_A \underline{\alpha}_m(z) \, d\mu_X^{(m)}(z) := \zeta^{(m)}(A) > 0.   (15)

It follows directly from inequality (15), since the measure \zeta^{(m)} is non-trivial and doesn't depend on x, that the entire state space \mathbb{R}^{p \times m} constitutes a small set (see Meyn and Tweedie, 1993, for the definition of a small set). Thus by Theorem (v) of Meyn and Tweedie (1993), the Markov chain is uniformly ergodic. In other words, it has a limiting distribution \nu^{(m)}, and (see Theorem of Meyn and Tweedie, 1993) for some constants 0 < c_m < \infty and 0 < \rho_m < 1,

\sup_{x \in \mathbb{R}^{p \times m}} \| R_m^k(x, \cdot) - \nu^{(m)}(\cdot) \| \le c_m \rho_m^k,   (16)

where \| \cdot \| denotes the total variation norm of a signed measure.

Part 2. It remains to be established that the corresponding marginal distributions match the filtering distributions \pi_t. This can be done inductively. First consider \{Z_i^{(1)}, i = 1, 2, \ldots\}. We know from the previous part that this is a uniformly ergodic Markov chain. It is easily checked that it is also a Metropolis-Hastings chain with proposal density f_0(\cdot) and a limiting distribution with density \pi_1(\cdot). Therefore \nu^{(1)} has density \pi_1. Next, suppose that the marginal densities of \nu^{(m)} are \nu_t^{(m)}(\cdot) = \pi_t(\cdot), for t = 1, 2, \ldots, m. We know that \nu^{(m+1)} exists, and since the first m components of Z_i^{(m+1)} are exactly equal to Z_i^{(m)}, the first m marginal densities of \nu^{(m+1)} must be \nu_t^{(m+1)}(\cdot) = \pi_t(\cdot), t = 1, 2, \ldots, m. Now suppose that for some i, Z_i^{(m+1)} \sim \nu^{(m+1)}.
Then Z_{i+1,m} \sim \pi_m, and is independent of Z_{i,m+1}, so the proposal Q generated for Z_{i+1,m+1} has density p_{m+1}(\cdot) (recall the definition (3)), and is independent of Z_{i,m+1}. The acceptance probability is \min(1, g_{m+1}(y_{m+1} \mid Q)/g_{m+1}(y_{m+1} \mid Z_{i,m+1})). This means that Z_{i+1,m+1} can be regarded as the result of an application of a Metropolis-Hastings transition kernel to the state Z_{i,m+1}, for which the limiting distribution is proportional to p_{m+1}(\cdot) g_{m+1}(y_{m+1} \mid \cdot) \propto \pi_{m+1}(\cdot).

Thus there is only one density which both Z_{i,m+1} and Z_{i+1,m+1} can have (since this hypothetical Metropolis-Hastings transition kernel has a unique invariant distribution), and that must be \pi_{m+1}(\cdot). In other words, the marginal density \nu_{m+1}^{(m+1)}(\cdot) must be equal to \pi_{m+1}(\cdot). We have just shown that if the marginal densities of \nu^{(m)} are \pi_1, \ldots, \pi_m, then the (m+1)st marginal density of \nu^{(m+1)} must be \pi_{m+1}. Therefore, by induction, the marginal densities of \nu (which is the same as \nu^{(T)}) must be \pi_1, \ldots, \pi_T. This completes the proof.

Lemma A.1 Suppose that the conditions of Theorem 3.3 hold. Then the sequence \{(Z_i, W_i), i = 1, 2, \ldots\} is a uniformly ergodic Markov chain taking values in \mathbb{R}^{pT+qT}, with a limiting distribution for which the marginal distributions of the components corresponding to W_{i,1}, \ldots, W_{i,T} have densities, with respect to Lebesgue measure, equal to the one-step predictive densities q_1, \ldots, q_T.

Proof: It is easily verified (from inspection of Procedure 2.2) that the distribution of (Z_i, W_i), given \{(Z_j, W_j), j < i\}, depends only on Z_{i-1}. Thus \{(Z_i, W_i)\} is a Markov chain. Let its transition kernel be denoted by

S^1(x, C) = \Pr((Z_i, W_i) \in C \mid (Z_{i-1}, W_{i-1}) = x), \quad x \in \mathbb{R}^{pT+qT}, \; C \in \mathcal{B}^{pT+qT}.

Consider a point x = (z, w) \in \mathbb{R}^{pT+qT} and a product set C = A \times B, with A \in \mathcal{B}^{p \times T}, B \in \mathcal{B}^{q \times T}. Then

S^1(x, C) = \Pr(Z_i \in A, W_i \in B \mid Z_{i-1} = z)
= \int_A \Pr(W_i \in B \mid Z_i = u) \Pr(Z_i \in du \mid Z_{i-1} = z)
= \int_A \Pr(W_i \in B \mid Z_i = u) R_T^1(z, du)
\ge \int_A \Pr(W_i \in B \mid Z_i = u) \, d\zeta^{(T)}(u)
= \int_{A \times B} g_{1:T}(v \mid u) \, d(\mathrm{Leb} \times \zeta^{(T)})(v, u),   (17)

where

g_{1:T}(v \mid u) = \prod_{i=1}^T g_i(v_i \mid u_i).

Since g_{1:T} is a strictly positive function, the last expression in (17) defines a measure on the family of product sets A \times B. By the extension theorem, there is a unique extension of this measure to \mathcal{B}^{pT+qT}. Let this measure be denoted by \xi(\cdot). Furthermore, note that \xi(\cdot) does not depend on x, and \xi(A \times B) > 0 whenever \mu_X(A) > 0 and the Lebesgue measure of B is positive. Thus

S^1(x, C) \ge \xi(C)   (18)

for all x, and \xi(\cdot) is a non-trivial measure. It follows that the entire state space \mathbb{R}^{pT+qT} constitutes a small set for the Markov chain \{(Z_i, W_i)\}, and hence, by Theorem (v) of Meyn and Tweedie (1993), that \{(Z_i, W_i)\} is a uniformly ergodic Markov chain. From Theorem 3.1, we know that \{Z_i\} has an invariant distribution \nu whose marginal distributions have densities \pi_1, \ldots, \pi_T. Thus the marginal distributions of the Z-component of the invariant distribution of the Markov chain \{(Z_i, W_i)\} are also \pi_1, \ldots, \pi_T. Now suppose that for some i, Z_{i,t-1} has density \pi_{t-1}(\cdot). It follows from equations (3) and (6), and Steps 1 and 2 of Procedure 2.2, that W_{i,t} must have density q_t(\cdot). Thus the marginal distributions of the W-component of the invariant distribution of \{(Z_i, W_i)\} must have densities q_1, \ldots, q_T. This completes the proof.

Lemma A.2 Suppose that the conditions of Theorem 3.3 hold. Let \mu_{i,t} denote the distribution of W_{i,t}, and \mu_t denote the corresponding marginal distribution of the limiting distribution of the Markov chain \{(Z_i, W_i)\}. Then for almost all y \in \mathbb{R}^q, we have

\lim_{i \to \infty} \int \frac{1}{h_i^q} K\!\left(\frac{y - u}{h_i}\right) d\mu_{i,t}(u) = q_t(y).

Proof: For convenience, define K_i(x) = h_i^{-q} K(x/h_i). Write

\int K_i(y - u) \, d\mu_{i,t}(u) = \int K_i(y - u) \, d\mu_t(u) + \int K_i(y - u) \, d\nu_{i,t}(u),

where \nu_{i,t} = \mu_{i,t} - \mu_t. Taking limits on both sides, and applying Lemma 3.1 of Masry and Györfi (1987) (since \mu_t has density q_t), we get, for almost all y \in \mathbb{R}^q,

\lim_{i \to \infty} \int K_i(y - u) \, d\mu_{i,t}(u) = q_t(y) + \lim_{i \to \infty} \int K_i(y - u) \, d\nu_{i,t}(u).   (19)

We can bound the last term in this expression as follows. Let \bar k denote the upper bound for K(\cdot). Then

\left| \int K_i(y - u) \, d\nu_{i,t}(u) \right| = \left| \frac{1}{h_i^q} \int K((y - u)/h_i) \, d\nu_{i,t}(u) \right| \le \frac{\bar k}{h_i^q} \int |d\nu_{i,t}(u)| = \frac{\bar k}{h_i^q} \| \mu_{i,t} - \mu_t \|,   (20)

where \| \cdot \| denotes the total variation norm. But by Theorem of Meyn and Tweedie (1993), since \{(Z_i, W_i)\} is a uniformly ergodic Markov chain, the total variation distance

in (20) is bounded (c.f. (16)) by a sequence converging geometrically to zero; that is, for some constants c > 0 and 0 < \rho < 1,

\frac{\bar k}{h_i^q} \| \mu_{i,t} - \mu_t \| \le \frac{\bar k}{h_i^q} c \rho^i = \frac{\bar k c \rho^i}{h_0^q i^{-\alpha q}} = \frac{\bar k c}{h_0^q} \, i^{\alpha q} \rho^i \to 0, \quad \text{as } i \to \infty.   (21)

Combining (19), (20), and (21), we get the desired result. (Note that for the absolute error term in (21) to converge to zero, we require that the Markov chain converge in total variation norm faster than the bandwidths h_i shrink to zero.)

The following result can be regarded as an extension of a special case of Theorem 2.1 of Masry and Györfi (1987).

Lemma A.3 Suppose that the conditions of Theorem 3.3 hold. Then for t = 1, 2, \ldots, T,

\frac{n^{(1-\alpha q)/2}}{(\log n)(\log_2 n)^{1+\delta}} \left[ \hat q_t^{(n)}(y) - E\, \hat q_t^{(n)}(y) \right] \to 0

almost surely, for almost all y \in \mathbb{R}^q, for any \delta > 0.

Proof: By Theorem of Meyn and Tweedie (1993), the sequence \{W_{i,t}, i = 1, 2, \ldots\} is asymptotically uncorrelated. The desired result then follows directly from an application of Theorem 2.1 of Masry and Györfi (1987) (using, in their notation, u_n^2 = [(\log n)(\log_2 n)]^{-1}). The fact that the theorem still applies in our case, where samples are not identically distributed, but instead only have distributions converging geometrically in total variation distance, can be established by replicating the proof given in Masry and Györfi (1987), replacing use of their Lemma 3.1 with our Lemma A.2.

We are now in a position to give the proof of the main convergence result.

Proof of Theorem 3.3: For fixed t \in \{1, \ldots, T\}, by Lemma A.3,

\hat q_t^{(n)}(y) - E[\hat q_t^{(n)}(y)] \to 0, \quad \text{as } n \to \infty,

for almost all y \in \mathbb{R}^q. Furthermore, since \hat q_t^{(n)}(y) = n^{-1} \sum_{i=1}^n K_i(y - W_{i,t}) and \mu_{i,t} is the distribution of W_{i,t}, it follows from Lemma A.2 that

E\, \hat q_t^{(n)}(y) \to q_t(y).

Thus \hat q_t^{(n)}(y) must converge almost surely, for almost all y \in \mathbb{R}^q, to q_t(y), and consequently \log(\hat q_t^{(n)}(y)) \to \log(q_t(y)) almost surely, for almost all y \in \mathbb{R}^q. The desired result then follows directly upon examination of definitions (5) and (10).

References

C. Andrieu and A. Doucet. Online expectation-maximization type algorithms. In Proc. IEEE ICASSP, 2003.

T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 1986.

G. Box and G. Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, 1970.

A. Brockwell, A. Rojas, and R. Kass. Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91, 2004.

P.J. Brockwell and R.A. Davis. Introduction to Time Series and Forecasting. Springer, second edition, 2002.

B.P. Carlin, N.G. Polson, and D.S. Stoffer. A Monte Carlo approach to nonnormal and nonlinear state-space modeling. Journal of the American Statistical Association, 87(418), 1992.

A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, 2001.

R.F. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of UK inflation. Econometrica, 50, 1982.

E. Ghysels, A.C. Harvey, and E. Renault. Stochastic volatility. In Handbook of Statistics, volume 14. Elsevier, 1996.

N.J. Gordon, D.J. Salmond, and A.F.M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F, 140, 1993.

R.E. Kalman. A new approach to linear prediction and filtering problems. Journal of Basic Engineering (ASME), 82D:35-45, 1960.

G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1-25, 1996.

E. Masry and L. Györfi. Strong consistency and rates for recursive probability density estimators of stationary processes. Journal of Multivariate Analysis, 22:79-93, 1987.

S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer, 1993.

D.W. Scott. Multivariate Density Estimation. Wiley, 1992.

R.H. Shumway and D.S. Stoffer. Time Series Analysis and its Applications. Springer, 2000.

J.S. Taylor. Modeling stochastic volatility: a review and comparative study. Mathematical Finance, 4, 1994.

E.J. Wegman and H.I. Davies. Remarks on some recursive estimators of a probability density. Annals of Statistics, 7(2), 1979.

M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, New York, second edition, 1997.

C.T. Wolverton and T.J. Wagner. Asymptotically optimal discriminant functions for pattern classification. IEEE Transactions on Information Theory, IT-15, 1969.

H. Yamato. Sequential estimation of a continuous probability density function and mode. Bull. Math. Statistics, 14:1-12, 1971.

B. Yu. Estimating the L_1 error of kernel estimators for Markov samplers. Technical Report 409, Dept. of Statistics, U.C. Berkeley, 1994.

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information