STABILITY AND UNIFORM APPROXIMATION OF NONLINEAR FILTERS USING THE HILBERT METRIC, AND APPLICATION TO PARTICLE FILTERS 1


The Annals of Applied Probability 0000, Vol. 00, No. 00, 000-000 STABILITY AND UNIFORM APPROXIMATION OF NONLINEAR FILTERS USING THE HILBERT METRIC, AND APPLICATION TO PARTICLE FILTERS 1 By François LeGland and Nadia Oudjane IRISA / INRIA Rennes and EDF R&D Clamart We study the stability of the optimal filter w.r.t. its initial condition and w.r.t. the model for the hidden state and the observations in a general hidden Markov model, using the Hilbert projective metric. These stability results are then used to prove, under some mixing assumption, the uniform convergence to the optimal filter of several particle filters, such as the interacting particle filter and some other original particle filters. 1. Introduction The stability of the optimal filter has recently become an active research area. Ocone and Pardoux have proved in [26] that the filter forgets its initial condition in the L^p sense, without stating any rate of convergence. Recently, a new approach has been proposed using the Hilbert projective metric. This metric makes it possible to get rid of the normalization constant in the Bayes formula, and reduces the problem to studying the linear equation satisfied by the unnormalized optimal filter. Using the Hilbert metric, stability results w.r.t. the initial condition have been proved by Atar and Zeitouni in [4], and some stability results w.r.t. the model have been proved by Le Gland and Mevel in [19, 20], for hidden Markov models (HMM) with finite state space. The results and methods of [4] have been extended to HMM with Polish state space by Atar and Zeitouni in [3], see also Da Prato, Fuhrman and Malliavin [8]. Independently, Del Moral and Guionnet have adopted in [9], for the same class of HMM, another approach based on semigroup techniques and on the Dobrushin ergodic coefficient, to derive stability results w.r.t. the initial condition, which are used to prove uniform convergence of the interacting particle system (IPS) approximation to the optimal predictor. 
New approaches have been proposed recently to prove the stability of the optimal filter w.r.t. its initial condition in the case of a noncompact state space, see e.g. Atar [1], Atar, Viens and Zeitouni [2], Budhiraja and Ocone [6, 7]. In this article, we use the approach based on the Hilbert metric to study the asymptotic behavior of the optimal filter, and to prove as in [9] the uniform convergence of several particle filters, such as the interacting particle filter (IPF) and some other original particle filters. A common assumption to prove stability results, see e.g. in [9, Theorem 2.4], is that the Markov transition kernels are mixing, which implies that the hidden state sequence is ergodic. Our results are obtained under the assumption that the nonnegative kernels describing the evolution of the unnormalized optimal filter, and incorporating simultaneously the Markov transition kernels and the likelihood functions, are mixing. This is a weaker assumption, see Proposition 3.9, which allows us to consider some cases, similar to the case studied in [6], where the hidden state sequence is not ergodic, see Example 3.10. This point of view is further developed by Le Gland and Oudjane in [22] and by Oudjane and Rubenthaler in [28]. Our main contribution is to study also the stability of the optimal filter w.r.t. the model, when the local error is propagated by mixing kernels, and can be estimated in the Hilbert metric, in the total variation norm, or in a weaker distance suitable for random probability distributions. AMS 1991 subject classifications. Primary 93E11, 93E15, 62E25; secondary 60B10, 60J27, 62G07, 62G09, 62L10. Key words and phrases. hidden Markov model, nonlinear filter, particle filter, stability, Hilbert metric, total variation norm, mixing, regularizing kernel. 
1 This work was partially supported by CNRS, under the projects Méthodes Particulaires en Filtrage Non Linéaire (project number 97 N23 / 0019, Modélisation et Simulation Numérique programme), Chaînes de Markov Cachées et Filtrage Particulaire (MathSTIC programme), and Méthodes Particulaires (AS67, DSTIC Action Spécifique programme).

The uniform convergence of the IPS approximation to the optimal predictor is proved in [9, Theorem 3.1], under the assumption that the likelihood functions are uniformly bounded away from zero, which is rather strong, and that the predictor is asymptotically stable. The rate (1/√N)^α for some α < 1 is proved under the stronger assumption that the predictor is exponentially asymptotically stable, and the rate 1/√N is proved in Del Moral and Miclo [11, page 36] under an additional assumption which is satisfied e.g. if the Markov kernels are mixing. Our uniform convergence results are obtained under the assumption that the expected values of the likelihood functions, integrated against any possible predicted probability distribution, are bounded away from zero. This assumption is automatically satisfied under our weaker mixing assumption, see Remark 5.7. Motivated by practical considerations, we introduce a variant of the IPF, where an adaptive number of particles is used, based on a posteriori estimates. The resulting sequential particle filter (SPF) is shown to converge uniformly to the optimal filter, independently of any lower bound assumption on the likelihood functions. The counterpart is that the computational time is random, and that the expected number of particles does depend on the integrated lower bounds of the likelihood functions. Also motivated by practical considerations, i.e. to avoid the degeneracy of particle weights and the degeneracy of particle locations, which are two known causes of divergence of particle filters, we introduce regularized particle filters (RPF), which are shown to converge uniformly to the optimal filter. The paper is organized as follows: In the next section we define the framework of the nonlinear filtering problem and we introduce some notations. In Section 3, we state some properties of the Hilbert metric, which are used in Section 4 to prove the stability of the optimal filter w.r.t. 
its initial condition and w.r.t. the model. These stability results are used to prove the uniform convergence of several particle filters to the optimal filter. First, uniform convergence in the weak sense is proved in Section 5 for interacting particle filters, with a rate 1/√N, and sequential particle filters, with a random number of particles, are also considered. Finally, regularized particle filters are defined in Section 6, for which uniform convergence in the weak sense and in the total variation norm are proved.

2. Optimal filter for general HMM We consider the following model, with a hidden (non observed) state sequence {X_n, n ≥ 0} and an observation sequence {Y_n, n ≥ 1}, taking values in a complete separable metric space E and in F = R^d, respectively (in Section 6, it will be assumed that E = R^m):

The state sequence {X_n, n ≥ 0} is defined as an inhomogeneous Markov chain, with transition probability kernel Q_n, i.e.

P[X_n ∈ dx | X_{0:n-1} = x_{0:n-1}] = P[X_n ∈ dx | X_{n-1} = x_{n-1}] = Q_n(x_{n-1}, dx), for all n ≥ 1,

and with initial probability distribution µ_0. For instance, {X_n, n ≥ 0} could be defined by the following equation

(1) X_n = f_n(X_{n-1}, W_n),

where {W_n, n ≥ 1} is a sequence of independent random variables, not necessarily Gaussian, independent of the initial state X_0.

The memoryless channel assumption holds, i.e. given the state sequence {X_n, n ≥ 0} the observations {Y_n, n ≥ 1} are independent random variables, and for all n ≥ 1 the conditional probability distribution of Y_n depends only on X_n. For instance, the observation sequence {Y_n, n ≥ 1} could be related to the state sequence {X_n, n ≥ 0} by

Y_n = h_n(X_n, V_n),

for all n ≥ 1, where {V_n, n ≥ 1} is a sequence of independent random variables, not necessarily Gaussian, independent of the state sequence {X_n, n ≥ 0}. In addition, it is assumed that for all n ≥ 1, the collection of probability distributions P[Y_n ∈ dy | X_n = x] on F, parametrized by x ∈ E, is dominated, i.e.

P[Y_n ∈ dy | X_n = x] = g_n(x, y) λ_n^F(dy),

for some nonnegative measure λ_n^F on F. The corresponding likelihood function is defined by Ψ_n(x) = g_n(x, Y_n), and depends implicitly on the observation Y_n.

The following notations and definitions will be used throughout the paper. The set of probability distributions on E, and the set of finite nonnegative measures on E, are denoted by P(E) and M⁺(E) respectively. The notation ‖·‖ is used for the total variation norm on the set of signed measures on E, and for the supremum norm on the set of bounded measurable functions defined on E, depending on the context. With any nonnegative kernel K defined on E is associated a nonnegative linear operator, also denoted by K, and defined by

K µ(dx') = ∫ µ(dx) K(x, dx'),

for any nonnegative measure µ ∈ M⁺(E). With any nonnegative measure µ ∈ M⁺(E) is associated the normalized nonnegative measure (i.e. the probability distribution)

µ̄ := µ / µ(E), if µ(E) > 0, i.e. if µ is nonzero,
µ̄ := ν, otherwise, i.e. if µ ≡ 0,

where ν ∈ P(E) is an arbitrary probability distribution. With any nonnegative kernel K defined on E is associated the normalized nonnegative nonlinear operator K̄, taking values in P(E), and defined for any nonnegative measure µ ∈ M⁺(E) by

K̄(µ) := K µ / (K µ)(E), if (K µ)(E) > 0, i.e. if K µ is nonzero,
K̄(µ) := ν, otherwise, i.e. if K µ ≡ 0,

where ν ∈ P(E) is an arbitrary probability distribution. Notice that K̄(µ) = K̄(µ̄) is nonzero by definition, hence the composition of normalized nonnegative nonlinear operators is well defined. 
The problem of nonlinear filtering is to compute at each time n the conditional probability distribution µ_n of the state X_n given the observation sequence Y_{1:n} = (Y_1, ..., Y_n) up to time n. The transition from µ_{n-1} to µ_n is described by the following diagram

µ_{n-1} --prediction--> µ_{n|n-1} = Q_n µ_{n-1} --correction--> µ_n = Ψ_n · µ_{n|n-1} = Ψ_n µ_{n|n-1} / ⟨µ_{n|n-1}, Ψ_n⟩,

where · denotes the projective product. In general, no explicit expression is available for the Markov kernel Q_n, or it is so complicated that computing integrals such as

µ_{n|n-1}(dx') = Q_n µ_{n-1}(dx') = ∫ µ_{n-1}(dx) Q_n(x, dx'),

is practically impossible. Instead, throughout this paper we assume that for any x ∈ E, simulating a r.v. with probability distribution Q_n(x, dx') is easy (this is the case for instance if (1) holds).

Remark 2.1. Notice that the normalizing constant ⟨µ_{n|n-1}, Ψ_n⟩ is a.s. positive, hence the projective product Ψ_n · µ_{n|n-1} is well defined. Indeed

P[Y_n ∈ dy | Y_{1:n-1}] = ∫ P[Y_n ∈ dy | X_n = x] P[X_n ∈ dx | Y_{1:n-1}] = [ ∫ g_n(x, y) µ_{n|n-1}(dx) ] λ_n^F(dy) = l_n(y) λ_n^F(dy),

hence

⟨µ_{n|n-1}, Ψ_n⟩ = ∫ g_n(x, Y_n) µ_{n|n-1}(dx) = l_n(Y_n).

Therefore

P[ ⟨µ_{n|n-1}, Ψ_n⟩ = 0 | Y_{1:n-1} ] = ∫_F 1_{l_n(y) = 0} l_n(y) λ_n^F(dy) = 0.

Remark 2.2. Notice also that, for any test function ψ defined on F

E[ ψ(Y_n) / ⟨µ_{n|n-1}, Ψ_n⟩ | Y_{1:n-1} ] = E[ ψ(Y_n) / l_n(Y_n) | Y_{1:n-1} ] = ∫_F ψ(y) λ_n^F(dy).

In particular, if ψ(y) = g_n(x, y), then ψ(Y_n) = Ψ_n(x), and

E[ Ψ_n(x) / ⟨µ_{n|n-1}, Ψ_n⟩ | Y_{1:n-1} ] = ∫_F g_n(x, y) λ_n^F(dy) = 1, for any x ∈ E.

For any n ≥ 1, we introduce the nonnegative kernel

(2) R_n(x, dx') = Q_n(x, dx') Ψ_n(x'),

and the associated nonnegative linear operator R_n = Ψ_n Q_n on M⁺(E), defined by

R_n µ(dx') = ∫ µ(dx) Q_n(x, dx') Ψ_n(x'),

for any µ ∈ M⁺(E). Notice that R_n depends on the observation Y_n through the likelihood function Ψ_n. With this definition, (R_n µ_{n-1})(E) = ⟨µ_{n|n-1}, Ψ_n⟩ is a.s. positive, and the evolution of the optimal filter can be written as follows

(3) µ_n = Ψ_n · (Q_n µ_{n-1}) = R_n µ_{n-1} / (R_n µ_{n-1})(E) = R̄_n(µ_{n-1}),

and iteration yields

µ_n = R̄_n(µ_{n-1}) = R̄_n ··· R̄_m(µ_{m-1}) = R̄_{n:m}(µ_{m-1}).

Equation (3) shows clearly that the evolution of the optimal filter is nonlinear only because of the normalization term coming from the Bayes rule. In the following section a projective metric is introduced precisely to get rid of the normalization and to come down to the analysis of a linear evolution.
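The predict/correct recursion (3) is exactly what a particle filter approximates: propagate particles by sampling from Q_n (assumed easy to simulate), weight them by the likelihood Ψ_n, normalize, and resample. Below is a minimal numerical sketch of one such step; the toy linear-Gaussian model, the function names and all numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter_step(particles, y, transition_sample, likelihood):
    """One predict/correct/resample step approximating the recursion (3)."""
    # Prediction: propagate each particle by sampling from Q_n(x, dx'),
    # which is assumed easy to simulate (e.g. when (1) holds).
    predicted = transition_sample(particles)
    # Correction: weight by the likelihood Psi_n(x) = g_n(x, Y_n),
    # then normalize (the Bayes rule normalization in (3)).
    weights = likelihood(predicted, y)
    weights = weights / weights.sum()
    # Multinomial resampling returns an equally weighted particle set.
    idx = rng.choice(len(predicted), size=len(predicted), p=weights)
    return predicted[idx]

# Toy linear-Gaussian model (illustrative assumption):
# X_n = 0.9 X_{n-1} + W_n and Y_n = X_n + V_n, with standard Gaussian noise.
transition = lambda x: 0.9 * x + rng.standard_normal(x.shape)
likelihood = lambda x, y: np.exp(-0.5 * (y - x) ** 2)

particles = rng.standard_normal(1000)   # N draws from an initial mu_0
for y in [0.5, 1.0, 0.2]:
    particles = particle_filter_step(particles, y, transition, likelihood)
```

The approximate filter is then the empirical distribution of the returned particles.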

Remark 2.3. The model considered here is slightly different from the model considered in other works, see [11] and references therein, where it is assumed that an observation Y_0 is already available at time 0, and where the object of study is rather the conditional probability distribution η_n of the state X_n given the observation sequence Y_{0:n-1} = (Y_0, ..., Y_{n-1}) up to time (n-1). With our notations, the evolution of the optimal predictor in this alternate model can be written as follows

η_{n+1} = Q_{n+1} (Ψ_n · η_n),

and iteration yields

(4) η_{n+1} = Q_{n+1} R̄_n ··· R̄_m (Ψ_{m-1} · η_{m-1}) = Q_{n+1} R̄_{n:m} (Ψ_{m-1} · η_{m-1}) = Q_{n+1} η̂_n,

where η̂_n = Ψ_n · η_n, with initial condition η_0 = µ_0.

3. Hilbert metric on the set of finite nonnegative measures In this section we recall the definition of the Hilbert metric and its associated contraction coefficient, the Birkhoff contraction coefficient. We also introduce a mixing property for nonnegative kernels, and we state some properties relating the Hilbert metric to other distances on the set of probability distributions, e.g. the total variation norm, or a weaker distance suitable for random probability distributions. In the last part of the section, these definitions and properties are specialized to the optimal filtering context.

Definition 3.1. Two nonnegative measures µ, µ' ∈ M⁺(E) are comparable, if they are both nonzero, and if there exist positive constants 0 < a ≤ b, such that

a µ'(A) ≤ µ(A) ≤ b µ'(A),

for any Borel subset A ⊆ E.

Definition 3.2. The nonnegative kernel K defined on E is mixing, if there exist a constant 0 < ε ≤ 1, and a nonnegative measure λ ∈ M⁺(E), such that

ε λ(A) ≤ K(x, A) ≤ (1/ε) λ(A),

for any x ∈ E, and any Borel subset A ⊆ E.

Definition 3.3. The Hilbert metric on M⁺(E) is defined by

h(µ, µ') := log [ sup_{A : µ'(A)>0} µ(A)/µ'(A) / inf_{A : µ'(A)>0} µ(A)/µ'(A) ], if µ and µ' are comparable,
h(µ, µ') := 0, if µ = µ' ≡ 0,
h(µ, µ') := +∞, otherwise. 
Notice that the two nonnegative measures µ and µ' are comparable if and only if µ and µ' are equivalent, with Radon-Nikodym derivatives dµ/dµ' and dµ'/dµ bounded and bounded away from zero, and then the following equality holds

(5) h(µ, µ') = log [ sup_{A : µ'(A)>0} µ(A)/µ'(A) · sup_{A : µ(A)>0} µ'(A)/µ(A) ] = log ( ‖dµ/dµ'‖ ‖dµ'/dµ‖ ).

Moreover h is a projective distance, i.e. it is invariant under multiplication by positive scalars, hence the Hilbert distance between two unnormalized nonnegative measures is the same as the Hilbert distance between the two corresponding normalized measures:

h(µ, µ') = h(µ̄, µ̄'), for any nonzero µ, µ' ∈ M⁺(E).
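On a finite state space, two measures are comparable exactly when they have the same support, and formula (5) reduces to componentwise extremes of the two likelihood ratios. A small numerical sketch of the metric and of its projective invariance (the function name and the example measures are illustrative assumptions):

```python
import numpy as np

def hilbert_metric(mu, nu):
    """Hilbert metric between two discrete measures via formula (5):
    h(mu, nu) = log( sup dmu/dnu * sup dnu/dmu ), +inf if not comparable."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((mu > 0) != (nu > 0)):
        return np.inf          # different supports: not comparable
    m, n = mu[mu > 0], nu[nu > 0]
    return np.log((m / n).max() * (n / m).max())

mu = np.array([0.2, 0.3, 0.5])
nu = np.array([0.1, 0.4, 0.5])
h = hilbert_metric(mu, nu)

# Projective invariance: rescaling a measure does not change the distance.
assert np.isclose(hilbert_metric(7.0 * mu, nu), h)
assert np.isclose(hilbert_metric(mu, 3.0 * mu), 0.0)
```

This is the property used throughout the paper to replace the normalized operator R̄_n by the linear operator R_n.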

In the nonlinear filtering context, this property will allow us to consider the linear transformation µ → R_n µ instead of the nonlinear transformation µ → R̄_n(µ) = R_n µ / (R_n µ)(E). This projective property does not hold for other distances. Indeed, the following estimates show how the error between two unnormalized nonnegative measures can be used to bound the error between the two corresponding normalized measures. If µ = µ' ≡ 0, then µ̄ = µ̄' = ν, hence µ̄ − µ̄' ≡ 0. If both µ and µ' are nonzero, then

µ̄ − µ̄' = (1/µ(E)) [ (µ − µ') − (µ(E) − µ'(E)) µ̄' ],

hence

(6) |⟨µ̄ − µ̄', φ⟩| ≤ |⟨µ − µ', φ⟩| / µ(E) + |µ(E) − µ'(E)| ‖φ‖ / µ(E),

and

(7) ‖µ̄ − µ̄'‖ ≤ 2 ‖µ − µ'‖ / µ(E).

Finally, if µ is nonzero and µ' ≡ 0, then

µ̄ − µ̄' = µ/µ(E) − ν,

hence

|⟨µ̄ − µ̄', φ⟩| ≤ |⟨µ, φ⟩| / µ(E) + ‖φ‖ and ‖µ̄ − µ̄'‖ ≤ 2,

i.e. estimates (6) and (7) still hold (notice that the bounds in estimates (6) and (7) do not depend on the restarting probability distribution ν). The following two lemmas give several useful relations between the Hilbert metric, the total variation norm and a weaker distance suitable for random probability distributions.

Lemma 3.4. For any µ, µ' ∈ M⁺(E)

(8) ‖µ̄ − µ̄'‖ ≤ (2 / log 3) h(µ, µ').

If the nonnegative kernel K defined on E is mixing, then for any nonzero µ, µ' ∈ M⁺(E)

(9) h(K µ, K µ') ≤ (1/ε²) ‖µ̄ − µ̄'‖.

Proof of Lemma 3.4. If µ = µ' ≡ 0, then µ̄ = µ̄' = ν hence ‖µ̄ − µ̄'‖ = 0, while h(µ, µ') = 0 by definition. If µ is nonzero and µ' ≡ 0, then h(µ, µ') = ∞ by definition. Finally, if both µ and µ' are nonzero, the proof of the first inequality can be found in Atar and Zeitouni [3]. To prove the second inequality, notice first that, for any comparable µ, µ' ∈ M⁺(E)

h(µ, µ') = log sup_{A : µ'(A)>0} µ(A)/µ'(A) + log sup_{A : µ(A)>0} µ'(A)/µ(A) ≤ sup_{A : µ'(A)>0} (µ(A) − µ'(A))/µ'(A) + sup_{A : µ(A)>0} (µ'(A) − µ(A))/µ(A),

since log(1 + x) ≤ x. In order to apply this bound to h(K µ, K µ') = h(K µ̄, K µ̄'), we notice that K µ and K µ' are comparable for any nonzero µ, µ' ∈ M⁺(E), since K is mixing, and we introduce

Δ(A) = [ (K µ̄)(A) − (K µ̄')(A) ] / (K µ̄')(A) = ∫ (µ̄ − µ̄')(dx) Φ(x, A) = ∫ (µ̄ − µ̄')⁺(dx) Φ(x, A) − ∫ (µ̄ − µ̄')⁻(dx) Φ(x, A),

where

Φ(x, A) = K(x, A) / (K µ̄')(A) ≤ 1/ε²,

for any x ∈ E and any Borel subset A ⊆ E, using the mixing property. By the Scheffé theorem

∫ (µ̄ − µ̄')⁺(dx) = ∫ (µ̄ − µ̄')⁻(dx) = ½ ‖µ̄ − µ̄'‖,

hence if Δ(A) is positive, then

Δ(A) ≤ ∫ (µ̄ − µ̄')⁺(dx) Φ(x, A) ≤ (1/(2ε²)) ‖µ̄ − µ̄'‖,

and similarly, if Δ(A) is negative, then

−Δ(A) ≤ ∫ (µ̄ − µ̄')⁻(dx) Φ(x, A) ≤ (1/(2ε²)) ‖µ̄ − µ̄'‖.

Lemma 3.5. If the nonnegative kernel K defined on E is dominated, i.e. if there exist a constant c > 0, and a nonnegative measure λ ∈ M⁺(E), such that

K(x, A) ≤ c λ(A),

for any x ∈ E, and any Borel subset A ⊆ E, then

E ‖K µ − K µ'‖ ≤ c λ(E) sup_{‖φ‖≤1} E |⟨µ − µ', φ⟩|,

for any µ, µ' ∈ M⁺(E), possibly random.

Remark 3.6. If the nonnegative kernel K is mixing, then it is dominated, with the same nonnegative measure λ ∈ M⁺(E), and with c = 1/ε.

Remark 3.7. If in addition the nonnegative kernel K is F-measurable, then the same estimate holds for conditional expectations w.r.t. F, i.e.

(10) E[ ‖K µ − K µ'‖ | F ] ≤ c λ(E) sup_{‖φ‖≤1} E[ |⟨µ − µ', φ⟩| | F ].

Proof of Lemma 3.5. By definition, if K is dominated, then K(x, ·) is absolutely continuous w.r.t. λ, with Radon-Nikodym derivative k(x, ·) bounded by c, for any x ∈ E. Therefore, the total variation norm ‖K µ − K µ'‖ can be written as an integral as follows

‖K µ − K µ'‖ = ∫ | ∫ (µ − µ')(dx) k(x, x') | λ(dx'),

hence, taking expectation yields

E ‖K µ − K µ'‖ = ∫ E | ∫ (µ − µ')(dx) k(x, x') | λ(dx') ≤ sup_{‖φ‖≤1} E |⟨µ − µ', φ⟩| ∫ [ sup_{x∈E} k(x, x') ] λ(dx') ≤ c λ(E) sup_{‖φ‖≤1} E |⟨µ − µ', φ⟩|.

Lemma 3.8. The nonnegative kernel K defined on E is a contraction under the Hilbert metric, and

(11) τ(K) := sup_{0 < h(µ,µ') < ∞} h(K µ, K µ') / h(µ, µ') = tanh[ ¼ H(K) ],

where the supremum in H(K) := sup_{µ,µ'} h(K µ, K µ') is taken over nonzero nonnegative measures; τ(K) is called the Birkhoff contraction coefficient. The proof can be found in Birkhoff [5, Chapter XVI, Theorem 3] or in Hopf [17, Theorem 1]. Notice that H(K) < ∞ implies τ(K) < 1.

Returning to the filtering problem introduced in Section 2, the stability results stated in the following sections will in general require that for any n ≥ 1, the nonnegative kernel R_n is mixing, i.e. there exist a constant 0 < ε_n ≤ 1, and a nonnegative measure λ_n ∈ M⁺(E), such that

ε_n λ_n(A) ≤ R_n(x, A) ≤ (1/ε_n) λ_n(A),

for any x ∈ E, and any Borel subset A ⊆ E. Notice that in full generality ε_n and λ_n depend on the observation Y_n, hence are random variables.

Proposition 3.9. The nonnegative kernel R_n defined in (2) is a contraction under the Hilbert metric, with Birkhoff contraction coefficient τ_n = τ(R_n) ≤ 1. Moreover

(i) If R_n is mixing, with the possibly random constant ε_n, then

τ_n ≤ (1 − ε_n²) / (1 + ε_n²) < 1.

(ii) If the Markov transition kernel Q_n is mixing, with the nonrandom constant ε_n, then R_n is also mixing, with the same constant ε_n, without any condition on the likelihood function Ψ_n, and

τ_n ≤ τ(Q_n) ≤ (1 − ε_n²) / (1 + ε_n²) < 1.

Throughout the paper, for any integers m ≤ n, the contraction coefficient of the product R_{n:m} = R_n ··· R_m is denoted by τ_{n:m} = τ(R_{n:m}) ≤ τ_n ··· τ_m, and by convention τ_{n:n+1} = τ_{m-1:m} = 1.

Proof of Proposition 3.9. It follows immediately from Lemma 3.8 that R_n is a contraction under the Hilbert metric. If R_n is mixing, then for any nonzero µ, µ' ∈ M⁺(E), and any Borel subset A ⊆ E

ε_n² (R_n µ')(A)/µ'(E) ≤ ε_n λ_n(A) ≤ (R_n µ)(A)/µ(E) ≤ (1/ε_n) λ_n(A) ≤ (1/ε_n²) (R_n µ')(A)/µ'(E),

hence R_n µ and R_n µ' are comparable. Using equation (5) yields

H(R_n) = sup_{µ,µ'} h(R_n µ, R_n µ') = sup_{µ,µ'} log( ‖d(R_n µ)/d(R_n µ')‖ ‖d(R_n µ')/d(R_n µ)‖ ) ≤ log(1/ε_n⁴),

where the supremum is taken over nonzero nonnegative measures. Then using Lemma 3.8 yields

τ_n = τ(R_n) = tanh[ ¼ H(R_n) ] ≤ tanh( log(1/ε_n) ) = (1 − ε_n²) / (1 + ε_n²) < 1,

which ends the proof of (i). If Q_n is mixing, then R_n = Ψ_n Q_n is also mixing, since

ε_n ∫_A Ψ_n(x') λ_n(dx') ≤ R_n(x, A) ≤ (1/ε_n) ∫_A Ψ_n(x') λ_n(dx'),

for any x ∈ E, and any Borel subset A ⊆ E, hence for any nonzero µ, µ' ∈ M⁺(E), R_n µ and R_n µ' are comparable, with Radon-Nikodym derivative

d(R_n µ)/d(R_n µ')(x') = [ d(Q_n µ)/d(Q_n µ')(x') ] 1_{Ψ_n(x') > 0} ≤ ‖d(Q_n µ)/d(Q_n µ')‖,

for any x' ∈ E, and similarly with the roles of µ and µ' interchanged. Therefore

H(R_n) ≤ sup_{µ,µ'} log( ‖d(Q_n µ)/d(Q_n µ')‖ ‖d(Q_n µ')/d(Q_n µ)‖ ) = H(Q_n) ≤ log(1/ε_n⁴),

where the supremum is taken over nonzero nonnegative measures. Then using again Lemma 3.8 yields τ_n = τ(R_n) ≤ τ(Q_n).

The assumption that the nonnegative kernel R_n is mixing is much weaker than the usual assumption that both the Markov kernel Q_n is mixing and the likelihood function Ψ_n is bounded away from zero. Indeed, if the Markov kernel Q_n is mixing, then it follows from (ii) that the nonnegative kernel R_n is mixing, without any assumption on the likelihood function Ψ_n: in particular, the likelihood function could take the zero value, or even be compactly supported. This is not a necessary condition however, as illustrated by the example below, where the Markov kernel Q_n is not mixing, but the nonnegative kernel R_n is (equivalent, in a sense to be defined below, to) a mixing kernel.

Example 3.10. Assume that µ_0 has compact support C_0, and that for any n ≥ 1, the function Ψ_n has compact support C_n, and the transition probability kernel Q_n is defined by

Q_n(x, dx') = (2π)^{-m/2} exp{ -½ |x' − f_n(x)|² } dx' = q_n(x, x') λ(dx'),

where the function f_n is continuous, and where λ(dx') = (2π)^{-m/2} exp{ -½ |x'|² } dx'. Clearly, the Markov kernel Q_n is not mixing, but introducing

Δ_{n-1} = sup_{x ∈ C_{n-1}} |f_n(x)| and Δ_n = sup_{x' ∈ C_n} |x'|,

which are both finite a.s., it holds

(12) exp{ -Δ_{n-1} Δ_n − ½ Δ_{n-1}² } ≤ q_n(x, x') ≤ exp{ Δ_{n-1} Δ_n },

for any x ∈ C_{n-1} and any x' ∈ C_n. Define as usual

R_n(x, dx') = Q_n(x, dx') Ψ_n(x'),

and

R_n'(x, dx') = 1_{x ∈ C_{n-1}} R_n(x, dx') + 1_{x ∉ C_{n-1}} Ψ_n(x') λ(dx') = [ 1_{x ∈ C_{n-1}} q_n(x, x') + 1_{x ∉ C_{n-1}} ] Ψ_n(x') λ(dx'). 
Notice first that the sequence {µ_n, n ≥ 0} defined by (3) satisfies also

(13) µ_n = R_n' µ_{n-1} / (R_n' µ_{n-1})(E),

since µ_{n-1} is supported in C_{n-1}, so that the kernels R_n and R_n' coincide when integrated against µ_{n-1}. Moreover, it follows from (12) that

exp{ -Δ_{n-1} Δ_n − ½ Δ_{n-1}² } ∫_A Ψ_n(x') λ(dx') ≤ R_n'(x, A) ≤ exp{ Δ_{n-1} Δ_n } ∫_A Ψ_n(x') λ(dx'),

for any x ∈ E, and any Borel subset A ⊆ E, i.e. the nonnegative kernel R_n' is mixing. Therefore, stability and approximation properties of the sequence {µ_n, n ≥ 0} defined by (3) can be obtained directly by studying (13) instead, which involves mixing kernels.
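On a finite state space the Birkhoff contraction coefficient of Lemma 3.8 can be computed exactly: the projective diameter H(K) of a strictly positive matrix is attained on Dirac initial measures, i.e. on pairs of rows of K. The sketch below computes τ(K) = tanh(H(K)/4) and checks the bound τ ≤ (1 − ε²)/(1 + ε²) of Proposition 3.9; the random kernel and the columnwise choice of the dominating measure λ (hence of ε) are illustrative assumptions.

```python
import numpy as np

def hilbert(mu, nu):
    """Hilbert metric between strictly positive discrete measures, formula (5)."""
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    return np.log((mu / nu).max() * (nu / mu).max())

def birkhoff_tau(K):
    """tau(K) = tanh(H(K)/4) (Lemma 3.8); for a positive matrix the projective
    diameter H(K) is attained on Dirac initial measures, i.e. on pairs of rows."""
    n = len(K)
    H = max(hilbert(K[i], K[j]) for i in range(n) for j in range(n))
    return np.tanh(H / 4.0)

rng = np.random.default_rng(1)
K = rng.uniform(0.1, 1.0, size=(4, 4))
K /= K.sum(axis=1, keepdims=True)      # a strictly positive Markov kernel

# A mixing constant for K (illustrative choice): with lambda_j proportional to
# sqrt(min_i K_ij * max_i K_ij), eps*lambda <= K(x,.) <= lambda/eps columnwise.
eps = np.sqrt((K.min(axis=0) / K.max(axis=0)).min())
tau = birkhoff_tau(K)

# Contraction check: h(mu K, mu' K) <= tau * h(mu, mu') for random measures.
mu, nu = rng.uniform(0.1, 1.0, size=4), rng.uniform(0.1, 1.0, size=4)
assert hilbert(mu @ K, nu @ K) <= tau * hilbert(mu, nu) + 1e-12
```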

4. Stability of nonlinear filters In practice one rarely has access to the initial distribution of the hidden state process, hence it is important to study the stability of the filter w.r.t. its initial condition. Moreover, the answer to this question will be useful to study the stability of the filter w.r.t. the model. Let µ_n denote the filter initialized with the correct µ_0, and let µ_n' denote the filter initialized with a wrong µ_0', i.e.

µ_n = R̄_{n:1}(µ_0) and µ_n' = R̄_{n:1}(µ_0').

We are interested in the total variation error at time n induced by the initial error.

Theorem 4.1. Without any assumption on the nonnegative kernels, the following inequality holds

‖µ_n − µ_n'‖ ≤ (2 / log 3) τ_{n:m} h(µ_{m-1}, µ_{m-1}').

If in addition the nonnegative kernel R_m is mixing, then

‖µ_n − µ_n'‖ ≤ (2 / log 3) τ_{n:m+1} (1/ε_m²) ‖µ_{m-1} − µ_{m-1}'‖.

Corollary 4.2. If for any k ≥ 1, the nonnegative kernel R_k is mixing with ε_k ≥ ε > 0, then convergence holds uniformly in time, i.e.

‖µ_n − µ_n'‖ ≤ (2 / (ε² log 3)) τ^{n-m} ‖µ_{m-1} − µ_{m-1}'‖, with τ = (1 − ε²) / (1 + ε²) < 1.

Proof of Theorem 4.1. Using (8), and the definition (11) of the Birkhoff contraction coefficient, yields

(14) ‖R̄_{n:m}(µ) − R̄_{n:m}(µ')‖ ≤ (2 / log 3) h(R_{n:m} µ, R_{n:m} µ') ≤ (2 / log 3) τ_{n:m} h(µ, µ'),

for any µ, µ' ∈ P(E). If the nonnegative kernel R_m is mixing, then using (9) yields

(15) ‖R̄_{n:m}(µ) − R̄_{n:m}(µ')‖ ≤ (2 / log 3) h(R_{n:m+1} R_m µ, R_{n:m+1} R_m µ') ≤ (2 / log 3) τ_{n:m+1} h(R_m µ, R_m µ') ≤ (2 / log 3) τ_{n:m+1} (1/ε_m²) ‖µ − µ'‖.

Taking µ = µ_{m-1} and µ' = µ_{m-1}' finishes the proof.

To solve the nonlinear filtering problem, one must have a model to describe the state / observation system {X_n, n ≥ 0}, {Y_n, n ≥ 1}, as presented in Section 2. The general hidden Markov model is based on the initial condition µ_0, on the transition kernels Q_n and on the likelihood functions Ψ_n, which define the evolution operator R_n for the optimal filter µ_n. But, as for the initial condition, in practice one rarely has access to the true model. 
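The forgetting property of Theorem 4.1 and Corollary 4.2 is easy to observe numerically: run the same filter recursion from two different initial conditions on a small finite-state model with a mixing kernel, and watch the total variation error decay. The 3-state kernel and the likelihood draws below are illustrative assumptions, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# A strictly positive (hence mixing) 3-state Markov kernel, entries illustrative.
Q = rng.uniform(0.5, 1.0, size=(3, 3))
Q /= Q.sum(axis=1, keepdims=True)

def filter_step(mu, psi):
    """mu_n = R_n(mu_{n-1}) normalized: prediction mu Q, correction by Psi_n."""
    m = (mu @ Q) * psi
    return m / m.sum()

mu = np.array([1.0, 0.0, 0.0])   # filter started from the correct mu_0
nu = np.array([0.0, 0.0, 1.0])   # filter started from a wrong mu_0'

tv = []
for n in range(20):
    psi = rng.uniform(0.5, 1.0, size=3)   # likelihood values Psi_n (illustrative)
    mu, nu = filter_step(mu, psi), filter_step(nu, psi)
    tv.append(np.abs(mu - nu).sum())      # total variation norm ||mu_n - mu_n'||
```

Consistently with Proposition 3.9 (ii), the geometric decay rate does not depend on the likelihood values, only on the mixing constant of Q.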
In particular, the prior information on the state sequence is in general unknown and the choice of Q_n is approximate. Similarly, the probabilistic relation between the observation and the state is in general unknown and the choice of Ψ_n is also approximate. As a result, instead of using the true model, it is common to work with a wrong model, based on a wrong transition kernel Q_n' and a wrong likelihood function Ψ_n', which define the evolution operator R_n' for a wrong filter µ_n'. Another situation is when the evolution operator R_n is known, but difficult to compute. For the purpose of practical implementation, one constructs an approximate filter µ_n' such that the evolution µ_{n-1}' → µ_n' is easy to compute and close to the true evolution µ_{n-1}' → R̄_n(µ_{n-1}'). We are interested in bounding the global error between µ_n and µ_n' induced by the local errors committed at each time step. We suppose here that µ_0' = µ_0, since the problem of a wrong initialization has already been studied above. In full generality, we assume that {µ_n', n ≥ 0} is a random sequence with values in

$P(E)$, satisfying the following property: for any $n \ge k \ge 1$ and for any bounded measurable function $F$ defined on $P(E)$
(16) $$\mathbb{E}[\,F(\mu'_k) \mid Y_{1:n}\,] = \mathbb{E}[\,F(\mu'_k) \mid Y_{1:k}\,].$$
The results stated below are based on the following decomposition of the global error into a sum of local errors transported by a sequence of normalized evolution operators
(17) $$\mu'_n - \mu_n = \sum_{k=1}^n \big[\, R_{n:k+1}(\mu'_k) - R_{n:k}(\mu'_{k-1}) \,\big] = \sum_{k=1}^n \big[\, R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1}) \,\big].$$
This equation shows the close relation between the stability w.r.t. the initial condition and the stability w.r.t. the model. Let us consider first the case where we can estimate the local error in the sense of the Hilbert metric.

Assumption H (local error bound in the Hilbert metric):
$$\delta^H_k := \mathbb{E}[\, h(\mu'_k,\, R_k(\mu'_{k-1})) \mid Y_{1:k}\,] < \infty.$$

Remark 4.3. If the evolution of the wrong filter $\mu'_k$ is defined by the nonnegative kernel $R'_k(x, dx') = Q'_k(x, dx')\, \Psi'_k(x')$, and if $Q_k(x, dx') = q_k(x, x')\, \lambda_k(dx')$ and $Q'_k(x, dx') = q'_k(x, x')\, \lambda_k(dx')$, then a sufficient condition for Assumption H to hold is that there exist constants $\delta_k \ge 0$ and $a_k > 0$ such that
$$a_k \le \frac{\Psi'_k(x')\, q'_k(x, x')}{\Psi_k(x')\, q_k(x, x')} \le a_k \exp(\delta_k), \qquad \text{for all } x, x' \in E,$$
in which case $\delta^H_k \le \delta_k$.

Theorem 4.4. If for any $k \ge 1$ Assumption H holds, then
(18) $$\mathbb{E}[\, \|\mu_n - \mu'_n\| \mid Y_{1:n}\,] \le \frac{2}{\log 3} \sum_{k=1}^n \tau_{n:k+1}\, \delta^H_k.$$

Corollary 4.5. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with $\varepsilon_k \ge \varepsilon > 0$, and Assumption H holds with $\delta^H_k \le \delta$, then convergence holds uniformly in time, i.e.
(19) $$\mathbb{E}[\, \|\mu_n - \mu'_n\| \mid Y_{1:n}\,] \le \frac{2}{\varepsilon^2 \log 3}\, \delta.$$
Indeed, (19) follows from
$$\sum_{k=1}^n \tau^{\,n-k} = \frac{1-\tau^n}{1-\tau} \le \frac{1}{1-\tau} = \frac{1+\varepsilon^2}{2\,\varepsilon^2} \le \frac{1}{\varepsilon^2}.$$

Proof of Theorem 4.4. Using the decomposition (17), the triangle inequality, and estimate (14), yields
$$\|\mu_n - \mu'_n\| \le \sum_{k=1}^n \|R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1})\| \le \frac{2}{\log 3} \sum_{k=1}^n \tau_{n:k+1}\, h(\mu'_k,\, R_k(\mu'_{k-1})).$$
Taking conditional expectation w.r.t. the observations and using (16) yields (18).
Let us consider next the case where we can estimate the local error in the sense of the total variation norm:
$$\delta^{TV}_k := \mathbb{E}[\, \|\mu'_k - R_k(\mu'_{k-1})\| \mid Y_{1:k}\,] \le 2.$$

Theorem 4.6. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing, then
(20) $$\mathbb{E}[\, \|\mu_n - \mu'_n\| \mid Y_{1:n}\,] \le \delta^{TV}_n + \frac{2}{\log 3} \sum_{k=1}^{n-1} \tau_{n:k+2}\, \frac{\delta^{TV}_k}{\varepsilon^2_{k+1}}.$$

Corollary 4.7. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with $\varepsilon_k \ge \varepsilon > 0$, and $\delta^{TV}_k \le \delta$, then convergence holds uniformly in time, i.e.
$$\mathbb{E}[\, \|\mu_n - \mu'_n\| \mid Y_{1:n}\,] \le \Big( 1 + \frac{2}{\varepsilon^4 \log 3} \Big)\, \delta.$$

Proof of Theorem 4.6. The decomposition (17) is written as
(21) $$\mu'_n - \mu_n = \big[\, \mu'_n - R_n(\mu'_{n-1}) \,\big] + \sum_{k=1}^{n-1} \big[\, R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1}) \,\big],$$
hence using the triangle inequality and estimate (15) yields
$$\|\mu_n - \mu'_n\| \le \|\mu'_n - R_n(\mu'_{n-1})\| + \sum_{k=1}^{n-1} \|R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1})\| \le \|\mu'_n - R_n(\mu'_{n-1})\| + \frac{2}{\log 3} \sum_{k=1}^{n-1} \tau_{n:k+2}\, \frac{1}{\varepsilon^2_{k+1}}\, \|\mu'_k - R_k(\mu'_{k-1})\|.$$
Taking conditional expectation w.r.t. the observations and using (16) yields (20).

Let us consider finally the case where we can only estimate the local error in the weak sense:
$$\delta^W_k := \sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu'_k - R_k(\mu'_{k-1}), \phi\rangle| \mid Y_{1:k}\,] \le 2.$$
This typically happens if the approximate filter $\mu'_k$ is an empirical probability distribution associated with $R_k(\mu'_{k-1})$: in this case, bounding the local error requires the law of large numbers, which can only provide estimates in the weak sense. However, if the nonnegative kernel $R_{k+1}$ is dominated, then using Lemma 3.5, the local error transported by $R_{k+1}$ can be bounded in total variation with the same precision $\delta^W_k$ as in the weak sense.

Theorem 4.8. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing, then
(22) $$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu'_n, \phi\rangle| \mid Y_{1:n}\,] \le \delta^W_n + \frac{2\,\delta^W_{n-1}}{\varepsilon^2_n} + \frac{4}{\log 3} \sum_{k=1}^{n-2} \tau_{n:k+3}\, \frac{\delta^W_k}{\varepsilon^2_{k+2}\,\varepsilon^2_{k+1}}.$$

Corollary 4.9. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with $\varepsilon_k \ge \varepsilon > 0$, and $\delta^W_k \le \delta$, then convergence holds uniformly in time, i.e.
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu'_n, \phi\rangle| \mid Y_{1:n}\,] \le \Big( 1 + \frac{2}{\varepsilon^2} + \frac{4}{\varepsilon^6 \log 3} \Big)\, \delta.$$

Proof of Theorem 4.8. Using the decomposition (21) and the triangle inequality yields
(23) $$|\langle \mu_n - \mu'_n, \phi\rangle| \le |\langle \mu'_n - R_n(\mu'_{n-1}), \phi\rangle| + \sum_{k=1}^{n-1} \|R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1})\| \; \|\phi\|.$$

For any $1 \le k \le n-2$, using estimate (15) yields
$$\|R_{n:k+1}(\mu'_k) - R_{n:k+1}\,R_k\,(\mu'_{k-1})\| = \|R_{n:k+2}\,R_{k+1}\,(\mu'_k) - R_{n:k+2}\,R_{k+1}\,R_k\,(\mu'_{k-1})\| \le \frac{2}{\log 3}\,\tau_{n:k+3}\,\frac{1}{\varepsilon^2_{k+2}}\,\|R_{k+1}(\mu'_k) - R_{k+1}\,R_k\,(\mu'_{k-1})\|.$$
For any $1 \le k \le n-1$, using estimate (7) yields
$$\|R_{k+1}(\mu'_k) - R_{k+1}\,R_k\,(\mu'_{k-1})\| \le \frac{2\,\|R_{k+1}\,(\mu'_k - R_k(\mu'_{k-1}))\|}{(R_{k+1}\,\mu'_k)(E)},$$
and the mixing property yields $(R_{k+1}\,\mu'_k)(E) \ge \varepsilon_{k+1}\,\lambda_{k+1}(E)$. Taking conditional expectation w.r.t. the observations, using estimate (10) with $K = R_{k+1}$, $\mu = R_k(\mu'_{k-1})$, $\mu' = \mu'_k$ and $\mathcal{F} = \sigma(Y_{1:n})$, and using (16), yields
$$\mathbb{E}[\, \|R_{k+1}\,(\mu'_k - R_k(\mu'_{k-1}))\| \mid Y_{1:n}\,] \le \frac{\lambda_{k+1}(E)}{\varepsilon_{k+1}}\, \sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu'_k - R_k(\mu'_{k-1}), \phi\rangle| \mid Y_{1:k}\,] = \frac{\lambda_{k+1}(E)}{\varepsilon_{k+1}}\,\delta^W_k.$$
Combining these estimates yields
$$\mathbb{E}[\, \|R_{k+1}(\mu'_k) - R_{k+1}\,R_k\,(\mu'_{k-1})\| \mid Y_{1:n}\,] \le \frac{2\,\delta^W_k}{\varepsilon^2_{k+1}}.$$
Finally, taking conditional expectation w.r.t. the observations in (23) yields (22).

5. Uniform convergence of interacting particle filters

In this section and in the next section, we consider again the framework introduced in Section 4, but now the wrong model is chosen deliberately, such that the wrong filter can easily be computed and remains close to the optimal filter. More specifically, we are interested in particle methods to approximate the optimal filter numerically, and we provide estimates of the approximation error. The idea common to all particle filters is to generate an $N$-sample $(\xi^1_{n|n-1}, \dots, \xi^N_{n|n-1})$ of i.i.d. random variables, called a particle system, with common probability distribution $Q_n\,\mu^N_{n-1}$, where $\mu^N_{n-1}$ is an approximation of $\mu_{n-1}$, and to use the corresponding empirical probability distribution
$$\mu^N_{n|n-1} = \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i_{n|n-1}}$$
as an approximation of $\mu_{n|n-1} = Q_n\,\mu_{n-1}$. The method is very easy to implement, even in high dimensional problems, since it is sufficient in principle to simulate independent samples of the hidden state sequence.
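As a minimal numerical illustration of this sampling step (a toy sketch: the standard Gaussian distribution and the bounded test function $\phi = \sin$ are our own choices, not from the paper), the weak error of the empirical measure indeed behaves like $1/\sqrt{N}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_error(N, trials=2000):
    """Monte Carlo estimate of E|<S^N(mu) - mu, phi>| for mu = N(0,1)
    and phi(x) = sin(x), so that ||phi|| <= 1."""
    xs = rng.standard_normal((trials, N))
    est = np.sin(xs).mean(axis=1)   # <S^N(mu), phi> for each trial
    true = 0.0                      # <mu, phi> = E[sin(X)] = 0 by symmetry
    return np.abs(est - true).mean()

# The weak error decreases at the Monte Carlo rate 1/sqrt(N):
e_small, e_large = empirical_error(100), empirical_error(10_000)
assert e_small <= 1.0 / np.sqrt(100)      # error <= ||phi|| / sqrt(N)
assert e_large <= 1.0 / np.sqrt(10_000)
assert e_large < e_small
```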
A major early contribution in this field was made by Gordon, Salmond and Smith [15], who proposed to use sampling / importance resampling (SIR) techniques in the correction step: the positive effect of the resampling step is to automatically select particles with larger values of the likelihood function, i.e. to concentrate particles in regions of interest of the state space. A very complete account of the currently available mathematical results can be found in the survey paper by Del Moral and Miclo [11]. Theoretical and practical aspects can be found in the volume edited by Doucet, de Freitas and Gordon [14].

5.1. Notations and preliminary results

Throughout the paper, $S^N(\mu)$ is a shorthand notation for the empirical probability distribution of an $N$-sample with probability distribution $\mu$, i.e.
$$S^N(\mu) := \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i} \qquad\text{with}\qquad (\xi^1, \dots, \xi^N) \text{ i.i.d.} \sim \mu.$$

Lemma 5.1. For any $\mu \in P(E)$,
$$\sup_{\|\phi\| \le 1} \mathbb{E}\,|\langle S^N(\mu) - \mu, \phi\rangle| \le \frac{1}{\sqrt{N}}.$$

Proof of Lemma 5.1. It holds
$$\langle S^N(\mu) - \mu, \phi \rangle = \frac{1}{N} \sum_{i=1}^N \big[\, \phi(\xi^i) - \langle \mu, \phi\rangle \,\big],$$
hence
$$\mathbb{E}\,|\langle S^N(\mu) - \mu, \phi\rangle|^2 = \frac{1}{N}\, \big[\, \langle\mu, \phi^2\rangle - \langle\mu, \phi\rangle^2 \,\big] \le \frac{1}{N}\, \|\phi\|^2.$$

Remark 5.2. If in addition $\phi$ and $\mu$ are $\mathcal{F}$-measurable r.v.'s, and if conditionally w.r.t. $\mathcal{F}$ the r.v.'s $(\xi^1, \dots, \xi^N)$ are i.i.d. with (conditional) probability distribution $\mu$, then the same estimate holds for the conditional expectation w.r.t. $\mathcal{F}$, i.e.
(24) $$\mathbb{E}[\, |\langle S^N(\mu) - \mu, \phi\rangle| \mid \mathcal{F}\,] \le \frac{1}{\sqrt{N}}\, \|\phi\|.$$

For any nonnegative and bounded measurable function $\Lambda$ defined on $E$, and any probability distribution $\mu$ defined on $E$, the projective product $\Lambda \star \mu$ is defined by
$$\Lambda \star \mu := \begin{cases} \dfrac{\Lambda\,\mu}{\langle \mu, \Lambda\rangle}, & \text{if } \langle \mu, \Lambda\rangle > 0, \\[1ex] \nu, & \text{otherwise}, \end{cases}$$
where $\nu$ is an arbitrary probability distribution defined on $E$. If $\langle\mu, \Lambda\rangle > 0$, it follows immediately from Lemma 5.1, using estimate (6), that
$$\sup_{\|\phi\| \le 1} \mathbb{E}\,|\langle \Lambda \star S^N(\mu) - \Lambda \star \mu, \phi\rangle| \le 2\, \sup_{\|\phi\| \le 1} \mathbb{E}\,\Big|\Big\langle S^N(\mu) - \mu, \frac{\Lambda\,\phi}{\langle\mu,\Lambda\rangle}\Big\rangle\Big| \le \frac{2}{\sqrt{N}}\, \frac{\sup_{x} \Lambda(x)}{\langle\mu, \Lambda\rangle}.$$

Remark 5.3. If in addition $\phi$, $\Lambda$ and $\mu$ are $\mathcal{F}$-measurable r.v.'s, and if conditionally w.r.t. $\mathcal{F}$ the r.v.'s $(\xi^1, \dots, \xi^N)$ are i.i.d. with (conditional) probability distribution $\mu$, then the same estimate holds for the conditional expectation w.r.t. $\mathcal{F}$, i.e.
(25) $$\mathbb{E}[\, |\langle \Lambda \star S^N(\mu) - \Lambda \star \mu, \phi\rangle| \mid \mathcal{F}\,] \le \frac{2}{\sqrt{N}}\, \frac{\sup_{x} \Lambda(x)}{\langle\mu, \Lambda\rangle}\, \|\phi\|.$$

The following procedure, classical in sequential analysis, can be used alternatively.

Lemma 5.4. Let $\mu \in P(E)$, and let $\Lambda$ be a nonnegative bounded measurable function defined on $E$, such that $\langle\mu, \Lambda\rangle > 0$. For any $\delta > 0$, define the stopping time
$$T = \inf\Big\{\, N \ge 1 \; : \; \delta^2 \sum_{i=1}^N \Lambda(\xi^i) \ge \sup_{x} \Lambda(x) \,\Big\} \qquad\text{with}\qquad (\xi^1, \dots, \xi^N, \dots) \text{ i.i.d.} \sim \mu.$$
Then
$$\sup_{\|\phi\| \le 1} \mathbb{E}\,|\langle \Lambda \star S^T(\mu) - \Lambda \star \mu, \phi\rangle| \le 2\,\delta\,\sqrt{1 + \delta^2}.$$
To obtain an error estimate of order $O(\delta)$, the expected sample size should be of order $O(1/\delta^2)$, i.e.
$$\frac{\rho}{\delta^2} \le \mathbb{E}[T] \le \frac{\rho}{\delta^2}\,(1 + \delta^2), \qquad\text{where}\qquad \rho = \frac{\sup_{x} \Lambda(x)}{\langle\mu, \Lambda\rangle}.$$
The method proposed here to approximate the posterior probability distribution $\Lambda \star \mu$ is somehow intermediate between the classical importance sampling method, which uses a fixed number of random variables, and the acceptance / rejection method, which requires a random number of random variables. In Lemma 5.4, the number of random variables generated is random, as in the acceptance / rejection method, but there is no rejection, since all the random variables generated are explicitly used in the approximation, as in the importance sampling method.

Proof of Lemma 5.4. Notice first that a.s.
$$\langle S^N(\mu), \Lambda\rangle = \frac{1}{N} \sum_{i=1}^N \Lambda(\xi^i) \longrightarrow \langle \mu, \Lambda\rangle > 0, \qquad\text{as } N \to \infty,$$
hence the stopping time $T$ is a.s. finite. By definition $\langle S^T(\mu), \Lambda\rangle > 0$, hence using estimate (6) yields
$$\sup_{\|\phi\| \le 1} |\langle \Lambda \star S^T(\mu) - \Lambda \star \mu, \phi\rangle| \le 2\, \frac{|\langle S^T(\mu) - \mu, \Lambda\,\phi\rangle|}{\langle S^T(\mu), \Lambda\rangle},$$
and we define
$$M_N = \sum_{i=1}^N \big[\, \Lambda(\xi^i)\,\phi(\xi^i) - \langle \mu, \Lambda\,\phi\rangle \,\big] \qquad\text{and}\qquad D_N = \sum_{i=1}^N \Lambda(\xi^i).$$
By definition of the stopping time $T$, it holds
$$\frac{\lambda}{\delta^2} \le D_T = D_{T-1} + \Lambda(\xi^T) \le \frac{\lambda}{\delta^2} + \lambda = \frac{\lambda}{\delta^2}\,(1 + \delta^2),$$
where $\lambda = \sup_{x} \Lambda(x)$, and the Cauchy-Schwarz inequality yields
$$\mathbb{E}\,\Big|\frac{M_T}{D_T}\Big| \le \frac{\delta^2}{\lambda}\, \big(\mathbb{E}[M^2_T]\big)^{1/2}.$$
In addition, for any $a > 0$,
$$P(T > N) \le \exp\{a\,\lambda/\delta^2\}\; r^N, \qquad\text{where}\qquad r = \int_E \exp\{-a\,\Lambda(x)\}\, \mu(dx) < 1,$$
hence the stopping time $T$ is integrable. It follows from the Wald identity, see e.g. Neveu [25, Proposition IV-4-21], that
$$\frac{\lambda}{\delta^2} \le \mathbb{E}[D_T] = \mathbb{E}[T]\, \langle \mu, \Lambda\rangle \le \frac{\lambda}{\delta^2}\,(1 + \delta^2),$$
and
$$\mathbb{E}[M^2_T] = \mathbb{E}[T]\, \big[\, \langle \mu, \Lambda^2\,\phi^2\rangle - \langle \mu, \Lambda\,\phi\rangle^2 \,\big] \le \mathbb{E}[T]\, \langle \mu, \Lambda\rangle\, \lambda\, \|\phi\|^2 \le \frac{\lambda^2}{\delta^2}\,(1 + \delta^2)\, \|\phi\|^2,$$
hence
$$\mathbb{E}\,\Big| \frac{\langle S^T(\mu) - \mu, \Lambda\,\phi\rangle}{\langle S^T(\mu), \Lambda\rangle} \Big| = \mathbb{E}\,\Big|\frac{M_T}{D_T}\Big| \le \delta\,\sqrt{1 + \delta^2}\; \|\phi\| \qquad\text{and}\qquad \frac{\rho}{\delta^2} \le \mathbb{E}[T] \le \frac{\rho}{\delta^2}\,(1 + \delta^2).$$
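The stopping rule of Lemma 5.4 is straightforward to implement. The sketch below is a toy example (the Gaussian prior and the localized likelihood are our own illustrative choices, not the paper's): variables are drawn one at a time until $\delta^2 \sum_i \Lambda(\xi^i) \ge \sup_x \Lambda(x)$, and the weighted sample approximates $\Lambda \star \mu$ while the sample size scales like $1/\delta^2$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sequential_sample(draw, Lam, sup_Lam, delta):
    """Sampling with the stopping time T of Lemma 5.4: draw from mu one
    variable at a time until delta^2 * sum_i Lam(xi^i) >= sup_x Lam(x),
    then return particles and normalized weights approximating Lam * mu."""
    xs, total = [], 0.0
    while delta**2 * total < sup_Lam:
        x = draw()
        xs.append(x)
        total += Lam(x)
    xs = np.asarray(xs)
    w = np.array([Lam(x) for x in xs])
    return xs, w / w.sum()

# Toy example: mu = N(0,1), localized likelihood Lam(x) = exp(-8 (x-1)^2);
# then Lam * mu is Gaussian with mean 16/17 (posterior precision 1 + 16 = 17).
Lam = lambda x: float(np.exp(-8.0 * (x - 1.0) ** 2))
delta = 0.05
xs, w = sequential_sample(rng.standard_normal, Lam, 1.0, delta)

post_mean = float(np.sum(w * xs))
assert abs(post_mean - 16.0 / 17.0) < 0.1   # close to the exact posterior mean
assert len(xs) >= int(1.0 / delta**2)       # T >= sup Lam / (delta^2 max_i Lam(xi^i))
```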

Remark 5.5. If in addition $\phi$, $\Lambda$ and $\mu$ are $\mathcal{F}$-measurable r.v.'s, and if conditionally w.r.t. $\mathcal{F}$ the r.v.'s $(\xi^1, \dots, \xi^N, \dots)$ are i.i.d. with (conditional) probability distribution $\mu$, then the same estimate holds for the conditional expectation w.r.t. $\mathcal{F}$, i.e.
(26) $$\mathbb{E}[\, |\langle \Lambda \star S^T(\mu) - \Lambda \star \mu, \phi\rangle| \mid \mathcal{F}\,] \le 2\,\delta\,\sqrt{1 + \delta^2}\; \|\phi\|, \qquad\text{and}\qquad \frac{\rho}{\delta^2} \le \mathbb{E}[T \mid \mathcal{F}] \le \frac{\rho}{\delta^2}\,(1 + \delta^2).$$

5.2. Interacting particle filter

Let $\mu^N_n$ denote the interacting particle filter (IPF) approximation of $\mu_n$. Initially $\mu^N_0 = \mu_0$, and the transition from $\mu^N_{n-1}$ to $\mu^N_n$ is described by the following diagram
$$\mu^N_{n-1} \;\xrightarrow{\ \text{sampled prediction}\ }\; \mu^N_{n|n-1} = S^N(Q_n\,\mu^N_{n-1}) \;\xrightarrow{\ \text{correction}\ }\; \mu^N_n = \Psi_n \star \mu^N_{n|n-1}.$$
In practice, the particle approximation
$$\mu^N_{n|n-1} = \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i_{n|n-1}}$$
is completely characterized by the particle system $(\xi^1_{n|n-1}, \dots, \xi^N_{n|n-1})$, and the transition from $(\xi^1_{n|n-1}, \dots, \xi^N_{n|n-1})$ to $(\xi^1_{n+1|n}, \dots, \xi^N_{n+1|n})$ consists of the following steps.

(i) Correction: if the normalization constant
$$c_n = \sum_{i=1}^N \Psi_n(\xi^i_{n|n-1})$$
is positive, then for all $i = 1, \dots, N$ compute the weight
$$\omega^i_n = \frac{1}{c_n}\, \Psi_n(\xi^i_{n|n-1}),$$
and set
$$\mu^N_n = \sum_{i=1}^N \omega^i_n\, \delta_{\xi^i_{n|n-1}};$$
otherwise set $\mu^N_n = \nu$.

(ii) Sampled prediction: independently for all $i = 1, \dots, N$, generate a r.v. $\xi^i_{n+1|n} \sim Q_{n+1}\,\mu^N_n$, and set
$$\mu^N_{n+1|n} = S^N(Q_{n+1}\,\mu^N_n) = \frac{1}{N} \sum_{i=1}^N \delta_{\xi^i_{n+1|n}}.$$
The resampling step (ii) can easily be implemented: it requires generating random variables either according to a weighted discrete probability distribution, or according to the arbitrary restarting probability distribution $\nu$. Notice that the IPF satisfies (16).

Remark 5.6. Without the reinitialization procedure, proposed initially by Del Moral, Jacod and Protter [10], the normalization constant $c_n$ could take the zero value, since the likelihood function $\Psi_n$ is not necessarily positive, and $\mu^N_n = \Psi_n \star \mu^N_{n|n-1}$ would not be a well defined probability distribution. By construction, the sequential particle filter defined at the end of this section does not run into this problem.
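Steps (i) and (ii) above translate directly into code. The following sketch runs the IPF transition on a toy scalar linear Gaussian model; the model, noise levels and the uniform choice for the reinitialization measure $\nu$ are illustrative assumptions of ours, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5000

def ipf_step(particles, y, q_sample, psi):
    """One IPF transition, following steps (i)-(ii): correction by the
    likelihood, then sampled prediction. `particles` approximate the
    predictor mu_{n|n-1}; returns particles approximating mu_{n+1|n}."""
    # (i) correction: weights omega^i_n = Psi_n(xi^i) / c_n
    w = psi(particles, y)
    c = w.sum()
    if c == 0.0:                    # reinitialization with nu (Remark 5.6);
        w = np.full(N, 1.0 / N)     # here nu = uniform over the particles (our choice)
    else:
        w = w / c
    # (ii) sampled prediction: sample from Q_{n+1} mu^N_n, i.e. pick an index
    # according to the weights, then propagate through the transition kernel
    idx = rng.choice(N, size=N, p=w)
    return q_sample(particles[idx])

# Toy scalar model: X_{n+1} = 0.9 X_n + noise, Y_n = X_n + noise.
q_sample = lambda x: 0.9 * x + 0.5 * rng.standard_normal(x.shape)
psi = lambda x, y: np.exp(-0.5 * (y - x) ** 2 / 0.1)   # accurate measurements

x_true, particles = 0.0, rng.standard_normal(N)
for _ in range(20):
    x_true = 0.9 * x_true + 0.5 * rng.standard_normal()
    y = x_true + np.sqrt(0.1) * rng.standard_normal()
    particles = ipf_step(particles, y, q_sample, psi)

# The predicted particle cloud should stay near the predicted true state.
assert abs(particles.mean() - 0.9 * x_true) < 1.0
```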

Remark 5.7. If the nonnegative kernel $R_n$ is mixing, then
$$\inf_{\mu \in P(E)} \langle Q_n\,\mu, \Psi_n\rangle = \inf_{\mu \in P(E)} (R_n\,\mu)(E) \ge \varepsilon^2_n\, (R_n\,\mu_{n-1})(E) = \varepsilon^2_n\, \langle \mu_{n|n-1}, \Psi_n\rangle,$$
hence a.s.
$$\inf_{\mu \in P(E)} \langle Q_n\,\mu, \Psi_n\rangle > 0,$$
in view of Remark 2.1.

Without loss of generality, it is assumed that the likelihood function is bounded.

Assumption L: $\sup_{x \in E} \Psi_k(x) < \infty$.

If Assumption L holds, and if for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing, then the following notation is introduced
$$\rho_k := \frac{\sup_{x} \Psi_k(x)}{\inf_{\mu \in P(E)} \langle Q_k\,\mu, \Psi_k\rangle},$$
and in view of Remark 5.7, $\rho_k$ is a.s. finite.

Theorem 5.8. If for any $k \ge 1$ Assumption L holds, and the nonnegative kernel $R_k$ is mixing, then the IPF estimator satisfies
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu^N_n, \phi\rangle| \mid Y_{1:n}\,] \le \delta^W_n + \frac{2\,\delta^W_{n-1}}{\varepsilon^2_n} + \frac{4}{\log 3} \sum_{k=1}^{n-2} \tau_{n:k+3}\, \frac{\delta^W_k}{\varepsilon^2_{k+2}\,\varepsilon^2_{k+1}},$$
where for any $k \ge 1$
$$\delta^W_k \le \frac{1}{\sqrt{N}}\; 2\,\rho_k.$$
The convergence result stated in Theorem 5.8 would still hold with a time dependent number of particles.

Remark 5.9. If the transition kernel $Q_{n+1}$ is dominated, i.e. $Q_{n+1}(x, \cdot)$ is absolutely continuous w.r.t. $\lambda_{n+1} \in M^+(E)$, with density $q_{n+1}(x, \cdot)$ bounded by $c_{n+1}$ for any $x \in E$, then convergence in the weak sense of the particle filter can be used to prove convergence in total variation of the particle predictor. Indeed, using Lemma 3.5 yields
$$\mathbb{E}[\, \|\mu_{n+1|n} - Q_{n+1}\,\mu^N_n\| \mid Y_{1:n}\,] \le c_{n+1}\, \lambda_{n+1}(E)\, \sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu^N_n, \phi\rangle| \mid Y_{1:n}\,],$$
where both $\mu_{n+1|n}$ and $Q_{n+1}\,\mu^N_n$ are absolutely continuous w.r.t. $\lambda_{n+1}$, and
$$\frac{d(Q_{n+1}\,\mu^N_n)}{d\lambda_{n+1}}(x') = \sum_{i=1}^N \omega^i_n\, q_{n+1}(\xi^i_{n|n-1}, x'),$$
for any $x' \in E$, which can easily be computed.

Remark 5.10. In general, it is not realistic to assume that the r.v.'s $\rho_k$ are uniformly bounded, hence it seems difficult to guarantee that convergence holds uniformly in time for a given observation sequence. On the other hand, averaging over observation sequences makes it possible to obtain convergence uniformly in time, under more realistic assumptions. Indeed, if for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with nonrandom $\varepsilon_k$, and $\mathbb{E}[\rho_k]$ is finite, then
$$\sup_{\|\phi\| \le 1} \mathbb{E}\,|\langle \mu_n - \mu^N_n, \phi\rangle| \le \bar\delta_n + \frac{2\,\bar\delta_{n-1}}{\varepsilon^2_n} + \frac{4}{\log 3} \sum_{k=1}^{n-2} \tau_{n:k+3}\, \frac{\bar\delta_k}{\varepsilon^2_{k+2}\,\varepsilon^2_{k+1}},$$
where for any $k \ge 1$
$$\bar\delta_k = \mathbb{E}[\delta^W_k] \le \frac{1}{\sqrt{N}}\; 2\,\mathbb{E}[\rho_k].$$

Remark 5.11. Notice that, if the nonnegative kernel $R_k$ is mixing, then
$$\rho_k \le \frac{\sup_{x} \Psi_k(x)}{\varepsilon^2_k\, \langle \mu_{k|k-1}, \Psi_k\rangle},$$
and it follows from Remark 2.2 that
$$\mathbb{E}\Big[\, \frac{\sup_{x} \Psi_k(x)}{\langle \mu_{k|k-1}, \Psi_k\rangle} \,\Big|\, Y_{1:k-1}\Big] = \int_F \big[\sup_{x} g_k(x, y)\big]\, \lambda^F_k(dy),$$
hence a necessary and sufficient condition for $\mathbb{E}[\rho_k]$ to be finite is
$$\int_F \big[\sup_{x} g_k(x, y)\big]\, \lambda^F_k(dy) < \infty.$$

Corollary 5.12. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with nonrandom $\varepsilon_k \ge \varepsilon > 0$, and $\mathbb{E}[\rho_k] \le \rho$, then convergence, averaged over observation sequences, holds uniformly in time, i.e.
$$\sup_{\|\phi\| \le 1} \mathbb{E}\,|\langle \mu_n - \mu^N_n, \phi\rangle| \le \Big( 1 + \frac{2}{\varepsilon^2} + \frac{4}{\varepsilon^6 \log 3} \Big)\, \bar\delta \qquad\text{with}\qquad \bar\delta \le \frac{1}{\sqrt{N}}\; 2\,\rho.$$

Proof of Theorem 5.8. It is sufficient to bound the local error $\delta^W_k$ in the weak sense, and to apply Theorem 4.8. Since $R_k$ is mixing, $\langle Q_k\,\mu^N_{k-1}, \Psi_k\rangle > 0$ in view of Remark 5.7. Using estimate (25) with $\Lambda = \Psi_k$, $\mu = Q_k\,\mu^N_{k-1}$ and $\mathcal{F} = \sigma(Y_{1:k}, \mu^N_{k-1})$, yields
(27) $$\mathbb{E}[\, |\langle \mu^N_k - R_k(\mu^N_{k-1}), \phi\rangle| \mid Y_{1:k}, \mu^N_{k-1}\,] = \mathbb{E}[\, |\langle \Psi_k \star S^N(Q_k\,\mu^N_{k-1}) - \Psi_k \star (Q_k\,\mu^N_{k-1}), \phi\rangle| \mid Y_{1:k}, \mu^N_{k-1}\,] \le \frac{2}{\sqrt{N}}\, \frac{\sup_{x} \Psi_k(x)}{\langle Q_k\,\mu^N_{k-1}, \Psi_k\rangle}\, \|\phi\| \le \frac{1}{\sqrt{N}}\; 2\,\rho_k\, \|\phi\|.$$

Remark 5.13. Let $\eta^N_n$ denote the interacting particle system (IPS) approximation of $\eta_n$ considered in [11] and in other works by the same authors. Initially $\eta^N_0 = S^N(\eta_0)$, and the transition from $\eta^N_{n-1}$ to $\eta^N_n$ is described by the following diagram
$$\eta^N_{n-1} \;\xrightarrow{\ \text{correction}\ }\; \bar\eta^N_{n-1} = \Psi_{n-1} \star \eta^N_{n-1} \;\xrightarrow{\ \text{sampled prediction}\ }\; \eta^N_n = S^N(Q_n\,\bar\eta^N_{n-1}).$$

Clearly $\bar\eta^N_0 = \Psi_0 \star \eta^N_0 = \Psi_0 \star S^N(\eta_0)$, and the transition from $\bar\eta^N_{n-1}$ to $\bar\eta^N_n$ is described by the following diagram
$$\bar\eta^N_{n-1} \;\xrightarrow{\ \text{sampled prediction}\ }\; \eta^N_n = S^N(Q_n\,\bar\eta^N_{n-1}) \;\xrightarrow{\ \text{correction}\ }\; \bar\eta^N_n = \Psi_n \star \eta^N_n,$$
which involves exactly the same steps as the transition from $\mu^N_{n-1}$ to $\mu^N_n$ described above: only the initial conditions $\bar\eta^N_0 = \Psi_0 \star S^N(\eta_0)$ and $\mu^N_0 = \mu_0$ are different. Using the following decomposition of the global error into an initial error and a sum of local errors transported by a sequence of normalized evolution operators
$$\bar\eta^N_n - \bar\eta_n = \sum_{k=1}^n \big[\, R_{n:k+1}(\bar\eta^N_k) - R_{n:k}(\bar\eta^N_{k-1}) \,\big] + \big[\, R_{n:1}(\bar\eta^N_0) - R_{n:1}(\bar\eta_0) \,\big],$$
and proceeding as in the proof of Theorem 4.8, yields under the assumptions of Theorem 5.8
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \bar\eta_n - \bar\eta^N_n, \phi\rangle| \mid Y_{0:n}\,] \le \delta^W_n + \frac{2\,\delta^W_{n-1}}{\varepsilon^2_n} + \frac{4}{\log 3} \sum_{k=0}^{n-2} \tau_{n:k+3}\, \frac{\delta^W_k}{\varepsilon^2_{k+2}\,\varepsilon^2_{k+1}},$$
where for any $k \ge 0$
$$\delta^W_k \le \frac{1}{\sqrt{N}}\; 2\,\rho_k \qquad\text{and}\qquad \rho_0 := \frac{\sup_{x} \Psi_0(x)}{\langle \mu_0, \Psi_0\rangle}.$$
Finally, notice that
$$\eta_{n+1} - \eta^N_{n+1} = \big[\, Q_{n+1}\,\bar\eta^N_n - S^N(Q_{n+1}\,\bar\eta^N_n) \,\big] + Q_{n+1}\,(\bar\eta_n - \bar\eta^N_n),$$
hence
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \eta_{n+1} - \eta^N_{n+1}, \phi\rangle| \mid Y_{0:n}\,] \le \frac{1}{\sqrt{N}} + \frac{4}{\log 3}\, \tau_{n:3}\, \frac{\delta^W_0}{\varepsilon^2_2\,\varepsilon^2_1} + \sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \bar\eta_n - \bar\eta^N_n, \phi\rangle| \mid Y_{0:n}\,].$$
This proves the uniform convergence of the IPS approximation to the optimal predictor, with rate $1/\sqrt{N}$, for exactly the model considered in [9, 11], under our weaker mixing assumption.

In the proof of Theorem 5.8, if we use $\langle S^N(Q_k\,\mu^N_{k-1}), \Psi_k\rangle$ instead of $\langle Q_k\,\mu^N_{k-1}, \Psi_k\rangle$ as the denominator in equation (27), we see that, for the local error to be small, the empirical mean of the likelihood function over the predicted particle system should be large enough. This theoretical argument is also supported by numerical evidence, in cases where the likelihood function is localized in a small region of the state space (which typically arises when measurements are accurate).
Indeed, such a region can be so small that it does not contain enough points of the predicted particle system, which automatically results in a small value of the predicted empirical mean of the likelihood function. This phenomenon is called degeneracy of particle weights and is a known cause of divergence of particle filters. To solve this degeneracy problem, one idea is to add a regularization step to the algorithm: the resulting filters, called regularized particle filters (RPF), are studied in the next section. Another idea is to control the predicted empirical mean
$$\langle S^N(Q_k\,\mu^N_{k-1}), \Psi_k\rangle = \frac{1}{N} \sum_{i=1}^N \Psi_k(\xi^i_{k|k-1}),$$
by using an adaptive number of particles. To guarantee a local error of order $\delta_k$, independently of any lower bound assumption on the likelihood function, we choose a random number of particles
(28) $$N_k := \inf\Big\{\, N \ge 1 \; : \; \delta^2_k \sum_{i=1}^N \Psi_k(\xi^i_{k|k-1}) \ge \sup_{x} \Psi_k(x) \,\Big\},$$
which automatically fits the difficult case of localized likelihood functions: the resulting filter, called the sequential particle filter (SPF), is studied below.
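The stopping rule (28) can be sketched as follows; the Gaussian predicted distribution and the two likelihood shapes are our own illustrative choices, not from the paper. The point is that a localized likelihood automatically triggers a much larger sample size:

```python
import numpy as np

rng = np.random.default_rng(4)

def adaptive_sample_size(draw, psi, sup_psi, delta):
    """The stopping rule (28): keep simulating predicted particles until
    delta^2 * sum_i Psi(xi^i) >= sup_x Psi(x); return the resulting N_k."""
    total, n = 0.0, 0
    while delta**2 * total < sup_psi:
        total += psi(draw())
        n += 1
    return n

delta = 0.1
draw = rng.standard_normal           # predicted particles ~ N(0,1) (toy choice)

# Broad likelihood: most particles carry significant weight -> small N_k.
broad = lambda x: float(np.exp(-0.5 * x**2 / 4.0))
# Localized likelihood (accurate measurement in the tail) -> large N_k.
localized = lambda x: float(np.exp(-50.0 * (x - 2.0) ** 2))

n_broad = adaptive_sample_size(draw, broad, 1.0, delta)
n_localized = adaptive_sample_size(draw, localized, 1.0, delta)

assert n_broad >= int(1.0 / delta**2)   # always at least sup Psi / (delta^2 max weight)
assert n_localized > 10 * n_broad       # the rule automatically enlarges the sample
```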

5.3. Sequential particle filter

Let $\mu^{N_n}_n$ denote the sequential particle filter (SPF) approximation of $\mu_n$. Initially $\mu^{N_0}_0 = \mu_0$, and the transition from $\mu^{N_{n-1}}_{n-1}$ to $\mu^{N_n}_n$ is described by the following diagram
$$\mu^{N_{n-1}}_{n-1} \;\xrightarrow{\ \text{sequential sampled prediction}\ }\; \mu^{N_n}_{n|n-1} = S^{N_n}(Q_n\,\mu^{N_{n-1}}_{n-1}) \;\xrightarrow{\ \text{correction}\ }\; \mu^{N_n}_n = \Psi_n \star \mu^{N_n}_{n|n-1}.$$
In practice, the particle approximation
$$\mu^{N_n}_{n|n-1} = \frac{1}{N_n} \sum_{i=1}^{N_n} \delta_{\xi^i_{n|n-1}}$$
is completely characterized by the particle system $(\xi^1_{n|n-1}, \dots, \xi^{N_n}_{n|n-1})$, and the transition from $(\xi^1_{n|n-1}, \dots, \xi^{N_n}_{n|n-1})$ to $(\xi^1_{n+1|n}, \dots, \xi^{N_{n+1}}_{n+1|n})$ consists of the following steps.

(i) Correction: for all $i = 1, \dots, N_n$, compute the weight
$$\omega^i_n = \frac{1}{c_n}\, \Psi_n(\xi^i_{n|n-1}), \qquad\text{with the normalization constant}\qquad c_n = \sum_{i=1}^{N_n} \Psi_n(\xi^i_{n|n-1}),$$
and set
$$\mu^{N_n}_n = \Psi_n \star \mu^{N_n}_{n|n-1} = \sum_{i=1}^{N_n} \omega^i_n\, \delta_{\xi^i_{n|n-1}}.$$

(ii) Sequential sampled prediction: independently for all $i = 1, \dots, N_{n+1}$, generate a r.v. $\xi^i_{n+1|n} \sim Q_{n+1}\,\mu^{N_n}_n$, where the random number $N_{n+1}$ of particles is defined by the stopping time
$$N_{n+1} = \inf\Big\{\, N \ge 1 \; : \; \delta^2_{n+1} \sum_{i=1}^N \Psi_{n+1}(\xi^i_{n+1|n}) \ge \sup_{x} \Psi_{n+1}(x) \,\Big\},$$
and set
$$\mu^{N_{n+1}}_{n+1|n} = S^{N_{n+1}}(Q_{n+1}\,\mu^{N_n}_n) = \frac{1}{N_{n+1}} \sum_{i=1}^{N_{n+1}} \delta_{\xi^i_{n+1|n}}.$$

Exactly as for the IPF, the resampling step (ii) can easily be implemented: it only requires generating random variables according to a weighted discrete probability distribution. Notice that a.s.
$$\frac{1}{N} \sum_{i=1}^N \Psi_n(\xi^i_{n|n-1}) \longrightarrow \langle Q_n\,\mu^{N_{n-1}}_{n-1}, \Psi_n\rangle, \qquad\text{as } N \to \infty,$$
and if the nonnegative kernel $R_n$ is mixing, then $\langle Q_n\,\mu^{N_{n-1}}_{n-1}, \Psi_n\rangle > 0$ in view of Remark 5.7, hence the stopping time $N_n$ is a.s. finite. Moreover, the normalization constant $c_n$ is positive, since
$$c_n = \sum_{i=1}^{N_n} \Psi_n(\xi^i_{n|n-1}) \ge \frac{1}{\delta^2_n}\, \sup_{x} \Psi_n(x) > 0,$$
hence $\Psi_n \star \mu^{N_n}_{n|n-1}$ is a well defined probability distribution. Notice that the SPF satisfies (16). The following theorem shows that using a random number of particles allows one to control the local error independently of any lower bound assumption on the likelihood functions.
The counterpart is that the computational time of the resulting algorithm is random, and that the expected number of particles depends on the integrated lower bounds of the likelihood functions.

Theorem 5.14. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing, and the random number $N_k$ of particles is defined as in (28), then the following inequality holds
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu^{N_n}_n, \phi\rangle| \mid Y_{1:n}\,] \le \delta^W_n + \frac{2\,\delta^W_{n-1}}{\varepsilon^2_n} + \frac{4}{\log 3} \sum_{k=1}^{n-2} \tau_{n:k+3}\, \frac{\delta^W_k}{\varepsilon^2_{k+2}\,\varepsilon^2_{k+1}},$$
where for any $k \ge 1$
$$\delta^W_k \le 2\,\delta_k\,\sqrt{1 + \delta^2_k}, \qquad\text{and}\qquad \frac{\rho_k}{\delta^2_k} \le \mathbb{E}[N_k \mid Y_{1:k}] \le \frac{\rho_k}{\delta^2_k}\,(1 + \delta^2_k).$$

Corollary 5.15. If for any $k \ge 1$ the nonnegative kernel $R_k$ is mixing with $\varepsilon_k \ge \varepsilon > 0$, and the random number $N_k$ of particles is defined as in (28) with $\delta_k \le \delta$, then convergence holds uniformly in time, i.e.
$$\sup_{\|\phi\| \le 1} \mathbb{E}[\, |\langle \mu_n - \mu^{N_n}_n, \phi\rangle| \mid Y_{1:n}\,] \le \Big( 1 + \frac{2}{\varepsilon^2} + \frac{4}{\varepsilon^6 \log 3} \Big)\, 2\,\delta\,\sqrt{1 + \delta^2}.$$

Proof of Theorem 5.14. It is sufficient to bound the local error $\delta^W_k$ in the weak sense, and to apply Theorem 4.8. Since $R_k$ is mixing, $\langle Q_k\,\mu^{N_{k-1}}_{k-1}, \Psi_k\rangle > 0$ in view of Remark 5.7. Using estimate (26) with $\Lambda = \Psi_k$, $\mu = Q_k\,\mu^{N_{k-1}}_{k-1}$ and $\mathcal{F} = \sigma(Y_{1:k}, \mu^{N_{k-1}}_{k-1})$ yields
$$\mathbb{E}[\, |\langle \mu^{N_k}_k - R_k(\mu^{N_{k-1}}_{k-1}), \phi\rangle| \mid Y_{1:k}, \mu^{N_{k-1}}_{k-1}\,] = \mathbb{E}[\, |\langle \Psi_k \star S^{N_k}(Q_k\,\mu^{N_{k-1}}_{k-1}) - \Psi_k \star (Q_k\,\mu^{N_{k-1}}_{k-1}), \phi\rangle| \mid Y_{1:k}, \mu^{N_{k-1}}_{k-1}\,] \le 2\,\delta_k\,\sqrt{1 + \delta^2_k}\; \|\phi\|.$$

In this section, we have proved that the IPF and its sequential variant converge uniformly in time under the mixing assumption. This theoretical argument is also supported by numerical evidence, e.g. in extreme cases where the hidden state sequence satisfies a noise free state equation. Indeed, because multiple copies are produced after each resampling step, the diversity of the particle system can only decrease along the time in such cases, and the particle system ultimately concentrates on a few points, if not a single point, of the state space. This phenomenon is called degeneracy of particle locations and is another known cause of divergence of particle filters.
To solve this degeneracy problem, and also the problem of degeneracy of particle weights already mentioned, we have proposed in Musso and Oudjane [23] to add a regularization step in the algorithm, so as to guarantee the diversity of the particle system along the time: the resulting filters, called regularized particle filters (RPF), are studied in the next section under the same mixing assumption.

6. Uniform convergence of regularized particle filters

The main idea consists in replacing the discrete approximation $\mu^N_n$ with an absolutely continuous approximation, with the effect that in the resampling step $N$ random variables are generated according to an absolutely continuous distribution, hence producing a new particle system with $N$ different particle locations. In doing this, we implicitly assume that the hidden state sequence takes values in a Euclidean space $E = \mathbb{R}^m$, and that the optimal filter $\mu_n$ has a smooth density w.r.t. the Lebesgue measure, which is the case in most applications. From the theoretical point of view, this additional assumption makes it possible to obtain strong approximations of the optimal filter, in total variation or in the $L^p$ sense for any $p \ge 1$. In practice, this provides approximate filters which are much more stable along the time than the IPF.

An absolutely continuous approximation is obtained by adding a regularization step in the algorithm, using a kernel method, classical in density estimation. If the regularization occurs before the correction by the likelihood function, we obtain the pre-regularized particle filter, the numerical analysis of which has been done in Le Gland, Musso and Oudjane [21], in the general case without the mixing assumption. An improved version of the pre-regularized particle filter, called the kernel filter (KF), is proposed in Hürzeler and Künsch [18]. If the regularization occurs after the correction by the likelihood function, we obtain the post-regularized particle filter, which has been proposed in Musso and Oudjane [23] and in Oudjane and Musso [27], and compared with the IPF in some classical tracking problems, such as bearings only tracking, or range and bearing tracking with multiple dynamical models. The local rejection regularized particle filter (L2RPF), which generalizes both the KF and the post-RPF, is introduced in Musso, Oudjane and Le Gland [24], where further implementation details and applications to tracking problems can be found.

6.1. Notations and preliminary results

The following notations and definitions will be used below. Throughout the end of this paper, $E = \mathbb{R}^m$. For any $\mu \in P(E)$, define
$$I(\mu) := \Big[ \int_E |x|^{m+1}\, \mu(dx) \Big]^{1/(m+1)},$$
and if $\mu$ is absolutely continuous w.r.t. the Lebesgue measure on $E$, with density $f = \dfrac{d\mu}{dx}$, define
$$I(f) = I(\mu) = \Big[ \int_E |x|^{m+1}\, f(x)\, dx \Big]^{1/(m+1)} \qquad\text{and}\qquad J(f) = J\Big(\frac{d\mu}{dx}\Big) := \int_E \sqrt{f(x)}\, dx.$$
From the multidimensional Carlson inequality, see Holmström and Klemelä [16, Lemma 7], there exists a universal constant $A_m$ such that, for any absolutely continuous $\mu \in P(E)$
(29) $$J\Big(\frac{d\mu}{dx}\Big) \le A_m\, \big(I(\mu)\big)^{m/2},$$
hence $J(d\mu/dx)$ is finite if $I(\mu)$ is finite. Let $W^{2,1}$ denote the Sobolev space of functions defined on $E$ which, together with their derivatives up to order two, are integrable w.r.t. the Lebesgue measure on $E$.
Let $\|\cdot\|_{2,1}$ and $|\cdot|_{2,1}$ denote the corresponding norm and semi-norm, i.e.
$$\|u\|_{2,1} := \sum_{0 \le |i| \le 2} \int_E |D^i u(x)|\, dx \qquad\text{and}\qquad |u|_{2,1} := \sum_{|i| = 2} \int_E |D^i u(x)|\, dx,$$
respectively, where for any multiindex $i = (i_1, \dots, i_m)$ of order $|i| = i_1 + \cdots + i_m$
$$D^i = \frac{\partial^{i_1 + \cdots + i_m}}{\partial x_1^{i_1} \cdots \partial x_m^{i_m}}.$$
Let the regularization kernel $K$ be a symmetric probability density on $E$, such that
$$\int_E K(x)\, dx = 1, \qquad \int_E x\, K(x)\, dx = 0 \qquad\text{and}\qquad \alpha := \frac{1}{2} \int_E |x|^2\, K(x)\, dx < \infty.$$
Assume also that the regularization kernel $K$ is square integrable, i.e.
$$\beta := \Big[ \int_E K^2(x)\, dx \Big]^{1/2} < \infty,$$
and that the symmetric probability density $L = \dfrac{K^2}{\beta^2}$ satisfies
$$\gamma := I(L) = \Big[ \int_E |x|^{m+1}\, L(x)\, dx \Big]^{1/(m+1)} < \infty.$$
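As an illustration of the regularization step that these definitions prepare (a sketch under our own assumptions: $m = 1$, Gaussian regularization kernel $K$, fixed bandwidth $h$, none of which is prescribed at this point of the paper), resampling from the kernel-smoothed measure restores the diversity of particle locations:

```python
import numpy as np

rng = np.random.default_rng(5)

def regularized_resample(particles, weights, h):
    """Post-regularization sketch: instead of resampling from the discrete
    weighted measure, resample from its kernel-smoothed version, i.e. draw
    a particle index according to the weights and add h times a draw from
    the regularization kernel K. Here K is the standard Gaussian density
    (symmetric, zero mean, alpha = 1/2, beta and gamma finite)."""
    N = len(particles)
    idx = rng.choice(N, size=N, p=weights)
    return particles[idx] + h * rng.standard_normal(N)   # N distinct locations a.s.

# A degenerate discrete approximation: almost all mass on one location.
particles = np.array([0.0] * 98 + [1.0, 2.0])
weights = np.full(100, 1.0 / 100)

new = regularized_resample(particles, weights, h=0.1)
assert len(np.unique(new)) == len(new)               # diversity is restored
assert abs(new.mean() - particles.mean()) < 0.2      # the distribution is preserved
```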