Robust Non-linear Smoother for State-space Models


Gabriel Agamennoni and Eduardo M. Nebot

Abstract—This paper presents a robust, non-linear smoothing algorithm for state-space models driven by noise and external inputs. The algorithm is extremely robust to outliers and handles missing data and state-dependent noise. Its implementation is straightforward, as it consists of two main components: (a) the Rauch-Tung-Striebel recursions (a.k.a. the Kalman smoother); and (b) a back-tracking line search strategy. Since the algorithm preserves the underlying structure of the problem, its computational load is linear in the number of data. Global convergence to a local optimum is guaranteed under mild assumptions.

I. INTRODUCTION

A. State-space models and optimal estimation

State Space Models (SSMs) are ubiquitous in many fields of engineering and applied science. Their popularity stems from their flexibility and expressive power. Formally, an SSM is a mathematical model of a dynamic system with inputs and outputs. It represents the system's dynamics and its input-output characteristics in terms of latent states. Unlike inputs and outputs, the states are hidden, i.e. they cannot be measured directly (e.g. with a sensor). Optimal estimation is the problem of determining the states given pairs of input-output data. Due to random fluctuations present in the input-output and state processes, this is not possible in general; the state estimates are inevitably afflicted with uncertainty.

The random processes are often specified by conditional probability distributions. The most common distribution is the Gaussian, justified by the central limit theorem. It is favored for its convenient analytical properties, although it is seldom motivated by the nature of the actual data. Because it appears relatively frequently, there is an unfortunate tendency to invoke the Gaussian in situations where it is not applicable. In these cases there is a significant risk of drawing incorrect conclusions about the system.

The Kalman Filter (KF) [1] is the precursor of many modern-day estimators [2]. It is optimal (in the least-squares sense) for SSMs with linear dynamics and input-output relationships and Gaussian noise [3]. Unfortunately, the KF breaks down in the presence of non-Gaussian noise, since the sum-of-squares criterion is extremely sensitive to spurious observations [4].¹ In addition, the KF does not apply to non-linear systems, which are the most common and most relevant in practical applications.

The authors are with the Australian Centre for Field Robotics at the University of Sydney, New South Wales, Australia.

¹ Since the conditional mean is an unbounded function of the residual, when a large discrepancy arises between the prior and the observation, the posterior distribution becomes an unrealistic compromise between the two.

B. Robustness to outliers

Outliers are common non-Gaussian phenomena [5], [6]. Intuitively, they are observations that do not agree with the rest of the data. Even though they may occur by chance in most distributions,² outliers often stem from processes that are either unknown or are deliberately left out of the model, e.g. environmental disturbances, sensor failures or factors that are tedious or impractical to model. Systems that rely on high-quality sensor data (tracking and control systems, autonomous vehicles, etc.) may be sensitive to outliers. In some cases, they may fail catastrophically [7]-[9] to the point that a full recovery is impossible.
Hence the importance of estimators robust to outliers.

C. Related work

In this paper we address a general class of robust estimation problems. Namely, we are concerned with non-linear systems with heavy-tailed and potentially heteroscedastic (i.e. state-dependent) noise processes. To the best of the authors' knowledge, no one has addressed all three aspects simultaneously.³ The following work is related to ours:
a. Aravkin et al. [11] formulated a robust smoother for non-linear systems as an optimization problem, although the system has homoscedastic noise;
b. Piché et al. [12] presented an outlier-robust filter/smoother for non-linear systems in the context of Assumed Density Filtering (ADF), again with state-independent noise;
c. Särkkä and Nummenmaa [13] introduced a filter that tracks states and noise, albeit for linear systems with time-varying rather than state-dependent noise;
d. Agamennoni et al. [14] developed robust filters and smoothers for simultaneously estimating states and noise, though again for linear, time-varying systems.

D. Major contributions

The major contributions in this paper are:
a. A non-trivial generalization of the model introduced by Aravkin et al. [11] to the heteroscedastic case;
b. A computationally efficient and provably convergent algorithm for approximately solving the smoothing problem; unlike [11], our approach does not require approximating the Hessian matrix; and
c. A parametrization of the problem carefully designed to make the equations readily interpretable and assert the strong connection to the well-known Rauch-Tung-Striebel (RTS) recursions.

² For instance, the Gaussian places over 99% of its probability mass within the interval μ ± 3σ. An outlier 5σ away from μ has less than one in a million chances of occurring. Although the possibility exists, it is unrealistically small.

³ The Extended Kalman Filter (EKF) [10] could be made robust with linear techniques. However, because it relies on a first-order Taylor expansion, it would only be suitable for mildly non-linear systems.

The potential of our approach is demonstrated via experiments on synthetic data from a highly non-linear system. All of the code used to generate the results in this paper is available from the authors' web page.

E. Outline of this paper

Section II defines the smoothing problem and its robust version. Section III proposes and develops an approximation that renders the robust non-linear smoothing problem tractable. In section IV this approximation takes the form of an iterative optimization algorithm. Experiments on synthetic data validate our approach in section V. Finally, section VI concludes and outlines directions for future research.

II. THE SMOOTHING PROBLEM

A. Definitions

Let X and Z be, in that order, the state and measurement sequences. Namely,

  X = (x_1, ..., x_n) ∈ R^d × ... × R^d    (1a)
  Z = (z_1, ..., z_n) ∈ R^{d_1} × ... × R^{d_n}    (1b)

are real, finite sequences with n terms each. Let g_k and h_k be, respectively, the one-step prediction function and the observation function for the k-th term, and let Q_k and R_k be their corresponding variance-covariance matrices. Specifically,

  g_k : R^d → R^d    (2a)
  h_k : R^d → R^{d_k}    (2b)
  Q_k : R^d → P^{d×d}    (2c)
  R_k : R^d → P^{d_k×d_k}    (2d)

for k = 1, ..., n, with P^{d×d} denoting the cone of d×d symmetric positive-definite matrices. Last of all, let {u_k} and {v_k} be white Gaussian noise processes that represent the random fluctuations driving the prediction and observation models. That is,

  u_k ~ N(0, I)    (3a)
  v_k ~ N(0, I)    (3b)

for all k, where N(μ, Σ) denotes a multi-variate Gaussian distribution with mean vector μ and variance-covariance matrix Σ.⁴

⁴ Throughout this manuscript we will use the same symbols (e.g. u_k and v_k) to denote both the random variable and its outcome, the random variate. Although this is a slight abuse of notation, it is for the sake of clarity and should cause no confusion.

B. Non-linear smoothing

Assume that the state and observation sequences are generated via the following processes:

  x_k = g_k(x_{k-1}) + Q_k(x_{k-1})^{1/2} u_k    (4a)
  z_k = h_k(x_k) + R_k(x_k)^{1/2} v_k    (4b)

for k = 1, ..., n and x_0 ~ N(μ_0, Σ_0).⁵ Or, equivalently,

  x_k | x_{k-1} ~ N(g_k(x_{k-1}), Q_k(x_{k-1}))    (5a)
  z_k | x_k ~ N(h_k(x_k), R_k(x_k))    (5b)

Then, the probability density function of the joint distribution over state and observation sequences is

  p(X, Z) = p(x_0) ∏_{k=1}^n p(x_k | x_{k-1}) p(z_k | x_k)

Notice that, although states and observations are conditionally Gaussian, they are not jointly Gaussian in the general case.⁶ Given Z, the non-linear smoothing problem consists of finding p(X | Z). These distributions are non-Gaussian and generally intractable, making the problem an extremely challenging one with no closed-form solution. There is no choice but to seek an approximation.

C. Robust non-linear smoothing

We define the robust non-linear smoothing problem as the one obtained by replacing (5b) with

  z_k | x_k ~ T(h_k(x_k), R_k(x_k), s_k)    (6)

where s_k > 0 is known and T(μ, Σ, ν) stands for a multi-variate t distribution [15] with location vector μ, scale matrix Σ and ν degrees of freedom.

D. The t distribution

The t is a sub-exponential distribution, meaning that its tails fall off to zero at a less-than-exponential rate. Compared to the Gaussian, which is super-exponential, the t has much heavier tails. Their rate of decay is determined by the number of degrees of freedom: in the limit ν → ∞ the tails flatten and the t reduces to a Gaussian. For smaller and smaller ν the probability mass spreads more and more evenly across observation space and further away from the mode, assigning outliers a non-negligible probability.
Placing a non-negligible probability on outliers is by no means a drawback; it simply reflects reality. The Gaussian concentrates most of its probability mass within a small region around the mode, essentially ruling out the possibility that any observation is ever wrong. The t makes no such mistake. In (6) we acknowledge the fact that, occasionally, observations may be off. By imparting this information directly into our model we enable it to deal with outliers natively, within the filtering/smoothing framework. Consequently, there is no need for us to explicitly pre-process outliers (e.g. with a rejection threshold) or treat them separately, because our model is now capable of explaining them.

⁵ The symbol A^{1/2} denotes the lower-triangular Cholesky factor of the symmetric, positive-definite matrix A.

⁶ The only case where (X, Z) are jointly Gaussian is if g_k and h_k are affine and Q_k and R_k are constant for all k = 1, ..., n. In any other case (X, Z) are non-Gaussian.
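The following is a minimal sketch (not part of the paper's code) illustrating the point above for a single scalar observation: compared with a Gaussian of the same location and scale, the Student-t observation model in (6) leaves orders of magnitude more probability on large deviations. The location, scale and degrees of freedom below are hypothetical values chosen only for illustration.

```python
# Compare the tails of a Gaussian and a Student-t observation model (sketch only).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, s = 0.0, 1.0, 4.0            # location, scale, degrees of freedom (illustrative)
n = 1_000_000

z_gauss = rng.normal(mu, sigma, n)            # Gaussian observations
z_t = mu + sigma * rng.standard_t(s, n)       # Student-t observations

# Fraction of observations lying more than 5 scale units away from the location:
for name, z in [("Gaussian", z_gauss), ("Student-t", z_t)]:
    p = np.mean(np.abs(z - mu) > 5.0 * sigma)
    print(f"{name:10s} fraction beyond 5 sigma: {p:.2e}")

# The Gaussian essentially never produces such a deviation (its true tail
# probability is below one in a million), whereas the t with few degrees of
# freedom produces thousands of them per million draws.
```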

III. THE APPROXIMATE SMOOTHING PROBLEM

A. The Gaussian-Gamma decomposition

The t may be regarded as a continuous mixture of infinitely many Gaussian distributions with identical mean and proportional variance-covariance parameters [16]. Namely, (6) is equivalent to

  z_k | x_k, w_k ~ N(h_k(x_k), R_k(x_k)/w_k)    (7a)
  w_k ~ G(s_k/2, s_k/2)    (7b)

where G(α, β) denotes a Gamma distribution with shape and rate parameters α and β. The variable w_k is an ancillary variable that renders the observation z_k conditionally Gaussian. When marginalized out, it yields the t distribution in (6). We will call w_k the weight of the k-th observation.

Applying the Gaussian-Gamma decomposition in (7) leads to a joint probability density function of the form

  p(X, Z, W) = p(x_0) ∏_{k=1}^n p(x_k | x_{k-1}) p(z_k | x_k, w_k) p(w_k)

and to the following conditional probability density function:

  p(X | Z) = ∫ p(X, W | Z) dμ(W)

where W = (w_1, ..., w_n) is the sequence of weights and μ is the product measure. The weight sequence that originates from this decomposition simplifies the smoothing problem considerably. Rather than approximating p(X | Z) directly, we first find an approximation to p(X, W | Z) and then marginalize out W to obtain our final result.

B. The Kullback-Leibler divergence

Let q be an approximation to p(X, W | Z). The Kullback-Leibler (KL) divergence [17] from q to p is defined as

  KL[q ‖ p] = ∫∫ q(X, W) ln [ q(X, W) / p(X, W | Z) ] dμ(X) dμ(W)

The KL quantifies the distance between q and p, i.e. the error in the approximation of the true posterior. It is non-negative for all q and vanishes if and only if q = p. Our goal now is to find a mathematical form for the approximating distribution such that minimizing the KL divergence is analytically tractable. This involves a trade-off. On one hand, q should be flexible enough so that KL[q ‖ p] may be brought close to zero. On the other hand, q should be simple enough so that we can do this tractably. Resolving the inherent tension between flexibility and tractability is the key to finding a good approximation.

C. An approximate tractable posterior

For reasons that will become clear shortly, we select an approximation q with the following mathematical form:

  q(X, W) = ∏_{k=1}^n δ(x_k, x̂_k) ∏_{k=1}^n γ_k(w_k)    (8)

where δ is Kronecker's delta function and γ_k is defined as

  γ_k(w_k) = [β_k^{α_k} / Γ(α_k)] w_k^{α_k - 1} exp(-β_k w_k)    (9)

for w_k > 0, and γ_k(w_k) = 0 if w_k ≤ 0.⁷ To obtain the best approximation in the KL sense amongst those of the form in (8), we solve

  min_{X̂, θ} KL[q ‖ p]    (10)

where X̂ = (x̂_1, ..., x̂_n) and θ = (α_1, β_1, ..., α_n, β_n). The solution to this problem is a sequence X̂ of states which, owing to the structure we chose for q in (8), is the approximation to p(X | Z) that we seek.

D. An approximate estimation algorithm

In order to solve (10) we apply a coordinate-wise descent algorithm. Starting from an initial guess of X̂ and θ, we cycle between the following steps: a. Minimize KL[q ‖ p] w.r.t. X̂ keeping θ fixed; and b. Minimize KL[q ‖ p] w.r.t. θ keeping X̂ fixed; until they converge to a local minimum. Convergence is guaranteed because the divergence is non-negative and decreases or remains constant after each cycle.

IV. AN APPROXIMATE SMOOTHING ALGORITHM

A. The approximate posterior over weights

Let us begin with step b. of the algorithm (we will deal with step a. soon). Minimizing KL[q ‖ p] with respect to θ while keeping X̂ fixed yields⁸

  α_k = (s_k + d_k)/2    (11a)
  β_k = [s_k + (z_k - h_k(x̂_k))ᵀ R_k(x̂_k)⁻¹ (z_k - h_k(x̂_k))]/2    (11b)

for k = 1, ..., n.
Hence the mean of the approximate posterior distribution over w_k is

  ω_k = ∫ w_k γ_k(w_k) dw_k = α_k / β_k    (12)

Note that the weight decreases as the normalized error increases, i.e. observations that lie far away from their forecast are down-weighted.

⁷ This is identical to the probability density function of a Gamma-distributed variable with shape and rate parameters α_k and β_k, respectively.

⁸ Although straightforward, the derivation of this result is fairly lengthy. The reader is encouraged to verify (11a,b) by replacing (8) in the definition of the KL divergence, evaluating the expectations with respect to X, dissecting out the terms involving w_k, comparing them with (9) and matching terms.
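As a minimal, self-contained sketch (not the authors' implementation), the weight update in (11a,b) and (12) can be written as below. The observation covariance, degrees of freedom and residual values are hypothetical; the point is simply that discrepant observations receive small weights.

```python
# Compute the approximate posterior mean weight of one observation (sketch only).
import numpy as np

def observation_weight(z, h_x, R, s):
    """Return omega_k = alpha_k / beta_k for one observation (eqs. 11a,b and 12)."""
    r = z - h_x                               # residual z_k - h_k(x_k)
    d = z.size                                # observation dimension d_k
    maha = float(r @ np.linalg.solve(R, r))   # normalized squared error
    alpha = 0.5 * (s + d)                     # eq. (11a)
    beta = 0.5 * (s + maha)                   # eq. (11b)
    return alpha / beta                       # posterior mean weight, eq. (12)

R = np.eye(1)                                 # unit observation variance (illustrative)
s = 4.0                                       # degrees of freedom (illustrative)
for res in [0.1, 1.0, 3.0, 10.0]:
    w = observation_weight(np.array([res]), np.array([0.0]), R, s)
    print(f"residual {res:5.1f}  ->  weight {w:.3f}")
```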

B. The approximate posterior over states

Let us now move on to step a. of the algorithm. Minimizing KL[q ‖ p] with respect to X̂ while holding θ constant equates to solving the following minimization problem:

  min_X b(X)    (13)

The objective b is a quadratic-composite function⁹

  b(X) = Σ_{k=1}^n t_k(x_{k-1}, x_k) + Σ_{k=1}^n u_k(x_{k-1}, x_k)ᵀ u_k(x_{k-1}, x_k) + Σ_{k=1}^n ω_k v_k(x_k)ᵀ v_k(x_k) + ...    (14)

where ... denotes additive constants (i.e. terms independent of X) and the terms in the summations are given by

  t_k(x_{k-1}, x_k) = ln det Q_k(x_{k-1}) + ln det R_k(x_k)
  u_k(x_{k-1}, x_k) = Q_k(x_{k-1})^{-1/2} (x_k - g_k(x_{k-1}))
  v_k(x_k) = R_k(x_k)^{-1/2} (z_k - h_k(x_k))

for k = 1, ..., n.

C. The sequential quadratic program

Due to the quadratic-composite structure of (14), our minimization problem (13) lends itself to a special formulation known as a Sequential Quadratic Program (SQP) [18]. An SQP is an iterative method for breaking down the full problem and solving it as a sequence of sub-problems. Each SQP iteration computes a sequence Y = (y_1, ..., y_n) of search directions together with a step size h, and updates the state sequence X to X + hY, i.e. it adds an increment of hY. The sequence Y of search directions results from linearizing the objective function b around the terms t_k, u_k and v_k for k = 1, ..., n and evaluating them at the current state sequence X. Differentiation plus a bit of algebra reveals that Y is the solution to

  min_{y_1, ..., y_n} Σ_{k=1}^n f_k(y_{k-1}, y_k)    (15)

where f_k is defined in (26) and (27a) to (27d).¹⁰ If we take a close look at (26) we realize that the mathematical form of (15) is almost the same as that of a linear, time-varying Kalman smoother. Thus we can easily solve (15) by running a slightly modified form of the Rauch-Tung-Striebel (RTS) recursions [21]. The additional terms q_k and r_k, which are caused by state-dependent noise, are readily accounted for by performing an extra correction step with fictitious zero-valued observations.

⁹ In other words, b, as a function of t_k, u_k and v_k, is quadratic even though it is not necessarily quadratic in x_k.

¹⁰ Note that (27a) to (27d) require evaluating derivatives of the Cholesky factor. The factor and its derivatives can be evaluated simultaneously [19] with the same order of complexity as the original Cholesky algorithm [20].

The step size h is chosen by performing a line search along the direction Y. In order to guarantee global convergence,¹¹ we only consider values of h that satisfy Armijo's rule [18]. Given a constant rejection threshold τ ∈ (0, 1), we search for the largest h such that

  b(X + hY) ≤ b(X) + τ h ⟨∇b(X), Y⟩    (16)

where ∇ is the gradient operator and ⟨·,·⟩ is the inner product. (The partial derivatives of b necessary to form the gradient may be found in appendix A.) Let H be the set of all h, h > 0, that satisfy (16) for a given Y. To find a suitable step size within this set, we apply a back-tracking line search strategy, which is fast yet effective. Specifically, we compute h as

  h = max { λ^{j-1} : λ^{j-1} ∈ H, j = 1, 2, ... }    (17)

where λ ∈ (0, 1) is a constant step reduction factor. Starting from h = 1, we check to see whether h ∈ H. If so, we accept h as our step size; otherwise, we reduce h by a factor λ.¹²

The SQP terminates upon convergence. At each iteration, we check the condition ⟨∇b(X), Y⟩ ≥ -nε, where ε > 0 is a constant termination tolerance. If this is true, we have arrived at a local optimum.

D. Implementation and pseudo-code

Algorithm 1 provides an implementation in pseudo-code of our robust non-linear smoother. Note that the weights (line b.) are re-evaluated once per iteration, i.e.
we perform only one SQP iteration per cycle of the coordinate-wise descent algorithm (sub-section III-D). This means we are not exactly minimizing KL[q ‖ p] with respect to X̂, but reducing it. Still, since Y is a descent direction, (16) implies that b(X + hY) is strictly smaller than b(X), and consequently the divergence decreases after each cycle.

We have implemented algorithm 1 in the MatLab language. The source code is available on-line and includes a test script. Interested readers may download the files from [22] into their working directories, compile the .c source code into .mex binaries with MatLab's built-in compiler and type TestRNLS() in the command prompt. Documentation and details of our implementation may be found in the help messages, by typing help RNLS and help RNLS.Estimate, as well as in the comments provided.

V. EXPERIMENTAL VALIDATION

A. The LX systems

To test our algorithm we generated synthetic data from a high-dimensional and strongly non-linear system. The model introduced by Lorenz and Emanuel [23] simulates atmospheric phenomena at equally-spaced sites along a circle of latitude.

¹¹ In the context of non-linear optimization, global convergence does not mean that the method converges to the global optimum, but that convergence to a local optimum is guaranteed no matter the starting point.

¹² Provided that b is a smooth function, the set H is non-empty for all h > 0. Hence the back-tracking line search is always well defined and terminates in finitely many iterations.
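The following is a minimal sketch of the back-tracking line search in (16)-(17). It is not the authors' MatLab code: the quadratic objective, the steepest-descent direction and the values of τ and λ below are stand-ins used only to illustrate Armijo's rule.

```python
# Back-tracking line search satisfying Armijo's rule (sketch only).
import numpy as np

def backtracking(b, grad_b, X, Y, tau=1e-4, lam=0.5, max_iter=50):
    """Largest step h = lam**(j-1), j = 1, 2, ..., satisfying Armijo's rule (16)."""
    slope = float(np.dot(grad_b(X), Y))   # <grad b(X), Y>; negative for a descent direction
    bX = b(X)
    h = 1.0
    for _ in range(max_iter):
        if b(X + h * Y) <= bX + tau * h * slope:
            return h                      # Armijo's rule satisfied: accept h
        h *= lam                          # otherwise reduce h by the factor lambda
    return h

# Toy quadratic objective standing in for the smoothing objective b(X):
b = lambda X: 0.5 * float(np.sum(X ** 2))
grad_b = lambda X: X
X = np.array([4.0, -2.0])
Y = -grad_b(X)                            # steepest descent as a stand-in for the SQP direction
print("accepted step size h =", backtracking(b, grad_b, X, Y))
```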

  f_k(y_{k-1}, y_k) = (y_k - G_k(x_{k-1}) y_{k-1} - g_k(x_{k-1}) + x_k)ᵀ Q_k(x_{k-1})⁻¹ (y_k - G_k(x_{k-1}) y_{k-1} - g_k(x_{k-1}) + x_k) + q_k(x_{k-1})ᵀ y_{k-1} + ω_k (z_k - H_k(x_k) y_k - h_k(x_k))ᵀ R_k(x_k)⁻¹ (z_k - H_k(x_k) y_k - h_k(x_k)) + r_k(x_k)ᵀ y_k + ...    (26)

  G_k(x_{k-1}, x_k) = [ ∂_i g_k(x_{k-1}) + ∂_i Q_k(x_{k-1})^{1/2} u_k(x_{k-1}, x_k) ]    (27a)
  H_k(x_k) = [ ∂_i h_k(x_k) + ∂_i R_k(x_k)^{1/2} v_k(x_k) ]    (27b)
  q_k(x_{k-1}) = [ tr( Q_k(x_{k-1})^{-1/2} ∂_i Q_k(x_{k-1})^{1/2} ) ]    (27c)
  r_k(x_k) = [ tr( R_k(x_k)^{-1/2} ∂_i R_k(x_k)^{1/2} ) ]    (27d)

Fig. 1. The k-th term of the objective function of the linearized sub-problem (15). In (27a) and (27b) only the i-th columns of matrices G_k and H_k are shown; in (27c) and (27d) only the i-th elements of vectors q_k and r_k. The operator ∂_i computes the partial derivatives with respect to the i-th state and tr is the trace operator.

In: Observation sequence Z = (z_1, ..., z_n), line search rejection threshold τ, step size reduction factor λ and termination tolerance ε.
Out: State sequence X = (x_1, ..., x_n).
a. Initialize X.
repeat
  b. Update the sequence (ω_1, ..., ω_n) of weights according to (12) and (11).
  c. Compute the sequence Y = (y_1, ..., y_n) of search directions by solving (15), where f_k is defined in (26) and (27), via modified RTS recursions.
  d. Keeping the weights fixed, find a step size h that satisfies (16) by way of (17).
  e. Update X as X + hY.
until convergence

Algorithm 1: The robust non-linear smoother.

It comprises a system of differential equations with quadratic, linear and constant terms representing advection, dissipation and external forces. It has been studied extensively in the context of data assimilation [24], [25]. The model itself is parameterized by size. A size-d system has a total of d states, which obey

  ẋ_i = (x_{i+1 mod d} - x_{i-2 mod d}) x_{i-1 mod d} - x_i + u    (18)

for i = 1, ..., d, where mod is the modulus operator and u is the external driving force.

B. Prediction and observation models

Let ẋ(t) = f(x(t)) denote the system of differential equations defined above in (18). The one-step prediction function g_k is defined via the following Euler approximation:

  g_k(x_{k-1}) = x_{k-1} + Δt f(x_{k-1})    (19a)

where Δt > 0 is the sampling period. The prediction uncertainty Q_k is defined as

  Q_k(x_{k-1}) = Δt δ I    (19b)

with δ > 0. We assume that all of the states of the system are directly observable, and hence

  h_k(x_k) = H_k x_k    (20a)

where H_k is a matrix that selects the non-missing observations from x_k and concatenates them into a vector. The observation uncertainty R_k is diagonal and is defined as

  R_k(x_k) = diag([ r(x_k^{(i)}) ])    (20b)

for all k = 1, ..., n, where x_k^{(i)} is the i-th element of the k-th state and r is the following mapping:

  r : x ↦ ρ x² + ε

where ρ > 0 is a gain parameter and ε > 0 is a regularization constant. The observation uncertainty function in (20b) represents a constant relative noise model. In other words, the absolute noise increases with the magnitude of the signal.

C. Synthetic data

Each set of data comprises a pair (X, Z) of sequences. The sequence X of states is generated by simulating the model according to (5a), with g_k and Q_k given by (19a) and (19b), respectively, for n-1 steps. The initial state is sampled close to the attractor.¹³ The sequence Z of observations is generated one observation at a time, sampling from (7b) and (7a) with h_k and R_k given by (20a) and (20b). Table I summarizes the values of the model and sampling parameters we used to generate the data for our experiments. A typical pair of state and observation sequences may be seen in figure 2.

¹³ To do so we first simulate the model for a given burn-in time.
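Below is a minimal sketch of the data-generating process described in this section: the Lorenz-96 derivative (18), the Euler prediction model (19a)-(19b), and observations drawn via (7a)-(7b) with the state-dependent covariance (20b). All numerical values (d, n, u, Δt, δ, ρ, ε and the degrees of freedom s) are illustrative placeholders, not the values used in the paper's experiments.

```python
# Generate one synthetic pair (X, Z) from the Lorenz-96 model (sketch only).
import numpy as np

def lorenz96(x, u):
    """Time derivative f(x) of a size-d Lorenz-96 system, eq. (18)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + u

def predict(x_prev, u, dt):
    """One-step Euler prediction g_k(x_{k-1}), eq. (19a)."""
    return x_prev + dt * lorenz96(x_prev, u)

def simulate(d=40, n=500, u=8.0, dt=0.005, delta=1e-3, rho=0.05, eps=1e-2, s=4.0, seed=0):
    """Return stacked state and observation sequences; all parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    x = u * np.ones(d) + 0.01 * rng.standard_normal(d)
    for _ in range(2000):                       # burn-in so that x starts close to the attractor
        x = predict(x, u, dt)
    X, Z = [], []
    for _ in range(n):
        x = predict(x, u, dt) + np.sqrt(dt * delta) * rng.standard_normal(d)  # eq. (19b)
        r_var = rho * x ** 2 + eps              # diagonal entries of R_k(x_k), eq. (20b)
        w = rng.gamma(s / 2.0, 2.0 / s)         # weight w_k ~ G(s_k/2, s_k/2), eq. (7b)
        z = x + np.sqrt(r_var / w) * rng.standard_normal(d)  # observation z_k, eq. (7a)
        X.append(x.copy())
        Z.append(z)
    return np.array(X), np.array(Z)

X, Z = simulate()
print("states:", X.shape, "observations:", Z.shape)
```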

TABLE I. SUMMARY OF MODEL AND SAMPLING PARAMETERS (continuous-time model, discrete-time model and sampling parameters)

  Name                      Symbol
  Number of states          d
  Driving force             u
  Sampling period           Δt
  Predictive uncertainty    δ
  Observation noise gain    ρ
  Regularization constant   ε
  Number of data            n

Fig. 2. Typical sequences of states (above) and observations (below) in the synthetic data. In the upper panel the states are plotted as a solid black line. In the lower panel observations are depicted as black dots.

Fig. 3. The same sequences of states and observations as those in figure 2, superimposed on the state estimates returned by the RNLS (above) and the MH sampler (below). The estimates are drawn in gray.

D. Results

The metrics we selected for evaluating performance are the Root Mean Squared (RMS), Maximum (Max), Mean Absolute (MA) and Maximum Absolute (AMax) errors, given by

  RMS = [ (1/n) Σ_{k=1}^n ‖x_k - x̂_k‖₂² ]^{1/2}
  Max = max_{k=1,...,n} ‖x_k - x̂_k‖₂
  MA = (1/n) Σ_{k=1}^n ‖x_k - x̂_k‖₁
  AMax = max_{k=1,...,n} ‖x_k - x̂_k‖₁

respectively, where x̂_k is the estimate of the k-th state in X, ‖·‖₂ denotes the Euclidean norm and ‖·‖₁ the Manhattan norm.

We generated several sets of data. For each set, we first ran a windowed median filter to obtain an initial guess of the state sequence. Then we passed this initial guess to algorithm 1. Upon convergence, we took the estimates returned by the RNLS and fed them to a block component-wise Metropolis-Hastings (MH) sampler [26]. The sampler was run for a large number of steps (including burn-in and pre-thinning), simulating samples from the posterior distribution over state trajectories.

Figure 3 shows the same state and observation sequences as those in figure 2, plus the estimates returned by the RNLS and the MH sampler. For the RNLS, the estimates are depicted as 99% confidence intervals, which are derived by combining the mode (i.e. the sequence returned by the algorithm) with the variance-covariance parameters obtained during the RTS recursions. For the MH sampler, the estimates are simply the point clouds in the sample. Table II and figure 4 summarize the performance metrics we obtained from our experiment. The last column in the table shows statistics for the difference between the errors attained by the RNLS and the MH sampler. Note the scaling of the vertical axes in the figure.

TABLE II. SUMMARY OF PERFORMANCE METRICS (mean and standard deviation of the RMS, Max, MA and AMax errors for the RNLS, the MH sampler and their difference)

Fig. 4. Box plot of the performance metrics (root mean squared, maximum, mean absolute and maximum absolute errors) achieved by the RNLS and the MH sampler. Note the scaling of the vertical axes.

E. Discussion

Our RNLS is able to track the states accurately despite the poor quality of the data. The sequence of observations in fig. 2 bears little resemblance to the underlying state sequence. Regardless, fig. 3 shows that our algorithm successfully separates signal from noise. The confidence intervals shrink and widen according to the local noise level.

The metrics for the RNLS are displayed alongside those for the MH sampler, not for comparison but as a baseline. What we want to show is that our algorithm performs almost on par with the best possible Bayesian estimator, or equivalently, that its excess risk is small. The advantage of the RNLS is its running time: with an average of 1.89 ± 0.793 seconds per sequence, it is almost 9 times faster than the MH sampler, which took an average of roughly 17 seconds. To the best of the authors' knowledge, no other algorithm in the literature simultaneously deals with non-linear systems, heavy-tailed noise and heteroscedasticity within a fully deterministic framework.
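For concreteness, the following is a minimal sketch of the four error metrics defined in section V-D, assuming the true and estimated state sequences are stacked row-wise into (n, d) arrays; the placeholder data at the end is random and only demonstrates the call.

```python
# Compute RMS, Max, MA and AMax errors between two state sequences (sketch only).
import numpy as np

def error_metrics(X, X_hat):
    err = X - X_hat
    l2 = np.linalg.norm(err, ord=2, axis=1)        # Euclidean norm of each x_k - x_hat_k
    l1 = np.linalg.norm(err, ord=1, axis=1)        # Manhattan norm of each x_k - x_hat_k
    return {
        "RMS":  float(np.sqrt(np.mean(l2 ** 2))),  # root mean squared error
        "Max":  float(l2.max()),                   # maximum error
        "MA":   float(l1.mean()),                  # mean absolute error
        "AMax": float(l1.max()),                   # maximum absolute error
    }

# Example with random placeholder data:
rng = np.random.default_rng(0)
X_true, X_est = rng.standard_normal((500, 40)), rng.standard_normal((500, 40))
print(error_metrics(X_true, X_est))
```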

VI. SUMMARY AND CONCLUSIONS

The robust non-linear smoother has a great deal of potential for complex sequential estimation problems. It handles non-linear dynamic and observation processes with state-dependent noise while remaining robust to outliers and missing data. Thus far the authors are unaware of other algorithms with these capabilities.

The core of the estimation algorithm is a coordinate-wise sequential quadratic program. Each quadratic sub-problem is solved in a numerically efficient and analytically interpretable way by a series of forward-backward recursions analogous to the well-known Rauch-Tung-Striebel smoother. The iterations are guaranteed to converge, provided the functions defining the model are smooth.

At the moment we are looking into ways of extending our approach to allow for more general distributional assumptions. For instance, elliptical distributions [27] are attractive due to their generality and their compact parametrization.¹⁴ One could imagine a model parameterized by a pair of density generator functions that determine the rotational shape of the prediction and observation densities.

In our experiments we ran a median filter in order to obtain an initial guess of the state estimates. There are many other possibilities. Studying the effects of different initialization schemes, e.g. via an unscented KF [28], an extended KF [10] or its iterative variants [29], and assessing their relative merits would be an interesting direction to pursue in the future.

¹⁴ In fact, the t distribution in (6) is elliptical with density generator g(x) = (1 + x/ν)^{-(ν+d)/2} for x ≥ 0.

APPENDIX A
DERIVATIVES OF THE QUADRATIC-COMPOSITE FUNCTION

The partial derivatives of b defined in (14) with respect to x_k, evaluated at X, are given by

  ∂b/∂x_k (X) = (Q_k(x_{k-1})^{1/2})⁻ᵀ u_k(x_{k-1}, x_k) - G_{k+1}(x_k, x_{k+1})ᵀ (Q_{k+1}(x_k)^{1/2})⁻ᵀ u_{k+1}(x_k, x_{k+1}) + q_{k+1}(x_k) - ω_k H_k(x_k)ᵀ (R_k(x_k)^{1/2})⁻ᵀ v_k(x_k) + r_k(x_k)

where G_k, H_k, q_k and r_k are defined in (27a) to (27d). The gradient ∇b(X) of b at the sequence X is the sequence whose k-th element is ∂b/∂x_k at X.

REFERENCES

[1] R. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME, Journal of Basic Engineering, Series D, vol. 82, pp. 35-45, 1960.
[2] S. Roweis and Z. Ghahramani, "A unifying review of linear Gaussian models," Neural Computation, vol. 11, no. 2, pp. 305-345, 1999.
[3] J. Morris, "The Kalman filter: A robust estimator for some classes of linear quadratic problems," IEEE Transactions on Information Theory, vol. 22, no. 5, pp. 526-534, September 1976.
[4] P. Huber, "Robust estimation of a location parameter," Annals of Mathematical Statistics, vol. 35, no. 1, pp. 73-101, 1964.
[5] D. Moore and G. McCabe, Introduction to the Practice of Statistics. W. H. Freeman.
[6] V. Barnett and T. Lewis, Outliers in Statistical Data. John Wiley & Sons.
[7] J. Ting, A. D'Souza, and S. Schaal, "Automatic outlier detection: A Bayesian approach," in Proceedings of the IEEE International Conference on Robotics and Automation, 2007.
[8] T. Bailey and H. Durrant-Whyte, "Simultaneous localization and mapping (SLAM): Part II," IEEE Robotics and Automation Magazine, vol. 13, no. 3, pp. 108-117, September 2006.
[9] J. Loxam and T. Drummond, "Student-t mixture filter for robust, real-time visual tracking," in Proceedings of the 10th European Conference on Computer Vision: Part III, 2008.
[10] D. Simon, Optimal State Estimation. John Wiley & Sons, 2006.
[11] A. Aravkin, J. V. Burke, and G. Pillonetto, "Robust and trend-following Student's t Kalman smoothers," Optimization Online preprint.
[12] R. Piché, S. Särkkä, and J. Hartikainen, "Recursive outlier-robust filtering and smoothing for non-linear systems using the multivariate Student-t distribution," in Proceedings of the IEEE International Workshop on Machine Learning for Signal Processing, 2012.
[13] S. Särkkä and A. Nummenmaa, "Recursive noise adaptive Kalman filtering by variational Bayesian approximations," IEEE Transactions on Automatic Control, vol. 54, no. 3, pp. 596-600, March 2009.
[14] G. Agamennoni, J. I. Nieto, and E. Nebot, "Approximate inference in state-space models with heavy-tailed noise," IEEE Transactions on Signal Processing, vol. 60, no. 10, pp. 5024-5037, October 2012.
[15] B. Kibria and A. Joarder, "A short review of the multivariate t-distribution," Journal of Statistical Research, vol. 40, no. 1, pp. 59-72, 2006.
[16] S. Kotz and S. Nadarajah, Multivariate t Distributions and Their Applications. Cambridge University Press, 2004.
[17] S. Kullback and R. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79-86, 1951.
[18] J. Nocedal and S. Wright, Numerical Optimization, P. Glynn and S. Robinson, Eds. Springer.
[19] S. Smith, "Differentiation of the Cholesky algorithm," Journal of Computational and Graphical Statistics, vol. 4, no. 2, June 1995.
[20] G. Golub and C. van Loan, Matrix Computations. The Johns Hopkins University Press.
[21] H. E. Rauch, F. Tung, and C. T. Striebel, "Maximum likelihood estimates of linear dynamic systems," American Institute of Aeronautics and Astronautics Journal, vol. 3, no. 8, pp. 1445-1450, 1965.
[22] G. Agamennoni. Community profile at MathWorks. [Online].
[23] E. Lorenz and K. Emanuel, "Optimal sites for supplementary weather observations: Simulation with a small model," Journal of the Atmospheric Sciences, vol. 55, no. 3, pp. 399-414, February 1998.
[24] P. Sakov, D. Oliver, and L. Bertino, "An iterative EnKF for strongly nonlinear systems," Monthly Weather Review, vol. 140, no. 6, June 2012.
[25] G. Evensen, Data Assimilation: The Ensemble Kalman Filter. Springer-Verlag, 2007.
[26] R. Levine, Z. Yu, W. Hanley, and J. Nitao, "Implementing component-wise Hastings algorithms," Computational Statistics & Data Analysis, vol. 48, 2005.
[27] K.-T. Fang, S. Kotz, and K. Ng, Symmetric Multivariate and Related Distributions. Chapman & Hall, 1990.
[28] S. Julier and J. Uhlmann, "A new extension of the Kalman filter to nonlinear systems," in International Symposium on Aerospace and Defense Sensing, Simulation and Control, 1997.
[29] B. Bell and F. Cathey, "The iterated Kalman filter update as a Gauss-Newton method," IEEE Transactions on Automatic Control, vol. 38, no. 2, pp. 294-297, February 1993.


More information

Data assimilation in high dimensions

Data assimilation in high dimensions Data assimilation in high dimensions David Kelly Courant Institute New York University New York NY www.dtbkelly.com February 12, 2015 Graduate seminar, CIMS David Kelly (CIMS) Data assimilation February

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Ensemble Data Assimilation and Uncertainty Quantification

Ensemble Data Assimilation and Uncertainty Quantification Ensemble Data Assimilation and Uncertainty Quantification Jeff Anderson National Center for Atmospheric Research pg 1 What is Data Assimilation? Observations combined with a Model forecast + to produce

More information

Particle Filters; Simultaneous Localization and Mapping (Intelligent Autonomous Robotics) Subramanian Ramamoorthy School of Informatics

Particle Filters; Simultaneous Localization and Mapping (Intelligent Autonomous Robotics) Subramanian Ramamoorthy School of Informatics Particle Filters; Simultaneous Localization and Mapping (Intelligent Autonomous Robotics) Subramanian Ramamoorthy School of Informatics Recap: State Estimation using Kalman Filter Project state and error

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Outline lecture 6 2(35)

Outline lecture 6 2(35) Outline lecture 35 Lecture Expectation aximization E and clustering Thomas Schön Division of Automatic Control Linöping University Linöping Sweden. Email: schon@isy.liu.se Phone: 13-1373 Office: House

More information

Gaussian Mixture Distance for Information Retrieval

Gaussian Mixture Distance for Information Retrieval Gaussian Mixture Distance for Information Retrieval X.Q. Li and I. King fxqli, ingg@cse.cuh.edu.h Department of omputer Science & Engineering The hinese University of Hong Kong Shatin, New Territories,

More information

An Introduction to Expectation-Maximization

An Introduction to Expectation-Maximization An Introduction to Expectation-Maximization Dahua Lin Abstract This notes reviews the basics about the Expectation-Maximization EM) algorithm, a popular approach to perform model estimation of the generative

More information

Learning Gaussian Process Models from Uncertain Data

Learning Gaussian Process Models from Uncertain Data Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop Music and Machine Learning (IFT68 Winter 8) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

More information

Cross entropy-based importance sampling using Gaussian densities revisited

Cross entropy-based importance sampling using Gaussian densities revisited Cross entropy-based importance sampling using Gaussian densities revisited Sebastian Geyer a,, Iason Papaioannou a, Daniel Straub a a Engineering Ris Analysis Group, Technische Universität München, Arcisstraße

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Self-Organization by Optimizing Free-Energy

Self-Organization by Optimizing Free-Energy Self-Organization by Optimizing Free-Energy J.J. Verbeek, N. Vlassis, B.J.A. Kröse University of Amsterdam, Informatics Institute Kruislaan 403, 1098 SJ Amsterdam, The Netherlands Abstract. We present

More information

Density Propagation for Continuous Temporal Chains Generative and Discriminative Models

Density Propagation for Continuous Temporal Chains Generative and Discriminative Models $ Technical Report, University of Toronto, CSRG-501, October 2004 Density Propagation for Continuous Temporal Chains Generative and Discriminative Models Cristian Sminchisescu and Allan Jepson Department

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information