Deep Nonlinear Non-Gaussian Filtering for Dynamical Systems
Arash Mehrjou
Department of Empirical Inference
Max Planck Institute for Intelligent Systems

Bernhard Schölkopf
Department of Empirical Inference
Max Planck Institute for Intelligent Systems

(Accepted to the Workshop on Infer2Control at the 32nd Conference on Neural Information Processing Systems (NIPS 2018). Do not distribute.)

Abstract

Filtering is a general name for inferring the states of a dynamical system given observations. The most common filtering approach is Gaussian Filtering (GF), where the distribution of the inferred states is a Gaussian whose mean is an affine function of the observations. There are two restrictions in this model: Gaussianity and affinity. We propose a model that relaxes both these assumptions based on recent advances in implicit generative models. Empirical results show that the proposed method gives a significant advantage over GF and over nonlinear methods based on fixed nonlinear kernels.

1 Introduction

Inference in dynamical systems is a long-standing problem in many control systems. We as intelligent agents are constantly inferring the states of nature and of the systems around us. Our observations are often so noisy and unreliable that they require us to first infer the underlying states and then make our decisions based on the estimated states. We can use two sources of information to infer the states causing the current observation: (1) the history of our estimates of the previous states; (2) the current observation. Fusing these two sources of information to obtain an accurate estimate of the current state of a dynamical system is generally called filtering.

Dynamical systems. We assume time-invariant closed-loop dynamical systems described as

    x_t = f(x_{t-1}, n_t)
    y_t = h(x_t, m_t)     (1)

where the subscript t corresponds to the current value and the subscript t-1 to the value one step before the current time in the discrete setting. This formulation is generic enough for the purposes of this paper; the path from the initial formulation to this simplified version is given in Appendix A. Obviously, this notation is correct only if the system satisfies the Markov property. In the above system, n_t and m_t come from simple noise models. Notice that the simplicity of these noise models is not restrictive, because they can be transformed into any complex distribution through the nonlinear functions f and h. In a physical system, the first line of (1) describes p(x_t | x_{t-1}), the evolution of the states of the system, and the second line describes p(y_t | x_t), the probabilistic model of the observations (sensors).
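To make the setup concrete, here is a minimal Python sketch of rolling out a system of the form (1); the particular f, h, noise scales, and horizon below are illustrative stand-ins of our own choosing, not the paper's benchmark system.

```python
# Sketch: rolling out a dynamical system of the form (1).
# The chosen f, h, and noise scales are hypothetical examples.
import numpy as np

def simulate(f, h, x0, T, rng=None):
    """Roll out x_t = f(x_{t-1}, n_t), y_t = h(x_t, m_t) for T steps."""
    rng = np.random.default_rng(0) if rng is None else rng
    x, xs, ys = x0, [], []
    for _ in range(T):
        n, m = rng.normal(), rng.normal()  # simple (Gaussian) noise models
        x = f(x, n)                        # state evolution, p(x_t | x_{t-1})
        y = h(x, m)                        # observation model, p(y_t | x_t)
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Example with a contrived nonlinear pair (f, h):
xs, ys = simulate(f=lambda x, n: np.tanh(x) + 0.1 * n,
                  h=lambda x, m: x ** 2 + 0.3 * m,
                  x0=0.0, T=1000)
```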
Filtering. The goal of filtering is to estimate the current state of the system. Assume the subscript [:t] refers to all time instances up to and including t. At the current moment, denoted by subscript t, we have seen the history of all observations y_{:t} = (y_{:t-1}, y_t). Thus, the inference over the states of a dynamical system can be written as a two-phase process: a prediction phase that models our belief about the next state given only the previous observations (dx is dropped for simplicity throughout the paper),

    p(x_t | y_{:t-1}) = \int_{x_{t-1}} p(x_t | x_{t-1}) p(x_{t-1} | y_{:t-1})     (2)

and an update phase that modulates our belief about the current state through Bayes's formula:

    p(x_t | y_{:t}) = p(y_t | x_t) p(x_t | y_{:t-1}) / \int_{x_t} p(y_t | x_t) p(x_t | y_{:t-1})     (3)

Kalman derived the closed-form solution for a linear process \dot{X} = AX + Bv with Gaussian noise v ~ N(0, I) [1]. The problem is, however, very difficult for a nonlinear process f and sensor model h except in very restricted cases [2, 3]. The actual goal of filtering is often not computing the posterior distribution of states. Instead, the goal is computing some expectation E[g(x_t)] of a function g of the current state with respect to p(x_t | y_{:t}) or p(x_t, y_t | y_{:t-1}). The former results in an intractable integral whose computation scales exponentially with the state dimension dim(x) [4]. However, computing the latter scales linearly with dim(x). Even though the integral with respect to the probability measure p(x_t, y_t | y_{:t-1}) is computationally feasible, approximating the probability distribution itself is difficult for high-dimensional states and observations. This problem has been approached by various methods, including the Unscented Kalman Filter (UKF) [5], the Extended Kalman Filter (EKF) [6], and the Particle Filter (PF) [7], where the first two are parametric and the last is non-parametric. Most of the parametric methods adopt a variational approach and approximate p(x_t, y_t | y_{:t-1}) by a q(x_t, y_t | y_{:t-1}) that belongs to a parametric hypothesis space. The assumed form for q must be such that it eases the conditioning on y_t, which is readily possible for a Gaussian q. Nonetheless, a Gaussian distribution is not a realistic assumption for p except in very limited applications. In this paper, we propose an easily trainable and highly expressive variational distribution and an efficient method to learn its parameters.

2 Gaussian Filtering

In common filtering applications, what we usually care about is an expectation of the following form:

    E[g(x_t, y_t)] = \int_{x_t, y_t} g(x_t, y_t) p(x_t, y_t | y_{:t-1}) = \int_{x_t, m_t} g(x_t, h(x_t, m_t)) p(m_t) p(x_t | y_{:t-1})     (4)

where the right-hand integral is derived by plugging the observation model of (1) into the expectation. This integral is computable by Monte Carlo methods when the distribution p(x_t | y_{:t-1}) can be sampled efficiently and the noise has a simple model p(m_t). As a special case, integrals with respect to p(x_t | y_{:t-1}) can be computed efficiently as well. However, this requires p(x_t | y_t) to be easily computable from p(x_t, y_t), which is not the case for most distributions except very simple ones such as Gaussians. To ease the presentation, let us focus only on the prediction step (2) to compute p(x_t | y_{:t-1}). The history of observations y_{:t-1} is implicit in the model. Thus, we drop the indices and represent x_t by x and y_t by y. For example, p(x_t | y_{:t}) = p(x_t | y_t, y_{:t-1}) is simply represented by p(x | y). As mentioned in the previous section, filtering tries to find a good approximation to p(x, y) and ultimately to p(x | y). This process is carried out by first approximating p(x, y) by q(x, y) and then computing q(x | y) from q(x, y). The distribution q(x, y) is often chosen from a hypothesis space with limited capacity. For a Gaussian hypothesis set, we have

    q(x, y) = N( [\mu_x; \mu_y], [\Sigma_{xx}, \Sigma_{xy}; \Sigma_{yx}, \Sigma_{yy}] )
    q(x | y) = N( \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y),  \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{xy}^T )     (5)

which is in general called the Gaussian Filter (GF). There are two obvious limitations in this framework: first, the posterior distribution (5) is Gaussian; second, the mean of the posterior distribution of states in (5) is an affine function of the observations. In the next section, we relax both these assumptions.
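As a concrete reference point, the GF conditioning step (5) is only a few lines of linear algebra. The following is a minimal sketch under our own naming conventions; the paper states the formula but gives no implementation.

```python
# Sketch: Gaussian conditioning (5) -- from the moments of a joint Gaussian
# q(x, y) to the Gaussian conditional q(x | y).
import numpy as np

def gaussian_condition(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    """Mean and covariance of q(x | y) for a joint Gaussian q(x, y)."""
    K = S_xy @ np.linalg.inv(S_yy)    # gain matrix Sigma_xy Sigma_yy^{-1}
    mean = mu_x + K @ (y - mu_y)      # affine in the observation y
    cov = S_xx - K @ S_xy.T           # does not depend on the value of y
    return mean, cov

# Toy usage with 1-D state and observation:
mean, cov = gaussian_condition(np.zeros(1), np.zeros(1),
                               np.array([[1.0]]), np.array([[0.8]]),
                               np.array([[2.0]]), y=np.array([1.5]))
```

Both limitations discussed above are visible here: the returned distribution is Gaussian, and its mean is affine in y.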
[Figure 1 appears here: (a) State-observation evolution; (b) the architecture for sampling from the posterior, x = \psi(\phi(y_{t-T:t}), z) with z ~ N(0, I).]

Figure 1: (a) Solid lines show how observations are generated by the evolution of states in the Markovian setting; dashed lines show the non-Markovian setting. (b) The observations of the T previous timesteps are fed to the network and transformed into nonlinear features. The features are concatenated with samples from an external source of noise (z) and passed through a nonlinear function whose output is supposed to match samples from the posterior distribution of states given the observations of the last T timesteps.

3 Nonlinear Non-Gaussian Filtering

We take a nonlinear approach and directly approximate the conditional distribution p(x | y) of (3) by a Multilayer Perceptron (MLP) as a universal function approximator [8]. In this formulation, q(x | y) = D(x | \phi(y)), where \phi is a nonlinear function of y. Moreover, D can be any complex distribution over x belonging to the n-dimensional state space. We do not compute D directly. Rather, we generate samples x_i such that x_i ~ D. In analogy with kernel machines, we call \phi : R^m -> R^M a feature extractor that transforms the measurements by a nonlinear function from the m-dimensional sensor space to the M-dimensional feature space. Let us assume \phi is parameterized by an MLP as \phi(y; \theta_\phi). This is, after all, a deterministic mapping and lacks the required stochasticity. Therefore, we provide the stochastic fuel to q(x | y) by passing samples z ~ N(0, I), alongside the extracted features \phi(y; \theta_\phi), through a secondary parameterized function \psi(z, \phi(y; \theta_\phi); \theta_\psi). Back-propagation is then used to perturb the parameters \theta_\phi and \theta_\psi so as to make the output of \psi close to samples from p(x | y). The overall architecture, partly inspired by [9], is shown in Fig. 1(b). The dashed arrows in Fig. 1(a) suggest the possibility of weakening the Markovian assumption of (1) so that distant states in the past can influence the current observation. Despite the difficulty of filtering for non-Markovian systems in other methods [4], the proposed method can take care of it simply by feeding more observations from the past into the network, as depicted in Fig. 1(b); see the sketch below.
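The following PyTorch sketch shows one possible realization of this architecture; the layer sizes and the noise dimension are our own illustrative choices, not the paper's exact configuration.

```python
# Sketch: the sampling architecture of Fig. 1(b) -- features phi(y) are
# concatenated with external noise z ~ N(0, I) and mapped by psi to a
# state sample from the implicit posterior q(x | y).
import torch
import torch.nn as nn

class PosteriorSampler(nn.Module):
    def __init__(self, obs_dim, feat_dim, noise_dim, state_dim, hidden=128):
        super().__init__()
        self.phi = nn.Sequential(   # feature extractor phi(y; theta_phi)
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, feat_dim))
        self.psi = nn.Sequential(   # stochastic map psi(z, phi(y); theta_psi)
            nn.Linear(feat_dim + noise_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))
        self.noise_dim = noise_dim

    def forward(self, y, n_samples=1):
        feats = self.phi(y)                                # (B, feat_dim)
        feats = feats.repeat_interleave(n_samples, dim=0)  # K samples per y
        z = torch.randn(feats.shape[0], self.noise_dim)    # external noise
        return self.psi(torch.cat([feats, z], dim=-1))     # x ~ q(x | y)

# Feeding a window y_{t-T:t}, as in the non-Markovian variant, only means
# obs_dim = (T + 1) * sensor_dim.
sampler = PosteriorSampler(obs_dim=1, feat_dim=10, noise_dim=4, state_dim=1)
x_samples = sampler(torch.zeros(8, 1), n_samples=5)        # shape (40, 1)
```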
Learning the state posterior. The proposed method is expected to accurately capture the posterior distribution p(x | y) by q(x | y), where q(x | y) is much more flexible than a Gaussian. We define the loss l : X x X -> R^+ \cup {0} as the simple Euclidean distance l(x, x') = ||x - x'|| for x ~ p(x | y) and x' ~ q(x | y). Since an MLP can theoretically capture arbitrarily complex functions [8], we go one step further and make q(x(t) | y(t)) a function of not only y(t) but also a few previous observations of the system, which turns the implicit distribution into q(x_t | y_{t-T:t}), where T is the approximate time interval in the past over which the observations are informative about the current hidden state of the dynamical system. Inspired by [9], given any non-negative symmetric loss function l(x, x') for (x, x') in X x X, we define the diversity coefficient as

    \Delta_l(p, q) = E_{x' ~ q(x | y_{t-T:t})} [ E_{x ~ p(x | y_{t-T:t})} [ l(x, x') ] ]     (7)

On the other hand, due to the uncertainty in the posterior, we know that q(x_t | y_{t-T:t}) should not collapse to an extremely low-entropy distribution. To encourage the implicitly estimated posterior to have higher entropy, we add a diversity-encouraging term to the loss function in which similarity among samples from the same distribution acts as a repulsive force. Hence, the overall loss function becomes

    L_l(q) = \Delta_l(p, q) - \lambda \Delta_l(q, q)     (8)

where \lambda is a hyper-parameter that roughly controls the empirical entropy of the samples generated from the implicit variational posterior q(x | y).
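Under the Euclidean l, a Monte Carlo version of (8) (made precise in the Optimization paragraph below) can be sketched as follows; the tensor shapes and names are our own convention.

```python
# Sketch: empirical loss (8). The first term pulls generated samples toward
# the true states; the second (weighted by lambda) pushes samples from q
# apart, keeping the empirical entropy of q up.
import torch

def disco_loss(x_true, x_gen, lam=1.0):
    """x_true: (B, D) true states; x_gen: (B, K, D) samples from q(x | y)."""
    # \hat{\Delta}_l(p, q): mean distance from each true state to its K samples
    fit = (x_gen - x_true.unsqueeze(1)).norm(dim=-1).mean()
    # \hat{\Delta}_l(q, q): mean pairwise distance among the K samples
    pair = (x_gen.unsqueeze(1) - x_gen.unsqueeze(2)).norm(dim=-1)  # (B, K, K)
    K = x_gen.shape[1]
    diversity = pair.sum() / (pair.shape[0] * K * (K - 1))  # diagonal is zero
    return fit - lam * diversity
```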
[Figure 2 appears here: estimated posterior p(x | y) for (a) the Gaussian Filter, (b) the proposed method, (c) the Nonlinear Gaussian Filter (degree 3), and (d) the Nonlinear Gaussian Filter (degree 7).]

Figure 2: Estimated posterior by different methods. The shaded area shows the standard deviation around the mean. Notice that the observation is on the vertical axis, so the uncertainty is the width of the shaded area along the horizontal axis. As can be seen, GF (a) is quite inaccurate, since its posterior is Gaussian and its mean is only an affine function of the observation. NGF (c), with a polynomial nonlinearity of degree 3, fits the posterior better than GF, but it needs a good choice of nonlinearity to give an acceptable result; otherwise, the estimated posterior can be too simple or too complex (d). Moreover, this method becomes very uncertain in part of the observation range, which means it can give many different values of x for observations in that region. The proposed method shows good performance, which is due to its two trainable nonlinearities: roughly speaking, \phi approximates the mean and \psi approximates the variance of p(x | y) corresponding to each observation.

Optimization. In practice, the loss function (8) is approximated empirically by the sums

    \hat{\Delta}_l(p, q_{\theta_\phi, \theta_\psi}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{K} \sum_{k=1}^{K} l( x_n, \psi(\phi(y_n; \theta_\phi), z_k; \theta_\psi) )

    \hat{\Delta}_l(q_{\theta_\phi, \theta_\psi}, q_{\theta_\phi, \theta_\psi}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{K(K-1)} \sum_{k=1}^{K} \sum_{k' \neq k} l( \psi(\phi(y_n; \theta_\phi), z_k; \theta_\psi), \psi(\phi(y_n; \theta_\phi), z_{k'}; \theta_\psi) )

and the loss function \hat{L}(\theta_\phi, \theta_\psi) = \hat{\Delta}_l(p, q_{\theta_\phi, \theta_\psi}) - \lambda \hat{\Delta}_l(q_{\theta_\phi, \theta_\psi}, q_{\theta_\phi, \theta_\psi}) is optimized with respect to {\theta_\phi, \theta_\psi} by gradient descent, as in the sketch below. Notice that, for each training example (x_n, y_n) from the training set, we need to sample K values of the external noise z. A greater K results in faster convergence.
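A minimal training-step sketch, assuming the PosteriorSampler and disco_loss sketches above and a training set of (observation, state) pairs; the optimizer settings are placeholders, not the paper's exact hyper-parameters.

```python
# Sketch: fitting the implicit posterior by gradient descent on (8).
import torch

def train(sampler, ys, xs, K=8, lam=1.0, lr=1e-3, steps=1000):
    """ys: (B, obs_dim) observations; xs: (B, state_dim) matching states."""
    opt = torch.optim.Adam(sampler.parameters(), lr=lr)
    for _ in range(steps):
        x_gen = sampler(ys, n_samples=K)          # (B*K, state_dim)
        x_gen = x_gen.view(ys.shape[0], K, -1)    # regroup to (B, K, state_dim)
        loss = disco_loss(xs, x_gen, lam=lam)     # empirical loss (8)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sampler
```

The regrouping works because repeat_interleave in the sampler places the K samples for each observation in consecutive rows.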
4 Experiments

We compare the performance of the proposed method with the Gaussian Filter (GF) and with the Nonlinear Gaussian Filter (NGF) [4], where a fixed nonlinear feature extractor is used to transform sensor measurements to the feature space. The system is described by f(x_{t-1}, n_t) = x_{t-1} + n_t, h(x_t, m_t) = x_t + m_t + 5 H(x_t), and p(x_{t-1}) = N(x_{t-1}; 0, 5), where H(.) is the Heaviside step function. See Fig. 2 and its caption for a description, and Appendix B for more details.

5 Conclusion

We proposed a method that learns to filter the states of dynamical systems given previous values of sensor measurements and states. It benefits from the flexibility of MLPs to deal with major limitations of other methods, such as linearity, Gaussianity, and the Markovian assumption, in a simple unified way. Notice that the method generates samples from the posterior, which is enough for computing the integral (3). The method, however, cannot be used in applications where the evaluation of p(x | y) itself is required, or where samples of (observation, state) pairs are not available for the learning phase.

References

[1] Rudolph Emil Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35-45, 1960.
[2] V. E. Beneš. Exact finite-dimensional filters for certain diffusions with nonlinear drift. Stochastics: An International Journal of Probability and Stochastic Processes, 5(1-2):65-92, 1981.
[3] Frederick Daum. Exact finite-dimensional nonlinear filters. IEEE Transactions on Automatic Control, 31(7):616-622, 1986.
[4] Manuel Wüthrich, Sebastian Trimpe, Cristina Garcia Cifuentes, Daniel Kappler, and Stefan Schaal. A new perspective and extension of the Gaussian filter. The International Journal of Robotics Research, 35(14):1731-1749, 2016.
[5] Simon J. Julier and Jeffrey K. Uhlmann. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI, volume 3068, pages 182-193. International Society for Optics and Photonics, 1997.
[6] Harold Wayne Sorenson. Kalman Filtering: Theory and Application. IEEE Press, 1985.
[7] Neil J. Gordon, David J. Salmond, and Adrian F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107-113. IET, 1993.
[8] Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251-257, 1991.
[9] Diane Bouchacourt, Pawan K. Mudigonda, and Sebastian Nowozin. DISCO Nets: Dissimilarity Coefficients Networks. In Advances in Neural Information Processing Systems, pages 352-360, 2016.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Appendices

A Description of dynamical systems

We assume the generic formulation of dynamical systems as follows:

    \dot{x}(t) = f(x(t), u(t), t)
    y(t) = h(x(t), u(t), t)     (9)

As a simplifying assumption, we ignore the explicit dependence on time for now. Moreover, we assume the system is closed loop, i.e., u(t) is designed by state feedback to be a function of the states, u(x(t)). By considering the system as time-invariant and discrete (which is the case in practice, where reading the sensors and issuing the control signals are performed by digital systems), we denote the current moment by t and the timestep before it by t-1. Therefore, the description of the dynamics f and the observation model h simplifies to (1).

B Details on the experiment

The experiment was performed on the following nonlinear stochastic dynamical system proposed in [4]:

    x_t = x_{t-1} + n_t
    y_t = x_t + m_t + 5 H(x_t)     (10)

where the subscripts have the meaning described earlier for (1). The state and observation noises are both Gaussian, with variances 0.1 and 0.3 respectively. The value of \lambda in the loss function (8) is set to 1; however, we obtained comparable results for a range of \lambda values around this choice. The training set is generated by running the dynamical system forward from a random starting state x_0 ~ N(0, 1). The generated training set is then used to minimize the loss function (8) with the Adam optimizer [10], with a learning-rate decay of 0.95 applied at regular intervals and mini-batch training. Training of the networks is continued until the values of the parameters converge. Notice that the shaded area in Fig. 2 shows the standard deviation around the mean. In GF and NGF, the standard deviation has a closed-form formula. In the proposed method, since the posterior distribution is implicit and we only have access to samples generated from the approximated posterior, the variance is computed empirically from the samples generated from q(x | y) and plotted to show that the diversity of the generated samples matches the diversity of the actual posterior distribution p(x | y).

Network architecture. We used almost the same network architecture for both the \phi and \psi functions: two hidden layers, each with 128 neurons and tanh nonlinearity, followed by a linear output layer. The only difference between the networks realizing \phi and \psi is that the former has an output layer whose width equals the dimension M of the feature space to which the sensor measurements are transformed, while \psi obviously has the same output dimension as the dimension of the state. We did experiments with several other dynamical systems with different state/sensor dimensions and consistently observed improvement over GF and NGF.
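For reference, a minimal sketch of generating training data for the benchmark system (10); the noise variances and the initial-state distribution follow this appendix, while the trajectory length and seeding are our own illustrative choices.

```python
# Sketch: sampling a trajectory from the benchmark system (10).
import numpy as np

def generate_benchmark(T, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.normal(0.0, 1.0)                   # x_0 ~ N(0, 1)
    xs, ys = [], []
    for _ in range(T):
        x = x + rng.normal(0.0, np.sqrt(0.1))  # x_t = x_{t-1} + n_t, Var = 0.1
        y = x + rng.normal(0.0, np.sqrt(0.3)) + 5.0 * (x > 0)  # 5 H(x_t) term
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

xs, ys = generate_benchmark(T=10_000)  # T is an illustrative choice
```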