The 1d Kalman Filter

Richard Turner


This is a Jekyll and Hyde of a document and should really be split up. We start with Jekyll, which contains a very short derivation of the 1d Kalman filter, the purpose of which is to give intuitions about its more complex cousin. I find the Kalman filter / linear Gaussian state space model hard to intuit; this is an attempt to remedy that. Hyde can be found afterwards: there I show the connection between linear Gaussian state space models, Gaussian processes and the Wiener filter. This document is missing some pictures and references, apologies for that.

1 Understanding the forward model

Let's define the probabilistic model, together with some ways of interpreting it, and recap the task the Kalman filter solves for us. The 1d linear Gaussian state space model can be written:

P(x_1) = Norm(µ_1, σ_1²)   (1)
P(x_t | x_{t-1}) = Norm(λ x_{t-1}, σ²)   (2)
P(y_t | x_t) = Norm(x_t, σ_y²)   (3)

Notice that we are free to rescale the parameters, and therefore the latents, so the restriction of the emission weights to unity is unimportant. There are several ways of thinking about this model, and the most useful will depend on the situation. Let's think about generating data according to the forward model. At each time step observations are produced as they are in a factor analysis model. However, rather than having uncorrelated latents at each time step drawn from a white Gaussian, the latents are now correlated across time. For instance, if λ → 1 and σ² → 0 the latents we generate become very smooth and slowly varying, and in the other limit (λ → 0 and σ² → 1) the latents are highly variable and we recover factor analysis. This model, then, specifies a more general prior distribution over functions for the latents than factor analysis, but is otherwise identical. (For the keen, later on I give another, more general interpretation of the LGSSM, but this is really all you need to keep in your head.)
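Since the pictures of samples from the forward model are missing, here is a minimal numpy sketch of how such samples could be drawn from equations (1)-(3); the function name and the parameter values are my own illustrative choices, not anything from the text.

import numpy as np

def sample_lgssm(T, lam, sigma, sigma_y, mu1=0.0, sigma1=1.0, seed=0):
    # Draw one realisation (x_{1:T}, y_{1:T}) from equations (1)-(3).
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    x[0] = mu1 + sigma1 * rng.standard_normal()                  # equation (1)
    for t in range(1, T):
        x[t] = lam * x[t-1] + sigma * rng.standard_normal()      # equation (2)
    y = x + sigma_y * rng.standard_normal(T)                     # equation (3)
    return x, y

# lam near 1 with small sigma gives smooth, slowly varying latents;
# lam = 0 with sigma = 1 gives white, factor-analysis-style latents.
x_smooth, y_smooth = sample_lgssm(T=200, lam=0.99, sigma=0.1, sigma_y=0.5)
x_white, y_white = sample_lgssm(T=200, lam=0.0, sigma=1.0, sigma_y=0.5)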

Let's give a concrete example where this might be a useful model. Imagine you are on a geography field trip trying to map out the depth of a lake along its length. The lake is cold, deep, and murky so you can't see the bottom, nor would you want to swim in it, but you do have a boat. You decide to measure the depth using a long piece of rope with a rock tied to the end. You lower the rope into the water until you feel it hit the bottom, and measure the length of rope to get an estimate of the depth. Being an organised person, you move equal distances along the lake between each measurement. However, due to currents in the water, an inability to determine precisely when the rock hits the bottom, and your boat drifting a bit, the depth measurements are noisy. You estimate that they might fluctuate by as much as half a metre from the true depth. However, your geography teacher tells you that the lakes in this area are known to have smoothly undulating bottoms (?). By injecting this prior information, that the depth is unlikely to vary much between two nearby points, you could estimate the depth at any point with an accuracy greater than that intrinsic to the rope and rock.

Insert some pictures of sampling from the forward model here

2 Inference

Given some data y_{1:T} and a setting of the parameters {µ_1, σ_1², λ, σ², σ_y²}, there are several inferences we may wish to make. We may wish to infer the posterior distribution over one of the latents online: P(x_t | y_{1:t}). This is exactly the problem the Kalman filter solves. There are other interesting distributions which it does not return: for instance P(x_t | y_{1:T}) (for this we need the Kalman smoother), P(x_{1:T} | y_{1:T}) (for this, the Wiener filter), and P(x_{t+1} | y_{1:t}), which crops up any time we want to make predictions (e.g. in GPs).

Given the structure of the graph, it seems intuitive that we should be able to use the quantity P(x_{t-1} | y_{1:t-1}) to help us calculate P(x_t | y_{1:t}). If this were true, it would enable us to calculate the distribution over x_t recursively and therefore efficiently. Let's see if we can manipulate P(x_t | y_{1:t}) to do this. The general approach is to introduce the latent x_{t-1}, and use the (in)dependencies specified by the graphical model to simplify the result.

P(x_t | y_{1:t}) = ∫ P(x_t, x_{t-1} | y_{1:t}) dx_{t-1}   (4)
= (1/P(y_{1:t})) ∫ P(x_t, x_{t-1}, y_{1:t}) dx_{t-1}   (5)
= (P(y_{1:t-1})/P(y_{1:t})) ∫ P(x_t | x_{t-1}) P(y_t | x_t) P(x_{t-1} | y_{1:t-1}) dx_{t-1}   (6)
= (1/Z) P(y_t | x_t) ∫ P(x_t | x_{t-1}) P(x_{t-1} | y_{1:t-1}) dx_{t-1}   (7)
= (1/Z) P(y_t | x_t) P(x_t | y_{1:t-1})   (8)
= (1/Z) × likelihood × prior   (9)

Here Z = P(y_{1:t})/P(y_{1:t-1}) = P(y_t | y_{1:t-1}) is a normalising constant. The distribution over the current latent is found by combining what the new observation told us about it with what the previous observations told us.
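As a check on the step from (7) to (8), here is the prediction-step integral worked explicitly. Writing the previous filtered posterior as P(x_{t-1} | y_{1:t-1}) = Norm(x̄_{t-1}, σ_{t-1}²), the notation used below, this is just the standard result for marginalising a linear Gaussian:

P(x_t | y_{1:t-1}) = ∫ P(x_t | x_{t-1}) P(x_{t-1} | y_{1:t-1}) dx_{t-1}
= ∫ Norm_{x_t}(λ x_{t-1}, σ²) Norm_{x_{t-1}}(x̄_{t-1}, σ_{t-1}²) dx_{t-1}
= Norm_{x_t}(λ x̄_{t-1}, λ² σ_{t-1}² + σ²)

since x_t = λ x_{t-1} + ε_t with ε_t ~ Norm(0, σ²) independent of x_{t-1}, so the mean scales by λ and the variances add. This is the prior that appears in equations (8) and (9).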

Both the likelihood function and the prior are Gaussian balls in latent space, and so too is the distribution over x_t.

Insert pictures of inference using the likelihood and the prior distributions here

Substituting in for these distributions:

P(x_t | y_{1:t}) = (1/Z) ∫ exp[ -(x_t - λ x_{t-1})²/(2σ²) - (y_t - x_t)²/(2σ_y²) - (x_{t-1} - x̄_{t-1})²/(2σ_{t-1}²) ] dx_{t-1}   (10)
= Norm(x̄_t, σ_t²)   (11)

The integral is performed by completing the square in the exponent for the x_{t-1} terms. Keeping only the terms in x_t:

P(x_t | y_{1:t}) = (1/Z) exp( -(1/2) [ x_t² (1/σ_y² + 1/(σ² + λ² σ_{t-1}²)) - 2 x_t (y_t/σ_y² + λ x̄_{t-1}/(σ² + λ² σ_{t-1}²)) ] )   (12, 13)

From this the means and variances can be recovered by moment matching, giving the Kalman filter recursions:

x̄_t = λ x̄_{t-1} + K_t (y_t - λ x̄_{t-1})   (14)
σ_t² = σ_y² K_t   (15)
K_t = (σ² + λ² σ_{t-1}²) / (σ² + σ_y² + λ² σ_{t-1}²)   (16)

These are the celebrated Kalman filter recursions for our simple 1d model, and K_t is known as the Kalman gain. We can understand them better by considering the form of the prior distribution. It is exactly as we might have expected: P(x_t | y_{1:t-1}) = Norm(λ x̄_{t-1}, λ² σ_{t-1}² + σ²). (N.B. If our prior beliefs about x are that they do not blow up exponentially, we must have λ ≤ 1.) We interpret the Kalman filter recursions as follows:

x̄_t = prior estimate for x_t + correction   (17)
correction = K_t × (ML estimate for x_t - prior estimate for x_t)   (18)
σ_t² = ML variance of x_t × K_t   (19)
K_t = prior variance of x_t / (ML variance of x_t + prior variance of x_t)   (20)

You can check that this does all the right things as the ML variance of x_t tends to infinity or zero.
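To make the recursions concrete, here is a minimal numpy sketch of the 1d filter. The function name, the initialisation convention and the parameter values in the usage comment are my own illustrative choices (e.g. roughly half-metre measurement noise for the lake survey), not values given in the text.

import numpy as np

def kalman_filter_1d(y, lam, sigma2, sigma2_y, mu1=0.0, sigma2_1=1.0):
    # Run equations (14)-(16) over observations y_{1:T}.
    # Returns the filtered means and variances of P(x_t | y_{1:t}).
    T = len(y)
    xbar = np.empty(T)
    var = np.empty(T)
    # first step: combine the prior Norm(mu1, sigma2_1) with the first observation
    K = sigma2_1 / (sigma2_1 + sigma2_y)
    xbar[0] = mu1 + K * (y[0] - mu1)
    var[0] = sigma2_y * K
    for t in range(1, T):
        prior_var = lam**2 * var[t-1] + sigma2                    # λ² σ_{t-1}² + σ²
        K = prior_var / (prior_var + sigma2_y)                    # Kalman gain, equation (16)
        xbar[t] = lam * xbar[t-1] + K * (y[t] - lam * xbar[t-1])  # equation (14)
        var[t] = sigma2_y * K                                     # equation (15)
    return xbar, var

# e.g. for the lake survey, with illustrative noise levels:
# xbar, var = kalman_filter_1d(y, lam=0.95, sigma2=0.05, sigma2_y=0.25)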

3 Connection to GPs

Here's the second way of thinking about the model that I promised earlier. It makes the connection with GPs and the Wiener filter. (If you are reading this for a simple introduction, you can ignore this section without missing anything.)

Earlier we made an analogy to FA where the generation of data (y_t) at each time step mapped to a single realisation from the FA model. There is an alternative: we could map a draw from the whole chain (y_{1:T}) to a single realisation from the FA model. Different draws from the entire chain are then equivalent to different realisations from the FA model. Using the example above, FA would specify a white Gaussian prior over the latents: P(x_{1:T}) = Norm(0, I). The linear dynamical state space model specifies a correlated prior parameterised by λ and σ². Let's be a little more explicit about this correspondence by finding the precise prior distribution the LDSSM specifies.

P(x_{1:T}) = P(x_1) ∏_{t=2}^T P(x_t | x_{t-1})   (21)
= (1/Z) exp( -(1/2) [ x_1²/σ_1² + ∑_{t=2}^T (x_t - λ x_{t-1})²/σ² ] )   (22)
= (1/Z) exp( -(1/2) [ x_1² (1/σ_1² + λ²/σ²) - 2 x_2 x_1 λ/σ² + x_2² (1 + λ²)/σ² - 2 x_3 x_2 λ/σ² + x_3² (1 + λ²)/σ² - 2 x_4 x_3 λ/σ² + … ] )   (23, 24)
= Norm(0, Σ)   (25)

where the last line follows from the fact that a linear transformation of Gaussian random variables is also Gaussian. This makes the connection to FA explicit: FA is the special case Σ = I. In this sense the LGSSM is more general than FA. A quick caveat should be added with regard to the emission weights: the LGSSM specifies causal emissions (the emission at time t depends only on the latents at time t), whereas FA does not. For this reason, this model is more restrictive than FA: there are lots of zeros in the weight matrix W in P(y_{1:T} | x_{1:T}) = Norm(W x_{1:T}, Φ). By matching terms we find:

Σ^{-1} =
[ 1/σ_1² + λ²/σ²   -λ/σ²           0               0            …
  -λ/σ²            (1 + λ²)/σ²     -λ/σ²           0            …
  0                -λ/σ²           (1 + λ²)/σ²     -λ/σ²        …
  0                0               -λ/σ²           (1 + λ²)/σ²  …
  …                …               …               …              ]   (26)
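As a numerical sanity check (a sketch with my own choice of λ, σ² and T, not values from the text), one can build this tridiagonal precision matrix and invert it; with σ_1² set to the stationary variance σ²/(1 - λ²), the resulting covariance is the exponentially decaying one described next.

import numpy as np

# illustrative values (my own assumptions, not from the text)
lam, sigma2, T = 0.9, 1.0, 8
sigma2_1 = sigma2 / (1 - lam**2)                 # stationary marginal variance

# tridiagonal precision matrix of equation (26)
P = np.zeros((T, T))
P[0, 0] = 1.0 / sigma2_1 + lam**2 / sigma2
for t in range(1, T):
    P[t, t] = (1 + lam**2) / sigma2
    P[t-1, t] = P[t, t-1] = -lam / sigma2
P[T-1, T-1] = 1.0 / sigma2                       # the last latent has no successor term

Sigma = np.linalg.inv(P)
idx = np.arange(T)
expected = sigma2 * lam ** np.abs(np.subtract.outer(idx, idx)) / (1 - lam**2)
print(np.allclose(Sigma, expected))              # True: Sigma[i, j] = σ² λ^|i-j| / (1 - λ²)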

If σ_1² = σ²/(1 - λ²), this means Σ_{i,j} = σ² λ^{|i-j|}/(1 - λ²). The covariance of the prior is stationary (as it must be, due to the shift-invariant specification) and decays exponentially. The power spectrum (the eigenvalues of the covariance matrix) is therefore ∝ 1/(1 + ω²/log[λ]²), so high-frequency components decay away quadratically, with a characteristic length 1/|log λ|, matching the intuition that values of λ near 1 correspond to smoother priors. The LGSSM therefore specifies a stationary Gaussian process prior over functions for the discretely sampled latents. The Wiener filter is the method for returning the full posterior distribution over the latents, P(x_{1:T} | y_{1:T}), for a model with a Gaussian process prior over the latents and Gaussian emissions. (There's a document about it on my website too.)

We have moved from Kalman filters to Gaussian processes, but we can move the other way too. Kalman filters specify their prior using conditional distributions, whereas GPs typically specify the full distribution:

P(x_{1:T} | θ_GP) = G_{x_{1:T}}(0, C_{1:T})   (27)
P(x_t | x_{t-τ:t-1}, θ_SSM) = G_{x_t}(Λ x_{t-τ:t-1}, σ²)   (28)

Firstly, we should ensure the GP has the same conditional independence relations (directed graph) as the SSM. Thus any pair of variables x_i, x_j with |i - j| > τ should be conditionally independent given the variables in between, and so [C^{-1}]_{i,j} = 0. Furthermore, as the SSM specifies a conditional which is shift invariant, C will be Toeplitz (and symmetric). Now we find the conditions on the parameters for equivalence by forming the conditional for the GP (which is a special case of the predictive distribution):

P(x_t | x_{t-τ:t-1}, θ_GP) = P(x_{t-τ:t} | θ_GP) / P(x_{t-τ:t-1} | θ_GP)   (29)

where

P(x_{t_1:t_2} | θ_GP) = ∫ P(x_{1:T} | θ_GP) dx_{1:t_1-1} dx_{t_2+1:T}   (30)

But from the definition of the GP (its consistent marginalisation properties), if

C_{1:T} = [ C_{1:t_1-1}   L             M
            L^T           C_{t_1:t_2}   N
            M^T           N^T           C_{t_2+1:T} ]   (31)

then

P(x_{t_1:t_2} | θ_GP) = G_{x_{t_1:t_2}}(0, C_{t_1:t_2})   (32)

and thus

P(x_{t-τ:t} | θ_GP) = G_{x_{t-τ:t}}(0, C_{t-τ:t})   (33)
P(x_{t-τ:t-1} | θ_GP) = G_{x_{t-τ:t-1}}(0, C_{t-τ:t-1})   (34)

where the two covariance matrices are related by

C_{t-τ:t} = [ C_{t-τ:t-1}   c
              c^T           c_t ]   (35)

with c the vector of covariances between x_{t-τ:t-1} and x_t, and c_t the marginal variance of x_t. Completing the square, we find (after von Mises, 1964):

P(x_t | x_{t-τ:t-1}, θ_GP) = G_{x_t}(c^T C_{t-τ:t-1}^{-1} x_{t-τ:t-1}, c_t - c^T C_{t-τ:t-1}^{-1} c)   (36)

Therefore the conditions for equivalence are:

Λ = c^T C_{t-τ:t-1}^{-1}   (37)
σ² = c_t - c^T C_{t-τ:t-1}^{-1} c   (38)
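As a closing sketch (my own illustrative check, reusing the τ = 1 exponential covariance from the previous section rather than anything new), equations (37) and (38) can be evaluated numerically to recover λ and σ² from the GP covariance:

import numpy as np

# illustrative parameters (assumptions, not from the text)
lam, sigma2, T, tau = 0.9, 1.0, 50, 1

# stationary exponential covariance C_{i,j} = σ² λ^|i-j| / (1 - λ²) from earlier
idx = np.arange(T)
C = sigma2 * lam ** np.abs(np.subtract.outer(idx, idx)) / (1 - lam**2)

t = 10                                             # any interior time index works
C_past = C[t-tau:t, t-tau:t]                       # C_{t-τ:t-1}
c = C[t-tau:t, t]                                  # cross-covariance vector c
c_t = C[t, t]                                      # marginal variance of x_t

Lambda = c @ np.linalg.inv(C_past)                 # equation (37)
sigma2_rec = c_t - c @ np.linalg.inv(C_past) @ c   # equation (38)
print(Lambda, sigma2_rec)                          # ≈ [0.9] (= λ) and ≈ 1.0 (= σ²)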