Statistical Filtering and Control for AI and Robotics. Part II. Linear methods for regression & Kalman filtering


Statistical Filtering and Control for AI and Robotics, Part II: Linear methods for regression & Kalman filtering. Riccardo Muradore

Outline: Linear Methods for Regression; Gaussian filter; Stochastic model; Kalman filtering; Kalman smoother

References. These lectures are based on the following books: Sebastian Thrun, Wolfram Burgard and Dieter Fox, Probabilistic Robotics, MIT Press, 2005; Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009. Several pictures from those books have been copied and pasted here.

Linear Methods for Regression

Supervised learning: use the inputs (i.e. predictors, independent variables, features) to predict the values of the outputs (i.e. responses, dependent variables). This distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs. Notation: x ∈ R^m is a random variable (x_i ∈ R is its i-th component); x ∈ R^m is an observation of the random variable x (x_i ∈ R is its i-th component); X ∈ R^{N×m} is a collection of N observations (X_i^T ∈ R^m is its i-th row). We will focus on the regression problem: this means that input and output vectors consist of quantitative measurements.

Linear Models. Input: x ∈ R^m (random variable), x ∈ R^m (observation), X ∈ R^{N×m}. Output: y ∈ R^p, y ∈ R^p, Y ∈ R^{N×p}. Prediction: ŷ ∈ R^p, ŷ ∈ R^p, Ŷ ∈ R^{N×p}. Linear model (from now on p = 1): y = f(x) = x^T β, where β ∈ R^m. Prediction: ŷ = x^T β̂, where β̂ ∈ R^m is the vector of coefficients that we have to determine. Remark. If p = 1, the gradient ∇f(x) = ∂f(x)/∂x = β is a vector pointing in the steepest uphill direction.

Least Squares. Let X ∈ R^{N×m} and Y ∈ R^N be a training set of data (a collection of N pairs (x, y)). How do we choose β? First of all we have to introduce a criterion as a function of β. Let RSS(β) be the residual sum of squares RSS(β) := Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) = (Y − Xβ)^T (Y − Xβ). We search for β̂ := arg min_β RSS(β). Computing the first and second derivatives we get ∂RSS(β)/∂β = −2 X^T (Y − Xβ) and ∂²RSS(β)/∂β∂β^T = 2 X^T X.

Least Squares. If X^T X is nonsingular (i.e. X has full column rank), the unique solution is given by the normal equations ∂RSS(β)/∂β = 0, i.e. X^T (Y − Xβ) = 0, whose solution is β̂ = (X^T X)^{-1} X^T Y, and the prediction of y given a new value x is ŷ = x^T β̂. Observations: we assume that the underlying model is linear; the statistics of x and y do not play any role (it seems...).
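
As a minimal numerical sketch of the normal equations, assuming NumPy (the synthetic data, seed and variable names are illustrative, not from the slides); in practice a QR/SVD-based solver such as lstsq is preferred to forming (X^T X)^{-1} explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 100, 3
X = rng.normal(size=(N, m))                       # design matrix, one observation per row
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=N)      # noisy linear observations

# Normal equations: beta_hat = (X^T X)^{-1} X^T Y
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Numerically safer equivalent (never inverts X^T X explicitly)
beta_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)

x_new = rng.normal(size=m)
y_pred = x_new @ beta_hat                         # prediction y_hat = x^T beta_hat
```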

Least Squares, p > 1. Linear model: Y = XB + E, where X ∈ R^{N×m}, Y ∈ R^{N×p}, E ∈ R^{N×p} and B ∈ R^{m×p}. The RSS takes the form RSS(B) := trace{(Y − XB)^T (Y − XB)} and the least squares estimate of B is written in the same way: B̂ = (X^T X)^{-1} X^T Y. Multiple outputs do not affect one another's least squares estimates. If the components of the vector r.v. e are correlated, i.e. e ∼ N(0, Σ), then we can define a weighted RSS: RSS(B, Σ) := Σ_{i=1}^N (Y_i − X_i B)^T Σ^{-1} (Y_i − X_i B).

Geometric interpretation. The normal equations X^T (Y − Xβ) = 0 mean that the estimate Ŷ = X β̂ = X (X^T X)^{-1} X^T Y is the orthogonal projection of Y onto the column space of X.

Statistical interpretation. We now consider the r.v. x and y as input and output, respectively, and we seek a function f(x) for predicting y. The criterion should now deal with stochastic quantities: we introduce the expected squared prediction error EPE (strictly related to the mean squared error MSE) EPE(f) := E[(y − f(x))^T (y − f(x))] = ∫_{S_x × S_y} (y − f(x))^T (y − f(x)) p(x, y) dx dy, where we implicitly assumed that x and y have a joint PDF. EPE(f) is an L_2 loss function. Conditioning on x we can rewrite EPE(f) as EPE(f) = E_x[ E_{y|x}[ (y − f(x))^T (y − f(x)) | x ] ].

Statistical interpretation. We can determine f(·) pointwise, which means that f(x) = arg min_c E_{y|x}[ (y − c)^T (y − c) | x = x ], whose solution is f(x) = E[y | x = x], i.e. the best f(x) is the conditional mean (according to the EPE criterion). Beautiful, but given the data X, Y how can we compute the conditional expectation?!?

Statistical interpretation. Let us assume again f(x) = x^T β; then EPE(f) := E[(y − x^T β)^T (y − x^T β)]. Differentiating w.r.t. β we end up with β = (E[x x^T])^{-1} E[x y^T]. Approximating the auto- and cross-correlations with their sample versions (i.e. using the data): E[x x^T] ≈ S_xx := (1/N) Σ_{i=1}^N X_i^T X_i = (1/N) X^T X, and E[x y^T] ≈ S_xy := (1/N) Σ_{i=1}^N X_i^T Y_i = (1/N) X^T Y.

Statistical interpretation. Then we get β̂ = ((1/N) X^T X)^{-1} (1/N) X^T Y = (X^T X)^{-1} X^T Y. Again the normal equations!!! But now we can provide a statistical interpretation of β̂. Let y = x^T β + e, e ∼ N(0, σ²), be our model (p = 1); then β̂ is a Gaussian random variable, β̂ ∼ N(β, (X^T X)^{-1} σ²). In fact, since Y = Xβ + e, we have β̂ = (X^T X)^{-1} X^T Y = β + (X^T X)^{-1} X^T e, and the prediction is ŷ = x^T β̂ + e.
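
A small Monte Carlo sketch (NumPy, illustrative sizes and seed) that checks this sampling distribution empirically: for a fixed design X, the noise e is redrawn many times and the empirical mean and covariance of β̂ are compared with β and σ²(X^T X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, sigma = 200, 2, 0.5
X = rng.normal(size=(N, m))
beta = np.array([1.0, -1.0])
XtX_inv = np.linalg.inv(X.T @ X)

betas = []
for _ in range(5000):
    Y = X @ beta + sigma * rng.normal(size=N)        # redraw the noise only
    betas.append(np.linalg.solve(X.T @ X, X.T @ Y))
betas = np.array(betas)

print(betas.mean(axis=0))        # close to beta (unbiasedness)
print(np.cov(betas.T))           # close to sigma^2 * (X^T X)^{-1}
print(sigma**2 * XtX_inv)
```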

Gauss-Markov theorem. Given the linear model y = x^T β, Y = Xβ, the least squares estimator φ̂(x_0) = x_0^T β̂ of φ(x_0) = x_0^T β is unbiased because E[x_0^T β̂] = x_0^T β. Theorem. If φ̃(x_0) is any other linear unbiased estimator (E[φ̃(x_0)] = x_0^T β), then Var(φ̂(x_0)) ≤ Var(φ̃(x_0)). Remark. The mean square error of a generic estimator φ̃ (p = 1) is MSE(φ̃) = E[(φ̃ − φ)²] (*) = Var(φ̃) + (E[φ̃] − φ)², i.e. variance plus squared bias. (*) add and subtract E[φ̃].

Gauss-Markov theorem. Given the stochastic linear model y = x^T β + e, e ∼ N(0, σ²), let φ̃(x_0) be an estimator of y_0 = φ(x_0) + e_0, with φ(x_0) = x_0^T β. The expected prediction error (EPE) of φ̃(x_0) is EPE(φ̃(x_0)) = E[(y_0 − φ̃(x_0))²] = σ² + E[(x_0^T β − φ̃(x_0))²] = σ² + Var(φ̃) + (E[φ̃] − φ)², where the last two terms form the MSE.

Bias-variance trade-off: underfitting vs. overfitting.

Statistical models. Statistical model: y = f(x) + e, where e is a random error with zero mean (E[e] = 0) and is independent of x. This means that the relationship between y and x is not deterministic: f(·) captures only its deterministic part. The additive r.v. e takes care of measurement noise, model uncertainty and unmeasured variables correlated with y as well. We often assume that the random variables e are independent and identically distributed (i.i.d.).

Statistical models. Assuming a linear basis expansion for f_θ(x), parametrized by the unknowns collected in the vector θ, f_θ(x) = Σ_{k=1}^K h_k(x) θ_k, where examples of h_k(x) can be: h_k(x) = x_k, h_k(x) = (x_k)², h_k(x) = sin(x_k), h_k(x) = 1/(1 + e^{−x^T β_k}). The optimization problem to solve is θ̂ = arg min_{θ∈Θ} RSS(θ) = Σ_{i=1}^N (y_i − f_θ(x_i))², where RSS stands for Residual Sum of Squares.

Statistical models. Are there other criteria besides RSS and EPE? Yes: a more general principle for estimation is maximum likelihood estimation. Let p_θ(y) be the PDF of the samples y_1, ..., y_N. The log-probability (or log-likelihood) of the observed samples is L(θ) = Σ_{i=1}^N log p_θ(y_i). Principle of maximum likelihood: the most reasonable values for θ are those for which the probability of the observed samples is largest.

Statistical models. If the error e in the statistical model y = f_θ(x) + e is Gaussian, e ∼ N(0, σ²), then the conditional probability is p(y | x, θ) = N(f_θ(x), σ²). The log-likelihood of the data is then L(θ) = Σ_{i=1}^N log p(y_i | x_i, θ) = −(N/2) log(2π) − N log σ − (1/(2σ²)) Σ_{i=1}^N (y_i − f_θ(x_i))². Least squares for the additive error model is therefore equivalent to maximum likelihood using this conditional probability (the last sum is exactly RSS(θ)).

Penalty function and Regularization methods. Penalty function, or regularization, methods introduce our knowledge about the type of functions f(x) we are looking for: PRSS(f, λ) := RSS(f) + λ g(f), where the functional g(f) enforces our knowledge (or desiderata) on f. Example. The one-dimensional cubic smoothing spline is the solution of PRSS(f, λ) := Σ_{i=1}^N (y_i − f(x_i))² + λ ∫ [f''(s)]² ds. Remark. Penalty function methods have a Bayesian interpretation: g(f) is the log-prior distribution, PRSS(f, λ) is the log-posterior distribution, and the solution of arg min_f PRSS(f, λ) is the posterior mode.

Kernel Methods and Local Regression. If we want a local regression estimate of f(x_0), we have to solve the problem θ̂ = arg min_θ RSS(f_θ, x_0) = Σ_{i=1}^N K_λ(x_0, x_i)(y_i − f_θ(x_i))², where the kernel function K_λ(x_0, x) weights the points x around x_0. The optimal estimate is f_θ̂(x_0). An example of kernel function is the Gaussian kernel K_λ(x_0, x) = (1/λ) exp[−‖x − x_0‖²/(2λ)]. Examples of f_θ(x) are: f_θ(x) = θ_0, a constant function; f_θ(x) = θ_0 + θ_1 x, linear regression.
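
A one-dimensional sketch of both choices, assuming NumPy (data and bandwidth are illustrative). For f_θ(x) = θ_0 the weighted problem has the closed-form solution θ_0 = Σ_i K_λ(x_0, x_i) y_i / Σ_i K_λ(x_0, x_i) (the Nadaraya-Watson estimator), while for f_θ(x) = θ_0 + θ_1 x one solves weighted normal equations at each query point x_0:

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    # K_lambda(x0, x) = (1/lam) * exp(-|x - x0|^2 / (2*lam))
    return np.exp(-np.abs(x - x0) ** 2 / (2.0 * lam)) / lam

def local_constant_fit(x0, x, y, lam):
    # f_theta(x) = theta_0: weighted least squares reduces to a kernel-weighted average
    w = gaussian_kernel(x0, x, lam)
    return np.sum(w * y) / np.sum(w)

def local_linear_fit(x0, x, y, lam):
    # f_theta(x) = theta_0 + theta_1 x: solve the weighted normal equations at x0
    w = gaussian_kernel(x0, x, lam)
    H = np.column_stack([np.ones_like(x), x])
    theta = np.linalg.solve(H.T @ (w[:, None] * H), H.T @ (w * y))
    return theta[0] + theta[1] * x0

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + 0.2 * rng.normal(size=100)
print(local_constant_fit(np.pi / 2, x, y, lam=0.2),
      local_linear_fit(np.pi / 2, x, y, lam=0.2))   # both should be close to 1
```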

Basis functions. The function f can be approximated using a set of M basis functions h_m: f_θ(x) = Σ_{m=1}^M θ_m h_m(x), where θ = [θ_1 ... θ_M]^T. Examples of basis functions: radial basis functions, f_θ(x) = Σ_{m=1}^M θ_m K_{λ_m}(µ_m, x) with K_λ(µ, x) = e^{−‖x − µ‖²/(2λ)}; single-layer feed-forward neural network, f_θ(x) = Σ_{m=1}^M θ_m σ(α_m^T x + b_m) with σ(x) = 1/(1 + e^{−x}). Remark. Linear methods can then be used with nonlinear input-output transformations because the model is linear in the parameters θ.
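
Because such models are linear in θ, fitting them reduces to ordinary least squares on a transformed design matrix. A minimal sketch with radial basis features, assuming NumPy (centers, bandwidth and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, 150))
y = np.sinc(x) + 0.1 * rng.normal(size=150)

# Radial basis features h_m(x) = exp(-(x - mu_m)^2 / (2*lam))
mu = np.linspace(-3, 3, 10)
lam = 0.5
H = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * lam))   # N x M design matrix

# The model is linear in theta, so ordinary least squares applies unchanged
theta, *_ = np.linalg.lstsq(H, y, rcond=None)

def f_theta(x_new):
    h = np.exp(-(x_new - mu) ** 2 / (2 * lam))
    return h @ theta

print(f_theta(0.0))   # approximately sinc(0) = 1
```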

Subset selection The least squares estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so we sacrifice a little bit of bias to reduce the variance of the predicted values, and hence may improve the overall prediction accuracy. 25 / 66

Ridge Regression. Ridge regression shrinks the regression coefficients by imposing a penalty on their size. The coefficients β̂_ridge are obtained by solving the minimization problem β̂_ridge = arg min_β { Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) + λ Σ_{i=1}^m β_i² }, i.e. RSS(β) + λ g(β) with g(β) = β^T β and λ ≥ 0, or the equivalent constrained problem β̂_ridge = arg min_β Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) subject to Σ_{i=1}^m β_i² ≤ t. The solution is β̂_ridge = (X^T X + λI)^{-1} X^T Y.
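
A minimal sketch of the closed-form ridge solution, assuming NumPy (data and λ values are illustrative; for simplicity no intercept is used and no column is left unpenalized):

```python
import numpy as np

def ridge_fit(X, Y, lam):
    # beta_ridge = (X^T X + lam * I)^{-1} X^T Y
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

rng = np.random.default_rng(4)
N, m = 50, 10
X = rng.normal(size=(N, m))
Y = X @ np.concatenate([np.ones(3), np.zeros(m - 3)]) + 0.1 * rng.normal(size=N)

for lam in [0.0, 1.0, 100.0]:
    b = ridge_fit(X, Y, lam)
    print(lam, np.linalg.norm(b))   # the coefficient norm shrinks as lambda grows
```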

Lasso. The coefficients β̂_lasso are obtained by solving the minimization problem β̂_lasso = arg min_β { Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) + λ Σ_{i=1}^m |β_i| }, i.e. RSS(β) + λ g(β) with g(β) = Σ_i |β_i| and λ ≥ 0, or the equivalent constrained problem β̂_lasso = arg min_β Σ_{i=1}^N (Y_i − X_i β)^T (Y_i − X_i β) subject to Σ_{i=1}^m |β_i| ≤ t. There is no closed-form expression for β̂_lasso. Remark 1. Ridge regression uses an L_2 norm on β, whereas the Lasso uses the L_1 norm; this means that the Lasso solution is nonlinear in the data. Remark 2. Decreasing t forces some of the coefficients to be set to zero (exactly).
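
Since there is no closed form, the Lasso is solved numerically. The sketch below uses proximal gradient descent (ISTA) with the soft-thresholding operator; this is one standard choice, not the only one (coordinate descent or LARS are common alternatives), and all sizes and the value of λ are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=5000):
    # Minimize ||Y - X beta||^2 + lam * ||beta||_1 by proximal gradient (ISTA)
    L = 2.0 * np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (Y - X @ beta)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

rng = np.random.default_rng(5)
N, m = 80, 20
X = rng.normal(size=(N, m))
beta_true = np.zeros(m)
beta_true[:3] = [2.0, -1.5, 1.0]
Y = X @ beta_true + 0.1 * rng.normal(size=N)

beta_lasso = lasso_ista(X, Y, lam=5.0)
print(np.nonzero(np.abs(beta_lasso) > 1e-6)[0])   # most coefficients are exactly zero
```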

Ridge regression vs. Lasso. In two dimensions the Lasso constraint region is |β_1| + |β_2| ≤ t, while the ridge constraint region is β_1² + β_2² ≤ t². The red ellipses in the figure are the contours of the least squares error function.

From r.v. to stochastic processes. Definition (random variable). A random variable x: Ω → E is a measurable function from the set of possible outcomes Ω to some set E; Ω is a probability space and E is a measurable space. Roughly speaking: a random variable x is a rule for assigning to every outcome ω of an experiment a number x(ω). Definition (stochastic process). Given a probability space (Ω, F, P) and a measurable space (S, Σ), an S-valued stochastic process is a collection of S-valued random variables on Ω, indexed by a totally ordered set T ("time"). That is, a stochastic process is a collection {x_t : t ∈ T} where each x_t is an S-valued random variable on Ω. The space S is then called the state space of the process. Roughly speaking: a stochastic process x_t is a rule for assigning to every outcome ω a function x(t, ω).

From r.v. to stochastic processes. {x_t} has the following interpretations: it is a family of functions x_t(ω) when t and ω are both variables; it is a single time function (a realization of the given process) x_t(ω̄) when t is a variable and ω = ω̄ is fixed; it is a random variable x_t̄(ω) if t = t̄ is fixed and ω is variable, i.e. the state of the process at time t̄; it is a number if both t and ω are fixed. If T = R, {x_t} is a continuous-time process; if T = Z, {x_k} is a discrete-time process. Even though the dynamics of the system is described by ODEs, in the following we will consider discrete-time processes because the sensing system provides measurements at discrete time instants. Remark. The r.v. x_k(ω) can be continuous-valued even if k ∈ T = Z.

Gaussian filter

State estimation. Given the measurements y_k, k = 0, 1, ..., related to an unknown state variable x_k, we are interested in the following estimation problems:
- Filtering: y_0, y_1, ..., y_k → x̂_{k|k}
- h-step ahead prediction: y_0, y_1, ..., y_k → x̂_{k+h|k}
- h-step backward smoothing: y_0, y_1, ..., y_k → x̂_{k−h|k}
- Smoothing: y_0, y_1, ..., y_N → x̂_{k|N}

Gaussian filters. Gaussian filters assume that the underlying phenomena can be modeled by Gaussian distributions. This assumption makes it possible to solve the general Bayes filter formulation in a recursive way. Why are Gaussian distributions so convenient? Gaussians are unimodal: they have a single maximum. The statistics (mean, variance and higher-order moments) of a Gaussian are determined by two parameters: its mean and variance. A linear combination of (jointly) Gaussian random variables is still Gaussian.

Gaussian Random Variables. Definition (Gaussian r.v.). An n-dimensional random variable X is Gaussian with mean µ ∈ R^n and variance Σ ∈ R^{n×n}, Σ = Σ^T > 0, written X ∼ N(µ, Σ), if its probability density function (PDF) is p(x) = (1/√((2π)^n det Σ)) e^{−(1/2)(x − µ)^T Σ^{-1} (x − µ)}. This means that µ = E[X] and Σ = Var(X) = E[(X − E[X])(X − E[X])^T].

Gaussian r.v. Theorem (Joint Gaussian r.v.). Let x ∈ R^n and y ∈ R^m be jointly Gaussian, p(x, y) ∼ N([µ_x; µ_y], [Σ_xx, Σ_xy; Σ_yx, Σ_yy]). Then the r.v. z = Ax + By is still Gaussian, i.e. z ∼ N(µ_z, Σ_z), where µ_z = E[Ax + By] = Aµ_x + Bµ_y and Σ_z = E[(Ax + By − Aµ_x − Bµ_y)(Ax + By − Aµ_x − Bµ_y)^T] = [A B] E[[x − µ_x; y − µ_y][x − µ_x; y − µ_y]^T] [A B]^T = [A B] [Σ_xx, Σ_xy; Σ_yx, Σ_yy] [A B]^T.

Gaussian r.v. Theorem (conditioning). The Gaussian random variable x conditioned on the Gaussian random variable y is still a Gaussian random variable. The PDF of x given y is p(x | y) = N(µ_{x|y}, Σ_{x|y}) (1), where µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y) (2) and Σ_{x|y} = Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx (3), provided Σ_yy > 0.
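
A direct NumPy transcription of (2)-(3), with a small illustrative example (all numbers are made up):

```python
import numpy as np

def condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y):
    # mu_{x|y} = mu_x + S_xy S_yy^{-1} (y - mu_y)
    # S_{x|y}  = S_xx - S_xy S_yy^{-1} S_yx
    K = S_xy @ np.linalg.inv(S_yy)
    return mu_x + K @ (y - mu_y), S_xx - K @ S_xy.T

# Example: scalar x and y, positively correlated
mu_x, mu_y = np.array([0.0]), np.array([0.0])
S_xx = np.array([[1.0]]); S_yy = np.array([[2.0]]); S_xy = np.array([[0.8]])
m, S = condition_gaussian(mu_x, mu_y, S_xx, S_xy, S_yy, y=np.array([1.0]))
print(m, S)   # mean 0.4, variance 1 - 0.8^2/2 = 0.68
```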

Optimal Estimator. Theorem (Minimum Variance Estimator). Let x ∈ R^n, y ∈ R^m be two r.v. (not necessarily Gaussian), and g: R^m → R^n a measurable function. We define x̂_g = g(y) as the estimator of x given y through the function g, and e_g = x − g(y) = x − x̂_g as the corresponding estimation error. The estimator x̂ = E[x | y] = ĝ(y) is optimal because it minimizes the error variance, i.e. Var(e) = E[(x − x̂)(x − x̂)^T] ≤ E[(x − x̂_g)(x − x̂_g)^T] = Var(e_g) for every g(·), where e = x − x̂ is the error of the optimal estimator. Moreover, the error of the optimal estimator and the estimate are uncorrelated: E[e ĝ(y)^T] = 0.

Stochastic model

LTI Stochastic model. We focus now on the state-space representation of a generic Linear Time-Invariant (LTI) stochastic model: x_{k+1} = A x_k + w_k, y_k = C x_k + v_k, where v_k ∼ N(0, R) with E[v_k v_h^T] = R δ(k − h), w_k ∼ N(0, Q) with E[w_k w_h^T] = Q δ(k − h), x_0 ∼ N(x̄_0, P_0), and v_k, w_k, x_0 are uncorrelated zero-mean Gaussian r.v.: E[v_k w_h^T] = 0, E[x_0 v_k^T] = 0, E[x_0 w_k^T] = 0. The state-space model is a way to describe the dynamical evolution of a stochastic process.
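
A minimal NumPy sketch that draws one realization of this model; the matrices A, C, Q, R and the horizon are illustrative placeholders, not values from the slides:

```python
import numpy as np

def simulate_lti(A, C, Q, R, x0_mean, P0, n_steps, rng):
    # Draw one realization of x_{k+1} = A x_k + w_k, y_k = C x_k + v_k
    n = A.shape[0]
    x = rng.multivariate_normal(x0_mean, P0)
    xs, ys = [], []
    for _ in range(n_steps):
        y = C @ x + rng.multivariate_normal(np.zeros(C.shape[0]), R)
        xs.append(x); ys.append(y)
        x = A @ x + rng.multivariate_normal(np.zeros(n), Q)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(6)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.1]])
xs, ys = simulate_lti(A, C, Q, R, np.zeros(2), np.eye(2), 200, rng)
```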

LTI Stochastic model. From the evolution of the state and of the output starting at time k_0 we also have x_k = A^{k−k_0} x_{k_0} + Σ_{i=k_0}^{k−1} A^{k−i−1} w_i and y_k = C A^{k−k_0} x_{k_0} + Σ_{i=k_0}^{k−1} C A^{k−i−1} w_i + v_k, and therefore E[x_k w_h^T] = 0 for h ≥ k, E[x_k v_h^T] = 0, E[y_k v_h^T] = R δ(k − h).

Kalman filtering

Kalman filter. The Kalman filter (or minimum variance filter) computes x̂_{k+1|k+1} = E[x_{k+1} | y_0, ..., y_{k+1}] = E[x_{k+1} | y_{k+1}, Y^k] (4), where Y^k = (y_k, ..., y_1, y_0). Goal: we need a recursive expression for x̂_{k+1|k+1} that does not require recomputing µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y) (5) from the whole data batch at every time instant k, i.e. whenever a new measurement is available. The explicit expression for E[X | Y] is easy to derive from (5) if X and Y are jointly Gaussian with means µ_X, µ_Y and covariance [Σ_XX, Σ_XY; Σ_YX, Σ_YY].

Kalman filter. To rewrite x̂_{k+1|k+1} = E[x_{k+1} | y_0, ..., y_{k+1}] = E[x_{k+1} | y_{k+1}, Y^k] in the form E[X | Y] we introduce the conditional random variables X = x_{k+1} | Y^k and Y = y_{k+1} | Y^k, and compute the following means, variances and covariances: µ_X = E[x_{k+1} | Y^k], µ_Y = E[y_{k+1} | Y^k], P_{k+1|k} = Σ_XX = Var[x_{k+1} | Y^k], Σ_YY = Var[y_{k+1} | Y^k], Σ_XY = Σ_YX^T = Cov[x_{k+1}, y_{k+1} | Y^k].

Kalman filter. The optimal estimator is given by E[X | Y] = x̂_{k+1|k+1} = x̂_{k+1|k} + Σ_XY Σ_YY^{-1} (y_{k+1} − ŷ_{k+1|k}) (6), and the variance of the estimation error is Σ_{X|Y} = P_{k+1|k+1} = Σ_XX − Σ_XY Σ_YY^{-1} Σ_YX (7).

Conditional means. Mean µ_X: µ_X = E[x_{k+1} | Y^k] = E[A x_k + w_k | Y^k] = A E[x_k | Y^k] + E[w_k | Y^k] = A x̂_{k|k} = x̂_{k+1|k}. Mean µ_Y: µ_Y = E[y_{k+1} | Y^k] = E[C x_{k+1} + v_{k+1} | Y^k] = C E[x_{k+1} | Y^k] + E[v_{k+1} | Y^k] = C x̂_{k+1|k} = C A x̂_{k|k}.

Conditional variances. Variance Σ_XX: Σ_XX = Var[x_{k+1} | Y^k] = E[(x_{k+1} − x̂_{k+1|k})(x_{k+1} − x̂_{k+1|k})^T | Y^k] = E[(A x_k + w_k − A x̂_{k|k})(A x_k + w_k − A x̂_{k|k})^T | Y^k] = A E[(x_k − x̂_{k|k})(x_k − x̂_{k|k})^T | Y^k] A^T + A E[(x_k − x̂_{k|k}) w_k^T | Y^k] + E[w_k (x_k − x̂_{k|k})^T | Y^k] A^T + E[w_k w_k^T | Y^k] = A P_{k|k} A^T + Q = P_{k+1|k}.

Conditional variances. Variance Σ_YY: Σ_YY = Var[y_{k+1} | Y^k] = E[(y_{k+1} − ŷ_{k+1|k})(y_{k+1} − ŷ_{k+1|k})^T | Y^k] = E[(C x_{k+1} + v_{k+1} − C x̂_{k+1|k})(C x_{k+1} + v_{k+1} − C x̂_{k+1|k})^T | Y^k] = C E[(x_{k+1} − x̂_{k+1|k})(x_{k+1} − x̂_{k+1|k})^T | Y^k] C^T + C E[(x_{k+1} − x̂_{k+1|k}) v_{k+1}^T | Y^k] + E[v_{k+1} (x_{k+1} − x̂_{k+1|k})^T | Y^k] C^T + E[v_{k+1} v_{k+1}^T | Y^k] = C P_{k+1|k} C^T + R.

Conditional variances. Covariance Σ_XY = Σ_YX^T: Σ_XY = Cov[x_{k+1}, y_{k+1} | Y^k] = E[(x_{k+1} − x̂_{k+1|k})(y_{k+1} − ŷ_{k+1|k})^T | Y^k] = E[(A x_k − A x̂_{k|k} + w_k)(C A x_k − C A x̂_{k|k} + v_{k+1} + C w_k)^T | Y^k] = A E[(x_k − x̂_{k|k})(x_k − x̂_{k|k})^T | Y^k] A^T C^T + E[w_k w_k^T | Y^k] C^T = A P_{k|k} A^T C^T + Q C^T = P_{k+1|k} C^T.

Summing up. The random variable z = [x_{k+1}; y_{k+1}] conditioned on Y^k has the following PDF: p(z | Y^k) ∼ N([x̂_{k+1|k}; C x̂_{k+1|k}], [P_{k+1|k}, P_{k+1|k} C^T; C P_{k+1|k}, C P_{k+1|k} C^T + R]), with x̂_{k+1|k} = A x̂_{k|k} and P_{k+1|k} = A P_{k|k} A^T + Q.

Recursive formulation. The last step is to compute p(x_{k+1} | Y^{k+1}) ∼ N(x̂_{k+1|k+1}, P_{k+1|k+1}), where the mean x̂_{k+1|k+1} is the optimal estimate we are looking for and P_{k+1|k+1} is the variance of the corresponding estimation error. Substituting the previous expressions we end up with x̂_{k+1|k+1} = x̂_{k+1|k} + P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} (y_{k+1} − C x̂_{k+1|k}). The Kalman gain is the matrix K_{k+1} = P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} mapping the output estimation error into the correction of the predicted state: x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (y_{k+1} − C x̂_{k+1|k}). The variance of the estimation error is P_{k+1|k+1} = P_{k+1|k} − P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} C P_{k+1|k}.

Recursive formulation. Prediction step / a priori estimate: x̂_{k+1|k} = A x̂_{k|k}, P_{k+1|k} = A P_{k|k} A^T + Q. Estimation step / a posteriori estimate: x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (y_{k+1} − C x̂_{k+1|k}), P_{k+1|k+1} = P_{k+1|k} − P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1} C P_{k+1|k}. Initial conditions: x̂_{0|−1} = x̄_0, P_{0|−1} = P_0.
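
A compact NumPy sketch of this predict/update recursion (function and variable names are illustrative); it starts from the prior x̂_{0|-1}, P_{0|-1} and, for each measurement, performs the estimation step followed by the prediction step:

```python
import numpy as np

def kalman_filter(A, C, Q, R, x0_mean, P0, ys):
    # Recursive minimum-variance estimation for x_{k+1} = A x_k + w_k, y_k = C x_k + v_k
    x_hat, P = x0_mean.copy(), P0.copy()      # initial conditions x_{0|-1}, P_{0|-1}
    estimates, covariances = [], []
    for y in ys:                              # ys: iterable of measurement vectors
        # Estimation step (a posteriori): incorporate measurement y_k
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)        # Kalman gain K_k
        x_hat = x_hat + K @ (y - C @ x_hat)
        P = P - K @ C @ P
        estimates.append(x_hat.copy()); covariances.append(P.copy())
        # Prediction step (a priori): propagate to k+1
        x_hat = A @ x_hat
        P = A @ P @ A.T + Q
    return np.array(estimates), np.array(covariances)

# Example (reusing A, C, Q, R and the ys simulated in the earlier sketch):
# x_est, P_est = kalman_filter(A, C, Q, R, np.zeros(2), np.eye(2), ys)
```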

Kalman filter/predictor. Kalman filter: x̂_{k+1|k+1} = A x̂_{k|k} + K_{k+1} (y_{k+1} − C A x̂_{k|k}), P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q, with K_{k+1} = P_{k+1|k} C^T (C P_{k+1|k} C^T + R)^{-1}. The matrix recursive equation P_{k+1|k} = ... is called the Riccati equation. Kalman predictor: x̂_{k+1|k} = A x̂_{k|k−1} + K_k (y_k − C x̂_{k|k−1}), P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q, with K_k = A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1}.

Kalman filter/predictor. Observations: 1. All the information in y(i), i ∈ [0, k−1], is contained in the estimated state x̂_{k−1|k−1}: the following conditional expectations are equal, E[x_k | y_k, y_{k−1}, ..., y_0] = E[x_k | x̂_{k−1|k−1}, y_k]. 2. The optimal gain K_k is time-varying even if the stochastic model is LTI. 3. There is a more general formulation of the Kalman filter where w_k and v_k are correlated. 4. The same recursive equations of the Kalman filter can be used with linear time-varying stochastic systems.

Steady-state Kalman filter. What happens as k → ∞? Does the estimation error variance converge to its minimal steady-state value? Theorem. Given the stochastic LTI model x_{k+1} = A x_k + w_k, y_k = C x_k + v_k (8), with v_k ∼ N(0, R), E[v_k v_h^T] = R δ(k − h), w_k ∼ N(0, Q), E[w_k w_h^T] = Q δ(k − h), x_0 ∼ N(x̄_0, P_0) (9), where v_k, w_k, x_0 are zero-mean uncorrelated Gaussian random variables.

Steady-state Kalman filter. Theorem (continued). Then: 1. The Algebraic Riccati Equation (ARE) P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q has a unique positive definite symmetric solution P̄ = P̄^T > 0. 2. P̄ is stabilizing, i.e. (A − K̄ C) is asymptotically stable with K̄ = A P̄ C^T (C P̄ C^T + R)^{-1}. 3. lim_{k→∞} P_{k|k−1} = P̄ holds for all initial conditions P_{0|−1} = P_0 = P_0^T ≥ 0, if and only if (A, C) is detectable and (A, Q^{1/2}) is stabilizable.

Steady-state Kalman filter. Kalman filter (LTI): x̂_{k+1|k+1} = A x̂_{k|k} + K̄ (y_{k+1} − C A x̂_{k|k}), P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q, with K̄ = P̄ C^T (C P̄ C^T + R)^{-1}. Kalman predictor (LTI): x̂_{k+1|k} = A x̂_{k|k−1} + K̄ (y_k − C x̂_{k|k−1}), P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q, with K̄ = A P̄ C^T (C P̄ C^T + R)^{-1}.
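
A simple way to obtain P̄ and the steady-state gain numerically is to iterate the Riccati recursion to a fixed point, as sketched below (NumPy, illustrative model); scipy.linalg.solve_discrete_are(A.T, C.T, Q, R) should return the same P̄ via the usual control/estimation duality:

```python
import numpy as np

def steady_state_riccati(A, C, Q, R, n_iter=1000, tol=1e-10):
    # Iterate P <- A P A^T - A P C^T (C P C^T + R)^{-1} C P A^T + Q to a fixed point
    P = Q.copy()
    for _ in range(n_iter):
        S = C @ P @ C.T + R
        P_new = A @ P @ A.T - A @ P @ C.T @ np.linalg.inv(S) @ C @ P @ A.T + Q
        if np.max(np.abs(P_new - P)) < tol:
            P = P_new
            break
        P = P_new
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)   # steady-state filter gain
    return P, K

A = np.array([[1.0, 0.1], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2); R = np.array([[0.1]])
P_inf, K_inf = steady_state_riccati(A, C, Q, R)
```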

Kalman smoother

Kalman Smoother. Model {A, C, Q, R}: x_{k+1} = A x_k + w_k, y_k = C x_k + v_k. Data: a sequence of N samples of the output, y_0, y_1, ..., y_N. STEP 1: standard Kalman filtering (forward step): x̂^f_{k+1|k+1} = A x̂^f_{k|k} + K_{k+1} (y_{k+1} − C A x̂^f_{k|k}), x̂^f_{0|0} = x̄_0, P^f_{k|k} = ..., P^f_{0|0} = P_0.

Kalman Smoother. STEP 2: smoothing (backward step): x̂^s_{k|N} = x̂^f_{k|k} + K̄_k [x̂^s_{k+1|N} − x̂^f_{k+1|k}], x̂^s_{N|N} = x̂^f_{N|N}, where the conditional covariance matrix P_{k|N} satisfies the time-backward matrix equation P_{k|N} = P^f_{k|k} + K̄_k [P_{k+1|N} − P^f_{k+1|k}] K̄_k^T, P_{N|N} = P^f_{N|N}, with K̄_k = P^f_{k|k} A^T (P^f_{k+1|k})^{-1}.
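
A NumPy sketch of the two passes (forward Kalman filter, backward Rauch-Tung-Striebel recursion); names are illustrative, and the filtered and one-step-predicted quantities are stored during the forward pass because the backward pass needs them:

```python
import numpy as np

def kalman_smoother(A, C, Q, R, x0_mean, P0, ys):
    # Forward pass: standard Kalman filter, storing filtered and predicted quantities
    x_hat, P = x0_mean.copy(), P0.copy()
    xf, Pf, xp, Pp = [], [], [], []               # filtered / one-step-predicted
    for y in ys:
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        x_hat = x_hat + K @ (y - C @ x_hat)
        P = P - K @ C @ P
        xf.append(x_hat.copy()); Pf.append(P.copy())
        x_hat, P = A @ x_hat, A @ P @ A.T + Q     # prediction x_{k+1|k}, P_{k+1|k}
        xp.append(x_hat.copy()); Pp.append(P.copy())
    # Backward pass (Rauch-Tung-Striebel recursion)
    N = len(ys)
    xs, Ps = [None] * N, [None] * N
    xs[-1], Ps[-1] = xf[-1], Pf[-1]
    for k in range(N - 2, -1, -1):
        G = Pf[k] @ A.T @ np.linalg.inv(Pp[k])            # smoother gain K_bar_k
        xs[k] = xf[k] + G @ (xs[k + 1] - xp[k])
        Ps[k] = Pf[k] + G @ (Ps[k + 1] - Pp[k]) @ G.T
    return np.array(xs), np.array(Ps)
```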

Problem: speed estimation. Given the measurement equation y(t) = s(t) + v(t), where s(t) is the signal we are interested in (e.g. an angular position), y(t) is the measurement given by a sensor (e.g. an encoder), and v(t) is the additive measurement noise, we can face different kinds of estimation problems:
- Filtering: determine the best estimate ŝ(t) of s(t) based on the measurements y(·) up to time t (i.e. y(0), y(1), ..., y(t))
- Prediction: determine the best estimate ŝ(t + h) of s(t + h), h ≥ 1, based on the measurements y(·) up to time t
- Smoothing: determine the best estimate ŝ(t − h) of s(t − h), h ≥ 1, based on the measurements y(·) up to time t

Problem: speed estimation. Let θ and ω be the angular position and velocity, respectively. Knowing nothing about the physical model that produces the signal s(t), we set the derivative of the velocity equal to a white noise. A stochastic process n is called white noise if its values n(t_i) and n(t_j) are uncorrelated for every i ≠ j, i.e. Corr{n(t_i), n(t_j)} = Q(t_i) δ(t_i − t_j). We also assume that n(t) is Gaussian with zero mean and constant variance Q ∈ R for all t. Kinematic model: θ̇(t) = ω(t), ω̇(t) = n(t). Measurement equation: y(t) = θ(t) + v(t), where v is another white noise and v(t) is Gaussian with zero mean and constant variance R ∈ R.

Problem: speed estimation. Kinematic model: θ̇(t) = ω(t), ω̇(t) = n(t). Measurement equation: y(t) = θ(t) + v(t). Remark. The process n takes into account the uncertainty on the model of the system, whereas v models the measurement noise superimposed on the real value θ. Let x(t) be the state vector x(t) := [θ(t); ω(t)].

Problem: speed estimation. The continuous-time state-space model of our basic system is ẋ(t) = [0, 1; 0, 0] x(t) + [0; 1] n(t), y(t) = [1 0] x(t) + v(t). Its discrete-time approximation with sample time T_s is given by x_{k+1} = [1, T_s; 0, 1] x_k + [T_s²/2; T_s] n_k, y_k = [1 0] x_k + v_k.

Problem: speed estimation. The relationship between the discrete-time model x_{k+1} = [1, T_s; 0, 1] x_k + [T_s²/2; T_s] n_k, y_k = [1 0] x_k + v_k and the generic stochastic model x_{k+1} = A x_k + w_k, y_k = C x_k + v_k is: A := [1, T_s; 0, 1], C := [1 0], w_k := [T_s²/2; T_s] n_k ∼ N(0, [T_s²/2; T_s] [T_s²/2; T_s]^T Q), v_k := v_k. Tuning of the filter: the variance R depends on the encoder resolution (we can read it on the datasheet), whereas the matrix Q is chosen by the designer to try to explain the measurements in the best way.
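
Putting the example together: a sketch that builds the constant-velocity model, simulates noisy encoder readings and runs the filter of the previous slides to estimate the velocity; the sample time, noise variances and trajectory are illustrative, not values from the lecture:

```python
import numpy as np

# Discrete-time constant-velocity model for [theta; omega] with sample time Ts
Ts = 0.001                        # illustrative sample time
A = np.array([[1.0, Ts], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
G = np.array([[Ts**2 / 2.0], [Ts]])

R = np.array([[1e-6]])            # encoder noise variance (from the datasheet, in practice)
q = 10.0                          # scalar tuning knob for the model uncertainty
Q = q * (G @ G.T)                 # process noise covariance G G^T q

# Synthetic trajectory and noisy encoder readings
rng = np.random.default_rng(7)
n_steps = 2000
t = Ts * np.arange(n_steps)
theta = np.sin(2 * np.pi * 1.0 * t)                  # true angular position
y = theta + np.sqrt(R[0, 0]) * rng.normal(size=n_steps)

# Kalman filter: estimation step followed by prediction step, as in the slides
x_hat, P = np.zeros(2), np.eye(2)
omega_hat = []
for yk in y:
    S = C @ P @ C.T + R
    K = P @ C.T @ np.linalg.inv(S)
    x_hat = x_hat + K @ (np.atleast_1d(yk) - C @ x_hat)
    P = P - K @ C @ P
    omega_hat.append(x_hat[1])                       # estimated angular velocity
    x_hat, P = A @ x_hat, A @ P @ A.T + Q
```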

Model with inputs. If an input command u_k enters the stochastic model, x_{k+1} = A x_k + B u_k + w_k, y_k = C x_k + v_k, how do the filter equations change? Fortunately, if u_k is a function of past measurements (e.g. u_k = f(y_{0:k})), then we can simply add the term B u_k to the recursive equations: x̂_{k+1|k} = A x̂_{k|k−1} + B u_k + K_k (y_k − C x̂_{k|k−1}), P_{k+1|k} = A P_{k|k−1} A^T − A P_{k|k−1} C^T (C P_{k|k−1} C^T + R)^{-1} C P_{k|k−1} A^T + Q, or, in steady state, x̂_{k+1|k} = A x̂_{k|k−1} + B u_k + K̄ (y_k − C x̂_{k|k−1}), P̄ = A P̄ A^T − A P̄ C^T (C P̄ C^T + R)^{-1} C P̄ A^T + Q.
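
The modification is a single extra term in the prediction. A sketch of one step of the predictor form with input (NumPy, hypothetical helper name):

```python
import numpy as np

def kf_step_with_input(A, B, C, Q, R, x_pred, P_pred, y_k, u_k):
    # One step of the predictor form with input: the only change is the +B u_k term
    S = C @ P_pred @ C.T + R
    K = A @ P_pred @ C.T @ np.linalg.inv(S)                  # predictor gain K_k
    x_next = A @ x_pred + B @ u_k + K @ (y_k - C @ x_pred)   # x_{k+1|k}
    P_next = A @ P_pred @ A.T - K @ C @ P_pred @ A.T + Q     # Riccati recursion
    return x_next, P_next
```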

Important remark. In the book Probabilistic Robotics the Kalman equations are obtained following a different approach (using first and second derivatives of suitable quadratic functions). Another difference is that they start with the linear Gaussian system x_k = A x_{k−1} + B u_k + w_k, which is the same as x_{k+1} = A x_k + B u_{k+1} + w_{k+1}. Observations: using w_{k+1} instead of w_k does not change anything because w is white noise; on the other hand, it would make a big difference to use B u_{k+1} instead of B u_k as we did: for this reason the authors of the book introduce the assumption that the control input u is a random process independent of the state and the measurement.