A Theoretical Overview on Kalman Filtering


Constantinos Mavroeidis, Vanier College. Presented to professors: IVANOV T. IVAN, STAHN CHRISTIAN. June 6, 2018.

Abstract

Kalman filtering is a state estimation algorithm used in tracking and data prediction tasks. Common applications of the Kalman filter include machine navigation, localization systems of autonomous vehicles [8], statistical signal processing, prediction of future network congestion [1], spacecraft orbit calculation and trajectory optimization [7], econometrics [2], and biometrics [3]. The standard Kalman filter relies on a linear dynamical system with discretized time. It is most commonly used in GPS positioning and computer system analysis, and has important applications in network congestion prediction [1]. Many extensions of the algorithm have been developed for non-linear systems and for systems that do not follow the assumptions of the standard Kalman filter; the Extended Kalman filter [9] and the Unscented Kalman filter [10] were developed to handle the non-linear properties of certain systems. In this paper, we focus on a linear underlying system with normally distributed noise, which is the most common case in simple applications. We first introduce the statistical background necessary to understand the theoretical concepts behind Kalman filtering, assuming the reader has a general understanding of probability theory, linear dynamical systems, and basic calculus. We then illustrate the essence of these theoretical results using a basic state estimation example for a linear system. The state observation and estimation models developed in this paper can be used in other, similar applications with a strictly linear underlying dynamical system.

1 Introduction

Kalman filtering is an optimal state estimation process proposed in 1960 by Rudolf E. Kalman as a solution to the filtering problem for linear, discrete-time systems [1-4]. It is widely used in engineering, most commonly in spacecraft navigation and trajectory optimization for robots. It is a useful tool because of its recursive structure and small memory requirements, and it gives optimal state estimates even when part of the system is hidden. In applications, part of a system's state is usually not observable, and the best estimate of the hidden portion of the state can be determined using the Kalman filter. A variation of the standard algorithm, now known as the Extended Kalman filter (EKF) [5-6], was initially used in the aerospace industry, more specifically for the non-linear trajectory computed by the Apollo navigation computer during the Apollo program [7]. The EKF linearizes the non-linear system and forces the system to work with the linearized version of the model. Applications of the EKF include localization systems for autonomous vehicles [8]. Despite its ability to linearize non-linear models, the math behind the EKF can be rather complex, making it difficult to implement such an algorithm efficiently [9]. Another extension of the Kalman filter, the Unscented Kalman filter (UKF) [10], was therefore developed to overcome these difficulties using non-linear estimation techniques that have been shown to be more effective than the EKF in many applications, including vehicle navigation [9].
Despite the wide range of applications of these extensions of the standard Kalman filter, many dynamical systems can be modeled linearly, including position and motion estimation of a road vehicle. Obtaining a highly accurate estimate of the vehicle's actual position through the algorithm is crucial in developing trustworthy GPS systems. In this paper, we introduce and expand on the fundamental ideas behind the Kalman filter and how it can be used to obtain the optimal state estimate of a linear system with Gaussian noise. Kalman filtering is an algorithm that combines imprecise estimates of some true value, whether observable or not, and yields the most precise estimate of that value. Informally, this can be illustrated through the problem of choosing the most trustworthy value out of two proposed values. For instance, we can measure the temperature across a resistor's terminals using different types of thermometers. For now, we use the word independent to indicate that the value displayed by each thermometer does not affect the value displayed by another. Suppose we obtain different results from different

thermometers. How does one decide which is the most accurate temperature across the resistor's terminals? Rather than choosing the value displayed by one thermometer, we can combine all the displayed values using a linear estimator, allocating a weight to each thermometer's displayed value based on how trustworthy that thermometer is. This idea informally defines confidence in an estimate. In the simple case of only two thermometers, we might be inclined to take the average of the two temperatures. If thermometer 1 displays a temperature estimate x_1 and thermometer 2 displays an estimate x_2, we can combine both estimates using the formula 0.5x_1 + 0.5x_2, where each estimate is given equal weight. If we have additional information on the quality of each thermometer, more specifically if one is newer or more appropriate for this type of measurement, we may have more confidence in the results displayed by that thermometer. We can generally consider a linear combination of the two estimates of the form (1 − α)x_1 + αx_2, where 0 ≤ α ≤ 1. Intuitively, the more confidence we have in the second estimate x_2, the closer α should be to 1, at which point x_1 no longer contributes to the combination. The expression (1 − α)x_1 + αx_2 is an example of a linear estimator. In this paper we will use similar linear estimators for both scalar and vector estimates. The Kalman filter is an algorithm that helps us determine the optimal value of α. We will conclude that the weight given to an estimate should be proportional to the confidence we have in that estimate.

Section 2 describes the model used in Kalman filtering and develops scalar and vector estimates and confidence in estimates. A scalar estimate is a sample drawn from a certain distribution with mean µ and variance σ² as its parameters; a vector estimate is a sample drawn from a distribution with mean vector µ and covariance matrix Σ. We quantify confidence in estimates using the variances and covariances of these distributions. Sections 3-5 develop the two most important statistical ideas behind the Kalman filter.

1. Section 3 shows that scalar estimates should be fused by minimizing the variance of the fused estimate. It also shows that fusing more than two estimates can be done by fusing two estimates at a time recursively, without any loss of quality in the final estimate. Section 4 extends these results from scalar estimates to vector estimates, where the entries of a vector estimate are scalar samples drawn from distinct distributions. Extending the fusion of estimates from the scalar to the vector case might seem difficult because of the complexity of the equations, but the main difference is that variances are simply replaced by covariance matrices.

2. In applications where the estimates are vectors, only part of the vector may be directly observable. For instance, if the state consists of position and velocity, the position of an aircraft might not be observable, so Section 5 introduces the Best Linear Unbiased Estimator (BLUE). We use the BLUE to obtain the optimal estimate for the hidden portion of the state that is not directly measured.

Section 6 shows how these results can be used to obtain optimal state estimates for linear systems. We first consider the problem of state estimation where the entire state is observable, using the results of Sections 3 and 4, and then consider the more complex case of a partially observable state, which requires the BLUE estimator introduced in Section 5.
Finally, we illustrate the application of the theoretical results obtained in Sections 3-5 using an example of state estimation for a road vehicle with partial observability of the vehicle's state.

2 SCALAR AND VECTOR ESTIMATES

In this section we further explore scalar and vector estimates, and use basic probability theory to introduce confidence in an estimate.

2.1 Scalar estimates

We use the hypothetical example proposed in Section 1 to develop the basic principles of scalar estimates and the confidence assigned to them. An estimate drawn from thermometer i corresponds to a sample temperature drawn from distribution p_i. Since temperature is a continuous scalar, each distribution p_i assigns a probability density to every possible temperature interval; the probability of obtaining any one exact temperature value is 0, since no thermometer can record exact values of temperature. The i-th estimate is obtained from the i-th distribution as shown in Figure 1, where the x-axis corresponds to all possible temperature values in °C and the y-axis represents the probability density. In most applications, the estimates used in the Kalman filtering process are assumed to be drawn from Gaussian distributions, but for the sake of generality we only assume that the mean µ_i and the variance σ_i² of each distribution p_i are known. Throughout the paper, we write x_i : p_i(µ_i, σ_i²) to denote that x_i is a random sample with mean µ_i and variance σ_i² drawn from distribution p_i. For complicated expressions containing variance parameters, we will use the reciprocal of the variance, called the precision of a distribution, to simplify the equations.

The confidence in the temperature estimate made by each thermometer is determined by the variance of its corresponding distribution. We now quantify the concept of confidence in the context of the example. When measuring the temperature across resistor terminals, a resistance temperature detector (RTD) gives a highly accurate temperature

estimate compared to a thermocouple when dealing with temperatures below 600 °C [4]. Therefore, the distribution corresponding to the RTD has a smaller variance than the distribution used to model the thermocouple's temperature estimate, and we should be more confident in temperature samples drawn from the RTD distribution.

Figure 1: Distributions
Figure 2: Precision vs. accuracy

The estimates drawn from the distributions may seem ambiguous because we are unsure of what the actual temperature across the resistor is and how close the estimates are to the real value. The approach used to model confidence in these estimates does not say how close the estimated temperatures are to the actual temperature across the terminals; instead we rely on how close the estimates are to each other. This requires a distinction between precision and accuracy. Accuracy measures how close an estimate is to the true value, usually called the ground truth, while precision evaluates how close the estimates are to each other, with no reference to the ground truth. Figure 2 makes the distinction between the two using darts as an example: the bullseye represents the ground truth value, and the highly rewarding points around the bullseye represent accurate estimates of the true value. In our context, the real value of the temperature across the resistor terminals is unknown due to the continuous nature of such a scalar quantity, so confidence in a sample is assigned based on the values obtained from the other estimated temperatures. We use the same approach in this paper, and evaluate confidence in estimates based on precision.

An important assumption was made in the introduction of the thermometer example: the estimates x_1, x_2, ..., x_n are assumed to be uncorrelated. This is intuitive, since the result observed by one thermometer is not affected by the other thermometers recording the temperature at the same time. Uncorrelated scalar estimates x_i are formally defined by E[(x_i − µ_i)(x_j − µ_j)] = 0 for i ≠ j. Even though dependence and correlation of random variables are related, there is an important distinction between the two: correlation only captures the linear relationship between two random variables. Two random variables can be uncorrelated and dependent, meaning that they are simply not linearly related. The difference between correlation and dependence is further discussed in Appendix 8.1.

LEMMA 2.1. Let x_1 : p_1(µ_1, σ_1²), ..., x_n : p_n(µ_n, σ_n²) be a set of pairwise uncorrelated random variables, and let y be a linear combination of the x_i, y = Σ_{i=1}^{n} α_i x_i. The mean and variance of y are:

µ_y = Σ_{i=1}^{n} α_i µ_i    (1)

σ_y² = Σ_{i=1}^{n} α_i² σ_i²    (2)

PROOF. Equation 1 follows from the linearity of expectation: µ_y = E[y] = E[Σ_{i=1}^{n} α_i x_i] = Σ_{i=1}^{n} α_i µ_i. Equation 2 follows from the fact that the estimates are pairwise uncorrelated:

σ_y² = E[(y − µ_y)²] = E[(Σ_{i=1}^{n} α_i x_i − Σ_{i=1}^{n} α_i µ_i)(Σ_{j=1}^{n} α_j x_j − Σ_{j=1}^{n} α_j µ_j)]
     = E[(Σ_{i=1}^{n} α_i (x_i − µ_i))(Σ_{j=1}^{n} α_j (x_j − µ_j))]
     = Σ_{i=1}^{n} Σ_{j=1}^{n} E[α_i α_j (x_i − µ_i)(x_j − µ_j)]

Because the x_i are pairwise uncorrelated, E[(x_i − µ_i)(x_j − µ_j)] = 0 for i ≠ j, so the result follows: σ_y² = Σ_{i=1}^{n} α_i² E[(x_i − µ_i)²] = Σ_{i=1}^{n} α_i² σ_i².

2.2 Vector estimates

Instead of dealing with a single random variable, we construct a column matrix with a random variable in each entry, usually called a random vector. For instance, random variables such as the position X and velocity V of a vehicle with respect to time can be collected in the random vector [X(t) V(t)]ᵀ. In this paper, we denote a vector by a boldfaced lowercase letter and a matrix by an uppercase letter. The covariance of a random vector, denoted cov, is analogous to the variance of a random variable, but measures the joint variability of two random variables. The covariance between two jointly distributed random variables X and Y is defined by cov(X, Y) = E[(X − E[X])(Y − E[Y])] and is the expectation of the product of each random variable's deviation from its mean. The covariance matrix of a random vector x with mean µ_x is the matrix E[(x − µ_x)(x − µ_x)ᵀ].

Estimates: A vector estimate x_i is a random sample drawn from a distribution p_i(µ_i, Σ_i). Each µ_i is a vector [µ_1 µ_2 ... µ_n]ᵀ, where n is the number of random variables in the random vector x_i and each entry is the mean of the corresponding random variable. The covariance matrix of an n-dimensional random vector [X_1 X_2 ... X_n]ᵀ is

Σ = [ E[(X_1−µ_1)(X_1−µ_1)]  E[(X_1−µ_1)(X_2−µ_2)]  ...  E[(X_1−µ_1)(X_n−µ_n)] ;
      E[(X_2−µ_2)(X_1−µ_1)]  E[(X_2−µ_2)(X_2−µ_2)]  ...  E[(X_2−µ_2)(X_n−µ_n)] ;
      ... ;
      E[(X_n−µ_n)(X_1−µ_1)]  E[(X_n−µ_n)(X_2−µ_2)]  ...  E[(X_n−µ_n)(X_n−µ_n)] ]

or equivalently Σ = E[(x − E[x])(x − E[x])ᵀ]. Each (i, j) entry is the covariance between the i-th and j-th random variables of the random vector, so the covariance matrix of a random vector is symmetric. If we construct the covariance matrix between two distinct random vectors, the matrix loses this symmetry, but each (i, j) entry still corresponds to the

covariance between the i-th random variable of the first random vector and the j-th random variable of the second random vector. If the dimension of x_i is one, the covariance matrix reduces to its single entry, the variance of the random variable X_1. As in the scalar case, the inverse of the covariance matrix Σ_i is called the precision or information matrix.

Uncorrelated estimates: Estimates x_i and x_j are uncorrelated if E[(x_i − µ_i)(x_j − µ_j)ᵀ] = 0, that is, each pair of their component random variables is uncorrelated. This implies that each entry E[(X_i − µ_i)(X_j − µ_j)] of the cross-covariance matrix is zero. The following lemma generalizes Lemma 2.1 to the vector case.

LEMMA 2.2. Let x_1 : p_1(µ_1, Σ_1), ..., x_n : p_n(µ_n, Σ_n) be a set of pairwise uncorrelated random vectors of length m, and let y = Σ_{i=1}^{n} A_i x_i. The mean and covariance of y are:

µ_y = Σ_{i=1}^{n} A_i µ_i    (3)

Σ_y = Σ_{i=1}^{n} A_i Σ_i A_iᵀ    (4)

PROOF. Equation 3 follows from the linearity of the expectation operator. Equation 4 is proved the same way as Equation 2:

Σ_y = E[(y − µ_y)(y − µ_y)ᵀ] = E[(Σ_{i=1}^{n} A_i (x_i − µ_i))(Σ_{j=1}^{n} A_j (x_j − µ_j))ᵀ] = Σ_{i=1}^{n} Σ_{j=1}^{n} A_i E[(x_i − µ_i)(x_j − µ_j)ᵀ] A_jᵀ

Since x_i and x_j are uncorrelated, E[(x_i − µ_i)(x_j − µ_j)ᵀ] = 0 for i ≠ j, and the result follows.

3 FUSING SCALAR ESTIMATES

In this section we introduce the fusion of scalar estimates. Section 3.1 discusses the fusion of two scalar estimates, and Section 3.2 generalizes this to the fusion of n > 2 scalar estimates. Finally, Section 3.3 shows that, without any loss of quality in the final estimate, we can repeatedly fuse two estimates at a time instead of fusing n > 2 estimates at once.

3.1 Fusing two scalar estimates

We now consider the problem of choosing the optimal value of α when fusing estimates with the linear estimator y_α(x_1, x_2) = (1 − α)x_1 + αx_2, where the x_i are uncorrelated scalar estimates. We also formalize confidence in estimates. A statistical approach to choosing α relies on the variance of y_α(x_1, x_2): the optimal value of α minimizes the variance of the fused estimate. This is intuitive, because lower variance yields higher confidence; that is, confidence in an estimate is inversely proportional to the variance of the distribution from which the estimate is drawn. Minimizing the variance of the estimator therefore produces the fused estimate with the highest confidence, ŷ. The variance of the estimator is called the mean square error (MSE), and the minimum value of this variance as α varies is called the minimum mean square error (MMSE). The following theorem shows how to obtain the MMSE.

THEOREM 3.1. Let x_1 : p_1(µ_1, σ_1²) and x_2 : p_2(µ_2, σ_2²) be uncorrelated scalar estimates fused by the linear estimator y_α(x_1, x_2) = (1 − α)x_1 + αx_2. The variance (MSE) of y_α is minimized for α = σ_1² / (σ_1² + σ_2²).

PROOF. From Lemma 2.1, the two-estimate case (n = 2) gives

σ_y²(α) = (1 − α)² σ_1² + α² σ_2²    (5)

Differentiating σ_y²(α) with respect to α and setting the derivative equal to zero leads to the result:

d/dα σ_y²(α) = 2σ_1²(α − 1) + 2ασ_2² = 0  ⟹  α = σ_1² / (σ_1² + σ_2²)
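As a quick sanity check of Theorem 3.1, the short Python sketch below sweeps α over [0, 1] and confirms numerically that the variance of the fused estimate is minimized at α = σ_1²/(σ_1² + σ_2²). The variances used are arbitrary values chosen for illustration, not quantities from this paper.

import numpy as np

# Arbitrary example variances for the two estimates (assumed for illustration).
var1, var2 = 4.0, 1.0
alphas = np.linspace(0.0, 1.0, 1001)

# Variance of the fused estimate (1 - alpha)*x1 + alpha*x2, Equation 5.
fused_var = (1 - alphas) ** 2 * var1 + alphas ** 2 * var2

alpha_numeric = alphas[np.argmin(fused_var)]   # location of the minimum on the grid
alpha_theory = var1 / (var1 + var2)            # closed form from Theorem 3.1
print(alpha_numeric, alpha_theory)             # both print 0.8

With σ_1² = 4 and σ_2² = 1, both the numerical sweep and the closed form give α = 0.8, so the less precise estimate x_1 receives only 20% of the weight.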

This optimal value of α is denoted by K and is called the Kalman gain. Substituting K into the linear fusion model gives the optimal linear estimator ŷ(x_1, x_2):

ŷ(x_1, x_2) = [σ_2² / (σ_1² + σ_2²)] x_1 + [σ_1² / (σ_1² + σ_2²)] x_2

Multiplying the numerator and denominator of each weight by σ_1⁻²σ_2⁻² helps us generalize this result to the fusion of n > 2 estimates:

ŷ(x_1, x_2) = [σ_1⁻² / (σ_1⁻² + σ_2⁻²)] x_1 + [σ_2⁻² / (σ_1⁻² + σ_2⁻²)] x_2    (6)

Substituting K into Equation 5 gives the variance of ŷ:

1/σ_ŷ² = 1/σ_1² + 1/σ_2²    (7)

We have now derived the equations for the optimally fused estimate ŷ and its variance σ_ŷ². We define new variables for the reciprocals of the variances to simplify the fusion of n > 2 estimates: let v_1 = 1/σ_1² and v_2 = 1/σ_2², where v_i is the precision of the corresponding estimate's distribution. Equations 6 and 7 can then be rewritten as:

ŷ(x_1, x_2) = [v_1 / (v_1 + v_2)] x_1 + [v_2 / (v_1 + v_2)] x_2    (8)

v_ŷ = v_1 + v_2    (9)

Since the confidence in an estimate is inversely proportional to its variance, these results show that the weight we assign to each estimate is proportional to the confidence we have in that estimate. Note that if µ_1 = µ_2, then E[y_α] = µ_1 = µ_2, and y_α is called an unbiased estimator: the bias of y_α, the difference between the mean of the estimator and the mean of the true value, is zero.

3.2 Fusing scalar estimates for n > 2

In Section 3.1 we developed a way to optimally fuse two uncorrelated scalar estimates. Note that the estimates x_i do not have to be mutually independent; pairwise uncorrelated estimates are enough to generalize the fusion approach. The following theorem gives the values of α_i that minimize the variance of the fused estimate.

THEOREM 3.2. Let pairwise uncorrelated estimates x_i drawn from distributions p_i(µ_i, σ_i²) be fused using the linear model y_α(x_1, ..., x_n) = Σ_{i=1}^{n} α_i x_i, where Σ_{i=1}^{n} α_i = 1. The values α_i that minimize the variance of the fused estimate y_α are given by:

α_i = σ_i⁻² / (Σ_{j=1}^{n} σ_j⁻²)

PROOF. Lemma 2.1 gives the variance of the fused estimate, σ_y²(α) = Σ_{i=1}^{n} α_i² σ_i². To find the values that minimize σ_y²(α) under the constraint Σ_{i=1}^{n} α_i = 1, we use the method of Lagrange multipliers. We define the Lagrangian as

L = Σ_{i=1}^{n} α_i² σ_i² − λ (Σ_{i=1}^{n} α_i − 1)

Since we are dealing with scalar estimates, taking the gradient of the Lagrangian amounts to taking the partial derivative with respect to each α_i and setting it to zero:

∂L/∂α_i = 2α_i σ_i² − λ = 0  ⟹  α_i = λ / (2σ_i²)

We then substitute these values of α_i into the constraint Σ_{i=1}^{n} α_i = 1 to determine λ, and substitute λ back into α_i = λ/(2σ_i²) to obtain the result. Substituting the optimal values into σ_y²(α) = Σ_{i=1}^{n} α_i² σ_i² and y_α(x_1, ..., x_n) = Σ_{i=1}^{n} α_i x_i yields the following results, which generalize Equations 6-9:

1/σ_ŷ² = Σ_{i=1}^{n} 1/σ_i²    (10)

ŷ(x_1, ..., x_n) = Σ_{i=1}^{n} [σ_i⁻² / (Σ_{j=1}^{n} σ_j⁻²)] x_i

These expressions can be simplified by changing parameters from variances to precisions:

ŷ(x_1, ..., x_n) = Σ_{i=1}^{n} [v_i / (v_1 + ... + v_n)] x_i    (11)

v_ŷ = Σ_{i=1}^{n} v_i    (12)

3.3 Incremental fusing is optimal

In the case where a scalar estimate represents the position of a vehicle at a specific time, the estimates become available over a period of time as the vehicle moves. If all estimates are available and stored, we can reapply Equations 11 and 12 whenever a new estimate becomes available, fusing all currently available estimates from scratch. Because this requires unnecessary work and storage, we prefer to perform the fusion incrementally, saving both time and space. Rather than re-fusing all stored estimates every time a new one becomes available, we show that we can instead keep a running estimate that is fused with the next available estimate, without any loss in the quality of the final estimate. The desired result of this section is the identity

ŷ(x_1, x_2, ..., x_n) = ŷ(ŷ(...ŷ(x_1, x_2), x_3...), x_n)

where at each step the running estimate is fused with the new estimate. This process is illustrated in Figure 3. The initially available estimate x_1 is first fused with x_2 when it becomes available; the labels on the arrows connecting x_1 and x_2 to the fused estimate represent the weights allocated to the estimates. Using Equation 8, the first two estimates are fused into ŷ(x_1, x_2), whose precision, obtained from Equation 9, is displayed above the estimate. When the next estimate x_3 becomes available, x_3 and ŷ(x_1, x_2) are fused in the same way to produce ŷ(ŷ(x_1, x_2), x_3). The incremental fusion process can be repeated n times, with each estimate that becomes available fused with the previously fused estimate.
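The following Python sketch illustrates Equations 11-12 and the incremental fusion described above; the readings and variances are hypothetical values, not data from this paper. Fusing all estimates at once and fusing them one at a time as they arrive produce the same result.

import numpy as np

def fuse(estimates, variances):
    # Weights are proportional to the precisions (Equation 11); the fused
    # precision is the sum of the individual precisions (Equation 12).
    v = 1.0 / np.asarray(variances, dtype=float)
    y = np.sum(v * np.asarray(estimates, dtype=float)) / np.sum(v)
    return y, 1.0 / np.sum(v)      # fused estimate and its variance

x = [20.1, 19.6, 20.4]             # hypothetical thermometer readings
s2 = [0.5, 1.0, 2.0]               # their (assumed) variances

y_batch, var_batch = fuse(x, s2)   # fuse all three at once

# Incremental version: keep a running estimate and fuse in each new reading.
y_run, var_run = x[0], s2[0]
for xi, vi in zip(x[1:], s2[1:]):
    y_run, var_run = fuse([y_run, xi], [var_run, vi])

print(y_batch, y_run)              # identical up to floating-point rounding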

Figure 3: Dataflow graph for incremental fusion

Each estimate x_i contributes to the final value ŷ(ŷ(...ŷ(x_1, x_2), x_3...), x_n), and its contribution is given by the product of the weights on the path from that estimate to the final value. For instance, the contribution of x_i to the final fused value is

[v_i / (v_1 + ... + v_i)] × [(v_1 + ... + v_i) / (v_1 + ... + v_{i+1})] × ... × [(v_1 + ... + v_{n−1}) / (v_1 + ... + v_n)] = v_i / (v_1 + ... + v_n)

which is the same weight for each x_i as given by Equation 11. Because fusing estimates incrementally as they become available gives the same result as fusing them all at once, incremental fusion is optimal.

3.4 Summary: fusing scalar estimates for n = 2

The results obtained in this section support optimal incremental fusion of scalar estimates. When fusing uncertain scalar estimates with a linear model, the weight allocated to each estimate should be directly proportional to the precision of the distribution the estimate is drawn from; equivalently, it should be inversely proportional to the variance of that estimate. Fusion can be performed incrementally, so fusing n > 2 estimates two at a time results in no loss of quality in the final estimate. Finally, the confidence in the estimate is expressed in terms of the Kalman gain, and the equations for incremental fusing of scalar estimates are given below.

x_1 : p_1(µ_1, σ_1²), x_2 : p_2(µ_2, σ_2²)

K = σ_1² / (σ_1² + σ_2²) = v_2 / (v_1 + v_2)    (13)

ŷ(x_1, x_2) = x_1 + K(x_2 − x_1)    (14)

µ_ŷ = µ_1 + K(µ_2 − µ_1)    (15)

σ_ŷ² = σ_1² − Kσ_1², or v_ŷ = v_1 + v_2    (16)

4 FUSING VECTOR ESTIMATES

In this section we extend the results for fusing n > 2 scalar estimates to fusing n > 2 vector estimates. Incremental fusing is still optimal for vector estimates, but we only discuss the general fusion model since the two-vector case follows from the general case. As in Section 3, we conclude that vector estimates can be fused by a linear model simply by replacing the variance with the covariance matrix defined in Section 2.2.

4.1 Fusing vector estimates for n > 2

For vector estimates, we extend the scalar linear fusion model to the vector case:

y_A(x_1, x_2, ..., x_n) = Σ_{i=1}^{n} A_i x_i    (17)

where each matrix A_i is the matrix parameter for the corresponding x_i, and Σ_{i=1}^{n} A_i = I. If µ_i = µ_j for all i ≠ j,

y_A is an unbiased estimator.

Optimality: The two-norm (l₂-norm) of an n-dimensional real vector is ‖x‖ = sqrt(Σ_{i=1}^{n} x_i²), where x_i is the i-th entry of x. The MSE of the estimator is the expected value of the squared two-norm of (y_A − µ_{y_A}), which is E[(y_A − µ_{y_A})ᵀ(y_A − µ_{y_A})]. The MMSE is attained by the matrices A_i that yield the optimal covariance matrix of the fused vector estimate. The proof of Theorem 4.1 is similar to that of Theorem 3.2, but uses matrix derivatives, which are needed for Lagrangian optimization with matrix constraints; the complete proof is given in Appendix 8.3. Theorem 4.1 generalizes Theorem 3.2, with the variance replaced by the covariance matrix.

THEOREM 4.1. Let pairwise uncorrelated estimates x_i drawn from distributions p_i(µ_i, Σ_i) be fused using the linear model y_A(x_1, x_2, ..., x_n) = Σ_{i=1}^{n} A_i x_i under the constraint Σ_{i=1}^{n} A_i = I. MSE(y_A) is minimized for

A_i = (Σ_{j=1}^{n} Σ_j⁻¹)⁻¹ Σ_i⁻¹    (18)

The covariance matrix of ŷ can be determined by substituting Equation 18 into Equation 4 of Lemma 2.2. To show this, we write the covariance matrices Σ_i as B_i for clarity:

B_ŷ = Σ_{i=1}^{n} A_i B_i A_iᵀ = Σ_{i=1}^{n} (Σ_{j=1}^{n} B_j⁻¹)⁻¹ B_i⁻¹ B_i [(Σ_{j=1}^{n} B_j⁻¹)⁻¹ B_i⁻¹]ᵀ = (Σ_{j=1}^{n} B_j⁻¹)⁻¹ [Σ_{i=1}^{n} B_i⁻¹] [(Σ_{j=1}^{n} B_j⁻¹)⁻¹]ᵀ = (Σ_{j=1}^{n} B_j⁻¹)⁻¹

where the sum Σ_{i=1}^{n} B_i⁻¹ cancels one factor of its inverse, and the last step uses the symmetry of the covariance matrices (and hence of the sum of their inverses). Changing back from B to Σ, the covariance of the fused vector estimate ŷ is

Σ_ŷ = (Σ_{j=1}^{n} Σ_j⁻¹)⁻¹    (19)

The inverse of the covariance matrix, Σ⁻¹, is the precision matrix, denoted N. Using precision matrices, Equations 18 and 19 become Equations 20 and 21, which generalize ŷ(x_1, ..., x_n) and v_ŷ to the vector case:

ŷ(x_1, ..., x_n) = Σ_{i=1}^{n} (Σ_{j=1}^{n} N_j)⁻¹ N_i x_i    (20)

N_ŷ = Σ_{j=1}^{n} N_j    (21)

It can be shown that there is no loss in the quality of the final vector estimate when fusing n > 2 vector estimates incrementally; the proof is similar to the scalar case in Section 3.3 and is omitted. As in the scalar case, we define the Kalman gain K as the weight matrix applied to the second estimate in the optimal two-estimate fusion:

K = Σ_1(Σ_1 + Σ_2)⁻¹    (22)
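The short Python sketch below (with made-up means and covariance matrices) fuses two vector estimates both in the precision-matrix form of Equations 20-21 and in the Kalman-gain form of Equation 22, and checks that the two forms agree.

import numpy as np

# Two assumed vector estimates and their covariance matrices.
x1 = np.array([2.0, 1.0]); S1 = np.array([[1.0, 0.2], [0.2, 2.0]])
x2 = np.array([2.5, 0.5]); S2 = np.array([[0.5, 0.0], [0.0, 1.0]])

# Precision-matrix (information) form, Equations 20-21.
N1, N2 = np.linalg.inv(S1), np.linalg.inv(S2)
y_info = np.linalg.solve(N1 + N2, N1 @ x1 + N2 @ x2)
S_info = np.linalg.inv(N1 + N2)

# Kalman-gain form, Equation 22; the fused mean and covariance in terms of K
# are written out in Section 4.2 below.
K = S1 @ np.linalg.inv(S1 + S2)
y_gain = x1 + K @ (x2 - x1)
S_gain = S1 - K @ S1

print(np.allclose(y_info, y_gain), np.allclose(S_info, S_gain))   # True True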

We can generalize Equations 13 through 16 to the vector case by using the Kalman gain. The covariance matrix of the fused estimate ŷ in terms of the Kalman gain can be written as follows:

Σ_ŷ = Σ_1(Σ_1 + Σ_2)⁻¹Σ_2    (23)

   = KΣ_2 = Σ_1 − KΣ_1    (24)

The equations obtained in this section can be expressed in terms of the Kalman gain by substituting K = Σ_1(Σ_1 + Σ_2)⁻¹ into each equation. Equation 26 is obtained in the following way: for n = 2, Equation 20 gives

ŷ(x_1, x_2) = (N_1 + N_2)⁻¹(N_1 x_1) + (N_1 + N_2)⁻¹(N_2 x_2) = (I − K)x_1 + Kx_2 = x_1 + K(x_2 − x_1)

4.2 Summary: fusing vector estimates for n = 2

The results for fusing two vector estimates in terms of K are as follows:

x_1 : p_1(µ_1, Σ_1), x_2 : p_2(µ_2, Σ_2)

K = Σ_1(Σ_1 + Σ_2)⁻¹ = (N_1 + N_2)⁻¹N_2    (25)

ŷ(x_1, x_2) = x_1 + K(x_2 − x_1)    (26)

µ_ŷ = µ_1 + K(µ_2 − µ_1)    (27)

Σ_ŷ = Σ_1 − KΣ_1, or N_ŷ = N_1 + N_2    (28)

5 BEST LINEAR UNBIASED ESTIMATOR (BLUE)

In most applications, the estimates are vectors. In the example where vector estimates represent the state of a vehicle over time, the entire state of the vehicle might not be directly observable. For instance, the position and velocity of a rocket might not be measurable directly but can be estimated through the acceleration of the rocket: the acceleration is observable, while the position and velocity are hidden. It is important to be able to estimate the hidden portion of the rocket's state for an accurate estimation of its real trajectory.

The Gauss-Markov theorem [5] states that for a linear regression model in which all errors are mutually uncorrelated, have expectation 0, and have equal variances, the best linear unbiased estimator is the OLS estimator. This section introduces a generalization of the ordinary least squares (OLS) estimator in which the set of discrete points to be optimally modeled by a linear relation is replaced by a set of random vectors. A deterministic rule for obtaining a value of y for a given value of x is not enough when x and y are random variables; we instead model a stochastic relationship, estimating the value of y given x and the correlation between the two random variables. The joint probability distribution of scalars x and y is illustrated in Figure 4, where each point (x, y) of the xy-plane is mapped to a point on the positive z-axis by the joint probability density function f(x, y). The ellipse in Figure 5 represents the projection of a joint probability distribution onto the xy-plane; this shaded region contains the points (x, y) with significant probability, that is, the region where most points (x, y) are likely to be found.

Figure 4: A joint probability distribution of two random variables
Figure 5: BLUE line corresponding to Equation 29

For a given value x, there are infinitely many possible values of y, but the values of y inside the ellipse are the most probable. Furthermore, the value ŷ on the line crossing the ellipse is the most reasonable estimate for y, since the variance at this point is minimized. This line is called the best linear unbiased estimator (BLUE) and is the analog of ordinary least squares (OLS) for distributions. In this paper, we use the BLUE with the set of discrete data points {(x_i, y_i)} replaced by a set of random vectors {(x_i, y_i)}.

5.1 Computing BLUE

Let x : p_x(µ_x, Σ_xx) and y : p_y(µ_y, Σ_yy) be random vectors related by the linear model y_{A,b}(x) = Ax + b. As in the OLS approach, we choose the values of A and b that minimize the MSE between the random variable y and its estimate, in order to obtain the optimal estimate ŷ:

MSE_{A,b}(ŷ) = E[(y − ŷ)ᵀ(y − ŷ)] = E[(y − (Ax + b))ᵀ(y − (Ax + b))] = E[yᵀy − 2yᵀ(Ax + b) + (Ax + b)ᵀ(Ax + b)]

Setting the partial derivatives of MSE_{A,b}(ŷ) with respect to b and A equal to zero gives A = Σ_yx Σ_xx⁻¹ and b = µ_y − Σ_yx Σ_xx⁻¹ µ_x, where Σ_yx is the covariance matrix between y and x. Substituting these results into the linear model gives ŷ = (Σ_yx Σ_xx⁻¹)x + µ_y − (Σ_yx Σ_xx⁻¹)µ_x, so the best linear estimator is

ŷ = µ_y + Σ_yx Σ_xx⁻¹(x − µ_x)    (29)

The expected value of the best linear estimator is µ_y, making it an unbiased estimator. If the random vectors x and y are uncorrelated, then E[(X_i − µ_{X_i})(Y_j − µ_{Y_j})] = 0 for all i and j, making Σ_yx the zero matrix; in that case ŷ = µ_y, meaning that different values of x provide no additional information about y. Furthermore, if Σ_yx = 0, the BLUE line is parallel to the x-axis, where all values of x map to the same value of y. If y and x are related by a constant matrix C, y = Cx, then Σ_yx = CΣ_xx by the linearity of the expectation operator, so ŷ = Cx; in Figure 5, this corresponds to the case where the ellipse is projected onto the BLUE line. The covariance matrix of the best estimator of y is

Σ_ŷ = Σ_yy − Σ_yx Σ_xx⁻¹ Σ_xy    (30)

where Σ_yx = Σ_xyᵀ. Intuitively, for a given value x_0, we can reduce the uncertainty in the estimator of y by a factor that depends on the strength of the correlation between x and y. The BLUE given by Equation 29 for the vector case can be shown to generalize the OLS estimator: by computing the means and covariances of the points (x, y) in the set {(x_i, y_i)} and constructing the corresponding mean vectors and covariance matrices, the BLUE line can be obtained using the OLS approach.
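As a small illustration of Equations 29 and 30, the Python sketch below computes the BLUE of y given an observed x for assumed (made-up) means and covariances; none of these numbers come from the paper.

import numpy as np

# Assumed parameters of the joint distribution of x and y (1-D for simplicity).
mu_x = np.array([1.0]); mu_y = np.array([3.0])
Sxx = np.array([[2.0]])        # cov(x, x)
Syx = np.array([[0.8]])        # cov(y, x)
Syy = np.array([[1.5]])        # cov(y, y)

def blue(x):
    # Best linear unbiased estimate of y given an observation x (Equation 29).
    return mu_y + Syx @ np.linalg.inv(Sxx) @ (x - mu_x)

y_hat = blue(np.array([1.7]))
S_yhat = Syy - Syx @ np.linalg.inv(Sxx) @ Syx.T   # reduced uncertainty, Equation 30
print(y_hat, S_yhat)

Note that if Syx were zero, blue(x) would return µ_y for every x, matching the observation above that an uncorrelated x carries no information about y.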

6 KALMAN FILTERING: LINEAR SYSTEMS

The most common application of the Kalman filter for linearly behaving systems is state estimation with uncertain observations. In this section, we apply the theory developed in Sections 3-5 to show how the Kalman filter yields the optimal state of the system at each discrete time step, using observed estimates and estimates drawn from a model.

Before the filter can be applied, the linear model of the system must be determined. This model is constructed from the underlying dynamics of the system; in most cases, the equations needed to construct it are given by the physics of the dynamical system. Figure 6(a) shows that, using such a model, we can determine the evolution of the state of a system if the details behind the linear system are known precisely. The function f_t depends on the system's state at the previous time step t and on known control inputs, such as the acceleration applied to a particle by external variables. This can be expressed by the equation x_{t+1} = f_t(x_t, u_t), where x_t is the state of the system at time t and u_t is the control input. For linear systems, where f_t is strictly a linear function, the evolution of the system's state can be written as x_{t+1} = F_t x_t + B_t u_t, where F_t and B_t are time-dependent matrices determined by the dynamics of the system and the known control variables, respectively. Using such a linear model, we can determine the system's state at any time t, as long as the state at the previous time step is known and the dynamics and control inputs of the system are modeled precisely through the matrices F_t and B_t. In practice, the state predicted by the model can differ from the actual state of the system because of statistical noise.

Apart from the predicted state at each discrete time step, we use observations of the system's state as a second uncertain estimate at each time step, which allows us to fuse both estimates based on our confidence in each one. If either the measurement or the modeled estimate were exact, only one of the two ways of determining the system's state would be needed. In general, both the model and the measurements are noisy, and some components of the state might not be directly observable at every time step. The Kalman filter is therefore used to obtain the optimal estimate of the entire state of the system at each discrete time step.

6.1 Fusing complete observations of the state

If the entire state of the system can be observed through measurements at each time step, we can use both the measurements and the predicted estimates to determine the fused state estimate closest to the actual state of the system. If the state measurements and the predicted states are uncorrelated, and the covariance matrices of the initial modeled state and of all measurements are known, Equations 25-28 can be used to determine the optimal fused estimate and its corresponding covariance.

The fused estimate at time t can be used by the model to obtain the model's estimate at time t+1, which is then fused with the measurement at time t+1 using their respective covariance matrices. The covariance of a fused estimate at time t is obtained through Equations 25 and 28, where Σ_1 and Σ_2 are the covariance matrices of the fused estimate at the previous time step and of the measurement at the new time step, respectively. This recursive process is illustrated in Figure 6(b). The initial state at time t = 0 is fed into the model, which produces the predicted estimate of the state and its covariance at time t = 1, denoted x_{1|0} and Σ_{1|0}, respectively. The observation of the state at time t = 1, with covariance matrix R_1, is then fused with the estimate x_{1|0} using Equation 26 to produce the optimally fused state estimate x_{1|1} at time t = 1. Intuitively, the notation x_{1|0} stands for the estimate of the state at time t = 1 given the information available at time t = 0, called the a priori estimate. Similarly, x_{1|1} is the corresponding fused estimate given the information available at time t = 1, often called the a posteriori estimate; in general, x_{t+1|t+1} denotes the a posteriori estimate at time t+1.

We introduce the following generalized notation to proceed with the problem of obtaining state estimates for a linear system. The initial state of the system is denoted x_{0|0} with covariance Σ_{0|0}. In the example presented in Section 6.3, we make the arbitrary assumption that Σ_{0|0} = Q_t, where Q_t is the covariance matrix of the normally distributed noise term in the state evolution equation. Introducing a zero-mean noise term w_t into the state evolution equation, to account for uncertainty in the model and in the known control inputs, makes x_{t+1|t} a random variable. The state evolution equation becomes

x_{t+1|t} = F_t x_{t|t} + B_t u_t + w_t    (31)

where w_t : N(0, Q_t) is uncorrelated with x_{t|t}. The measured estimate of the state at time t+1 is denoted z_{t+1} and is a random vector because of the normally distributed noise term v_{t+1} : N(0, R_{t+1}). The measurement is modeled as z_{t+1} = x_{t+1} + v_{t+1}, where v_{t+1} is also uncorrelated with the a priori estimate. Figure 6(c) illustrates the application of the process described in Figure 3 to a linear system with uncertainty in both the dynamical system and the measurements. The covariance matrix of the a priori estimate, Σ_{t+1|t}, can be obtained from the covariance matrix Σ_{t|t} using Equation 4, which gives

Σ_{t+1|t} = F_t Σ_{t|t} F_tᵀ + Q_t

Figure 6(c) shows that we can then proceed with the fusion of the two vector estimates as described in Section 4.

6.2 Fusing partial observations of the state

In some applications, part of the state might not be available for measurement. In this case, the prediction is obtained in the same way, and the only difference lies in the fusion phase. The following steps describe the fusion process when part of the state is not observable; we illustrate them in Section 6.3, where the position component of a vehicle's state is hidden and the velocity is observable.

i. The observable component of the a priori estimate x_{t+1|t} is fused with the corresponding measurement, using the equations derived in Sections 3-4, to produce the a posteriori estimate of the observable portion of the state.

ii. To determine the hidden portion of the a posteriori estimate, we use the a posteriori estimate of the observable portion and the BLUE estimator.

iii. The hidden and observable a posteriori estimates are then combined to obtain the a posteriori estimate of the entire state of the system.

An implementation of the filter does not need to follow these steps explicitly, as shown in Figure 6(d), but they are useful for understanding how the BLUE estimator is used when part of the state is hidden.
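The steps above can be condensed into the familiar predict/update form of the filter. The Python sketch below is a minimal illustration, assuming given matrices F, B, H, Q and R; the update step uses the Kalman gain derived for the general case in Section 6.2.2.

import numpy as np

def predict(x, Sigma, F, Q, B=None, u=None):
    # A priori estimate x_{t+1|t} and its covariance (Equation 31 and Lemma 2.2).
    x_pred = F @ x if B is None else F @ x + B @ u
    Sigma_pred = F @ Sigma @ F.T + Q
    return x_pred, Sigma_pred

def update(x_pred, Sigma_pred, z, H, R):
    # A posteriori estimate when only Hx is observed (general case, Section 6.2.2).
    S = H @ Sigma_pred @ H.T + R                    # covariance of the innovation
    K = Sigma_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_post = x_pred + K @ (z - H @ x_pred)
    Sigma_post = (np.eye(len(x_pred)) - K @ H) @ Sigma_pred
    return x_post, Sigma_post

When H is the identity matrix, the update reduces to the full-observation fusion of Section 6.1.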

Figure 6: State estimation using Kalman filtering

6.2.1 2D state application example

We now illustrate the steps outlined in Section 6.2 using a two-dimensional problem of estimating a state vector whose first component can be measured directly and whose second component is hidden. This example is convenient because, with one component hidden, the state components can be treated separately as scalars, so we avoid dealing with vectors. We introduce the following notation to simplify the example.

a priori estimate: x_i = [h_i, c_i]ᵀ, where h_i is the observable component and c_i is the hidden component
covariance matrix of x_i: Σ_i = [σ_h², σ_hc; σ_ch, σ_c²]
a posteriori state estimate: x_o = [h_o, c_o]ᵀ
measured estimate: z
variance of the measured estimate: r²

Steps i through iii are expressed mathematically by the following calculations, which are also illustrated in Figure 7.

i. The a posteriori estimate of the observable component is obtained from the a priori estimate using Equation 14:

h_o = h_i + [σ_h² / (σ_h² + r²)](z − h_i) = h_i + K_h(z − h_i)

where K_h = σ_h² / (σ_h² + r²).

ii. The a posteriori estimate of the hidden component, c_o, is determined from the a priori estimate of the hidden component using Equation 29 reduced to the scalar case, where covariance matrices are replaced by variances:

c_o = c_i + (σ_hc / σ_h²)[σ_h² / (σ_h² + r²)](z − h_i) = c_i + [σ_hc / (σ_h² + r²)](z − h_i) = c_i + K_c(z − h_i)

where K_c = σ_hc / (σ_h² + r²).

iii. Combining steps i and ii, we get

[h_o, c_o]ᵀ = [h_i, c_i]ᵀ + [K_h(z − h_i), K_c(z − h_i)]ᵀ

We express this result in terms of matrices as a step towards the general case presented in Section 6.2.2. Defining H = [1 0] and R = r², we can write x_o and K as follows:

x_o = x_i + K(z − Hx_i)
K = Σ_i Hᵀ(HΣ_i Hᵀ + R)⁻¹

Figure 7: Computing the a posteriori estimate when part of the state is not observable

Figure 7 illustrates the computations of the previous steps, where h_o is obtained by adding K_h(z − h_i) to h_i, and similarly c_o is obtained by adding K_c(z − h_i) to c_i. The BLUE estimator is used to determine the hidden component of the a posteriori estimate, allowing us to construct the a posteriori estimate of the entire state by combining the hidden and observable components.

6.2.2 General case

Suppose that the observable component of x is given by Hx, where H is a full row-rank matrix; that is, the rows of H are linearly independent. Let C be a basis for the orthogonal complement of H, meaning that each row of C is orthogonal to every row of H. If Σ is the covariance matrix of x, it follows from the linearity of the expectation operator that the covariance between Cx and Hx is CΣHᵀ. We now generalize the steps listed in Section 6.2 to compute the a posteriori estimate from the a priori estimate.

i. The a priori estimate of the observable portion of the state is Hx_{t+1|t}, and the corresponding a posteriori estimate is obtained using Equation 26:

Hx_{t+1|t+1} = Hx_{t+1|t} + HΣ_{t+1|t}Hᵀ(HΣ_{t+1|t}Hᵀ + R_{t+1})⁻¹(z_{t+1} − Hx_{t+1|t})

We simplify this result by introducing the Kalman gain K_{t+1} = Σ_{t+1|t}Hᵀ(HΣ_{t+1|t}Hᵀ + R_{t+1})⁻¹ and rewriting the previous equation as

Hx_{t+1|t+1} = Hx_{t+1|t} + HK_{t+1}(z_{t+1} − Hx_{t+1|t})    (32)

ii. The a posteriori estimate of the hidden portion of the state is obtained from the corresponding a priori estimate and Equation 29. The result has the same form as Equation 32:

Cx_{t+1|t+1} = Cx_{t+1|t} + (CΣ_{t+1|t}Hᵀ)(HΣ_{t+1|t}Hᵀ)⁻¹HK_{t+1}(z_{t+1} − Hx_{t+1|t}) = Cx_{t+1|t} + CK_{t+1}(z_{t+1} − Hx_{t+1|t})    (33)

iii. Combining the a posteriori estimates of the hidden and observable components yields the following generalized result:

[H; C] x_{t+1|t+1} = [H; C] x_{t+1|t} + [H; C] K_{t+1}(z_{t+1} − Hx_{t+1|t})

The stacked matrix [H; C], whose rows are the rows of H followed by the rows of C, is invertible because its rows are linearly independent, so the equation becomes

x_{t+1|t+1} = x_{t+1|t} + K_{t+1}(z_{t+1} − Hx_{t+1|t}) = (I − K_{t+1}H)x_{t+1|t} + K_{t+1}z_{t+1}    (34)

Lemma 2.2 allows us to compute the covariance of the entire a posteriori estimate x_{t+1|t+1}. The estimates x_{t+1|t} and z_{t+1} are uncorrelated, so

Σ_{t+1|t+1} = (I − K_{t+1}H)Σ_{t+1|t}(I − K_{t+1}H)ᵀ + K_{t+1}R_{t+1}K_{t+1}ᵀ

Substituting the value of K_{t+1} given in step i then simplifies this to the covariance of the entire a posteriori estimate:

Σ_{t+1|t+1} = (I − K_{t+1}H)Σ_{t+1|t}    (35)

These results are illustrated in Figure 6(d). If the entire state is observable, that is, all components can be measured, the results in this section reduce to those given in Figure 6(c).

6.3 An example of 2D state estimation with partial observability

The simplest application of the Kalman filtering process is the state estimation of a vehicle. We illustrate the concepts discussed in Section 6 with the position and velocity of a vehicle. The vehicle starts at position x = 0 m with initial speed v = 100 m/s and decelerates along a straight line at a constant 5 m/s² until it stops. In this example we assume there are no known control inputs, so the B_t u_t term in Equation 31 vanishes. The underlying linear dynamical system is modeled by the equations d(t) = d_0 + v_0 t + (1/2)at² and v(t) = v_0 + at, which describe motion with constant acceleration along a straight line, where d(t) is the vehicle's distance from the origin at time t and v(t) is the vehicle's speed at time t. For the purposes of the experiment, we discretize time into intervals of 0.25 seconds. With Δt = 0.25 s and a = −5 m/s², the equations for the vehicle's distance and speed become

d(t+1) = d(t) + 0.25 v(t) − (1/2)(0.25)²(5)
v(t+1) = v(t) − (0.25)(5)

Using these equations, we construct a two-dimensional state prediction model with a zero-mean, normally distributed noise term w_t and initial conditions d(0) = 0 m and v(0) = 100 m/s:

[d_{t+1}; v_{t+1}] = [1, 0.25; 0, 1][d_t; v_t] + [−(1/2)(0.25)²(5); −(0.25)(5)] + w_t    (36)

In this example we use MATLAB to plot the different position and velocity trajectories determined by the model itself, with and without a noise term, and we plot the observable component at each time step. Finally, we compare these estimated trajectories to the optimal trajectory determined by the filter, in order to understand the effect of the filter in a real-life application. In our example, the observable component of the state is the speed, and the hidden portion of the state is the position of the vehicle. The black lines in Figures 8 and 9 represent the evolution of velocity and position according to the model without a noise term; these lines are called the model trajectories.

The red lines in the figures show the trajectories obtained from the state evolution model with a normally distributed noise term w_t : N(0, Q_t) added to it, where Q_t is the covariance of the noise; these lines are called the ground truth trajectories. We arbitrarily choose an intuitively acceptable covariance matrix Q_t, kept the same at all time steps.

Furthermore, we arbitrarily set the initial covariance matrix Σ_{0|0} = Q_t. The green points in the velocity graph represent the noisy observations of the system's state at each discrete time step, denoted z; the measurement noise terms are normally distributed with zero mean and constant variance r² = 8. Finally, the blue lines show the a posteriori estimates of the velocity and position at each time t. Because the noise terms are random variables, many different trajectories could be plotted; we illustrate one in Figure 8.

Figure 8: Estimates of the car's state over time
Figure 9: Behavior of the standard deviation of the car's velocity and position

6.4 Analysis of results obtained from Example 6.3

The length of each error bar in Figure 9 is twice the standard deviation of the a posteriori estimate. By maintaining a list of the variances of the position and velocity at each time step, we observe that the standard deviation of the velocity converges to a value of roughly 1.6. This is not counterintuitive, since we repeatedly fuse an uncertain observation with a refined predicted estimate that continuously smooths out the noise produced by the predictor. On the other hand, the standard deviation of the a posteriori estimates of the position diverges. This can be explained by the fact that the BLUE estimator used to determine the hidden portion of the a posteriori estimate gives uncertain estimates, so the estimate of the vehicle's position is always inaccurate and sometimes even imprecise. Additionally, there is no second estimator of the position of the vehicle, as there was for the velocity, so we rely on the estimates produced by the algorithm alone. The accumulation of these imprecise results leads to a diverging standard deviation, meaning that the position estimated by the algorithm becomes increasingly unreliable as time progresses. This is shown in the position graphs of Figures 8 and 9, where the error bars grow in size and the trajectory estimated by the filter (blue line) begins to separate from the ideal trajectory illustrated by the black line.
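For completeness, the Python sketch below reproduces the structure of the experiment in Section 6.3: constant deceleration, a velocity-only measurement, and repeated predict/update steps. The process-noise covariance Q below is an assumed placeholder (the exact matrix used for Figures 8 and 9 is not reproduced here); only r² = 8 and the motion parameters come from the text.

import numpy as np

rng = np.random.default_rng(0)
dt, a = 0.25, -5.0
F = np.array([[1.0, dt], [0.0, 1.0]])            # state transition for [d, v]
g = np.array([0.5 * a * dt**2, a * dt])          # constant term from the deceleration (Eq. 36)
H = np.array([[0.0, 1.0]])                       # only the velocity is measured
Q = np.eye(2)                                    # assumed process-noise covariance
R = np.array([[8.0]])                            # measurement variance r^2 = 8

x_true = np.array([0.0, 100.0])                  # ground truth: d = 0 m, v = 100 m/s
x_est, Sigma = x_true.copy(), Q.copy()           # initial estimate, Sigma_{0|0} = Q

for _ in range(80):                              # 80 steps of 0.25 s = 20 s
    # Simulate the ground truth and a noisy velocity measurement.
    x_true = F @ x_true + g + rng.multivariate_normal(np.zeros(2), Q)
    z = H @ x_true + rng.normal(0.0, np.sqrt(R[0, 0]), size=1)

    # Predict (a priori estimate), Equation 36 without the noise term.
    x_est = F @ x_est + g
    Sigma = F @ Sigma @ F.T + Q

    # Update (a posteriori estimate), Equations 34-35.
    S = H @ Sigma @ H.T + R
    K = Sigma @ H.T @ np.linalg.inv(S)
    x_est = x_est + K @ (z - H @ x_est)
    Sigma = (np.eye(2) - K @ H) @ Sigma

print(np.sqrt(np.diag(Sigma)))   # position std. dev. grows; velocity std. dev. levels off

Running this sketch shows the same qualitative behavior as Figure 9: the velocity standard deviation settles to a constant while the position standard deviation keeps growing.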

6.5 Discussion

Equation 34 shows that the a posteriori estimate is a linear combination of the a priori estimate x_{t+1|t} and the measured estimate of the observable portion of the state, z_{t+1}. Equation 34 was not derived as the optimal unbiased linear estimator for combining the two estimates; an MSE-minimizing approach can be used instead: if the a posteriori estimate is written in the form K_1 x_{t+1|t} + K_2 z_{t+1}, we choose the values K_1 and K_2 that produce the optimal unbiased estimator. These results are outlined in extensions of the Kalman filter and are not covered in this paper. The assumption that the noise terms w_t and v_t are Gaussian restricts certain applications, so other filtering techniques replace the standard Kalman filtering process when the noise can no longer be assumed to be normally distributed. The extensions of the Kalman filter mentioned in this paper consider non-linear systems, where the results are much more complex. Also, if the noise variables are not Gaussian, it can be shown that there exist non-linear estimators with a lower MSE than the linear estimator introduced in this paper; if the noise variables are normally distributed, the linear estimator is a minimum variance unbiased estimator (MVUE). This property, however, is not needed to develop a basic understanding of the Kalman filtering process.

7 CONCLUSIONS

Kalman filtering is a widely used state estimation process in computer systems and has many applications in engineering. Robot motion and vehicle navigation are the most common standard applications of the Kalman filter, and they will appear frequently as autonomous vehicles are deployed in society. Such tasks require highly accurate state estimation, and an understanding of the basic concepts behind the Kalman filtering process allows the systems to be filtered to be extended to the non-linear case. Understanding this simplified approach to the algorithm is a crucial step toward applying the filter to problems in computer science and engineering.


More information

Robotics 2 Target Tracking. Kai Arras, Cyrill Stachniss, Maren Bennewitz, Wolfram Burgard

Robotics 2 Target Tracking. Kai Arras, Cyrill Stachniss, Maren Bennewitz, Wolfram Burgard Robotics 2 Target Tracking Kai Arras, Cyrill Stachniss, Maren Bennewitz, Wolfram Burgard Slides by Kai Arras, Gian Diego Tipaldi, v.1.1, Jan 2012 Chapter Contents Target Tracking Overview Applications

More information

MS&E 226: Small Data. Lecture 6: Bias and variance (v2) Ramesh Johari

MS&E 226: Small Data. Lecture 6: Bias and variance (v2) Ramesh Johari MS&E 226: Small Data Lecture 6: Bias and variance (v2) Ramesh Johari ramesh.johari@stanford.edu 1 / 47 Our plan today We saw in last lecture that model scoring methods seem to be trading o two di erent

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond January 18, 2017 Contents 1 Batch and Recursive Estimation 2 Towards Bayesian Filtering 3 Kalman Filter and Bayesian Filtering and Smoothing

More information

Xβ is a linear combination of the columns of X: Copyright c 2010 Dan Nettleton (Iowa State University) Statistics / 25 X =

Xβ is a linear combination of the columns of X: Copyright c 2010 Dan Nettleton (Iowa State University) Statistics / 25 X = The Gauss-Markov Linear Model y Xβ + ɛ y is an n random vector of responses X is an n p matrix of constants with columns corresponding to explanatory variables X is sometimes referred to as the design

More information

From Bayes to Extended Kalman Filter

From Bayes to Extended Kalman Filter From Bayes to Extended Kalman Filter Michal Reinštein Czech Technical University in Prague Faculty of Electrical Engineering, Department of Cybernetics Center for Machine Perception http://cmp.felk.cvut.cz/

More information

The Scaled Unscented Transformation

The Scaled Unscented Transformation The Scaled Unscented Transformation Simon J. Julier, IDAK Industries, 91 Missouri Blvd., #179 Jefferson City, MO 6519 E-mail:sjulier@idak.com Abstract This paper describes a generalisation of the unscented

More information

Kalman Filter and Parameter Identification. Florian Herzog

Kalman Filter and Parameter Identification. Florian Herzog Kalman Filter and Parameter Identification Florian Herzog 2013 Continuous-time Kalman Filter In this chapter, we shall use stochastic processes with independent increments w 1 (.) and w 2 (.) at the input

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Econ 2120: Section 2

Econ 2120: Section 2 Econ 2120: Section 2 Part I - Linear Predictor Loose Ends Ashesh Rambachan Fall 2018 Outline Big Picture Matrix Version of the Linear Predictor and Least Squares Fit Linear Predictor Least Squares Omitted

More information

Covariance and Correlation

Covariance and Correlation Covariance and Correlation ST 370 The probability distribution of a random variable gives complete information about its behavior, but its mean and variance are useful summaries. Similarly, the joint probability

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Fin. Econometrics / 53 State-space Model Eduardo Rossi University of Pavia November 2014 Rossi State-space Model Fin. Econometrics - 2014 1 / 53 Outline 1 Motivation 2 Introduction 3 The Kalman filter 4 Forecast errors 5 State

More information

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations.

Linear Algebra. The analysis of many models in the social sciences reduces to the study of systems of equations. POLI 7 - Mathematical and Statistical Foundations Prof S Saiegh Fall Lecture Notes - Class 4 October 4, Linear Algebra The analysis of many models in the social sciences reduces to the study of systems

More information

Multi-Robotic Systems

Multi-Robotic Systems CHAPTER 9 Multi-Robotic Systems The topic of multi-robotic systems is quite popular now. It is believed that such systems can have the following benefits: Improved performance ( winning by numbers ) Distributed

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

Simple Linear Regression Estimation and Properties

Simple Linear Regression Estimation and Properties Simple Linear Regression Estimation and Properties Outline Review of the Reading Estimate parameters using OLS Other features of OLS Numerical Properties of OLS Assumptions of OLS Goodness of Fit Checking

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Lecture 4: Least Squares (LS) Estimation

Lecture 4: Least Squares (LS) Estimation ME 233, UC Berkeley, Spring 2014 Xu Chen Lecture 4: Least Squares (LS) Estimation Background and general solution Solution in the Gaussian case Properties Example Big picture general least squares estimation:

More information

Gaussian, Markov and stationary processes

Gaussian, Markov and stationary processes Gaussian, Markov and stationary processes Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ November

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

9 Multi-Model State Estimation

9 Multi-Model State Estimation Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 9 Multi-Model State

More information

Miscellaneous. Regarding reading materials. Again, ask questions (if you have) and ask them earlier

Miscellaneous. Regarding reading materials. Again, ask questions (if you have) and ask them earlier Miscellaneous Regarding reading materials Reading materials will be provided as needed If no assigned reading, it means I think the material from class is sufficient Should be enough for you to do your

More information

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes

Shannon meets Wiener II: On MMSE estimation in successive decoding schemes Shannon meets Wiener II: On MMSE estimation in successive decoding schemes G. David Forney, Jr. MIT Cambridge, MA 0239 USA forneyd@comcast.net Abstract We continue to discuss why MMSE estimation arises

More information

Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#20 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (6-83, F) Lecture# (Monday November ) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan Applications of Gaussian Processes (a) Inverse Kinematics

More information

1 Kalman Filter Introduction

1 Kalman Filter Introduction 1 Kalman Filter Introduction You should first read Chapter 1 of Stochastic models, estimation, and control: Volume 1 by Peter S. Maybec (available here). 1.1 Explanation of Equations (1-3) and (1-4) Equation

More information

9.2 Support Vector Machines 159

9.2 Support Vector Machines 159 9.2 Support Vector Machines 159 9.2.3 Kernel Methods We have all the tools together now to make an exciting step. Let us summarize our findings. We are interested in regularized estimation problems of

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian

More information

An Introduction to Parameter Estimation

An Introduction to Parameter Estimation Introduction Introduction to Econometrics An Introduction to Parameter Estimation This document combines several important econometric foundations and corresponds to other documents such as the Introduction

More information

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix:

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix: Joint Distributions Joint Distributions A bivariate normal distribution generalizes the concept of normal distribution to bivariate random variables It requires a matrix formulation of quadratic forms,

More information

Problem Solving. Correlation and Covariance. Yi Lu. Problem Solving. Yi Lu ECE 313 2/51

Problem Solving. Correlation and Covariance. Yi Lu. Problem Solving. Yi Lu ECE 313 2/51 Yi Lu Correlation and Covariance Yi Lu ECE 313 2/51 Definition Let X and Y be random variables with finite second moments. the correlation: E[XY ] Yi Lu ECE 313 3/51 Definition Let X and Y be random variables

More information

Implicit Functions, Curves and Surfaces

Implicit Functions, Curves and Surfaces Chapter 11 Implicit Functions, Curves and Surfaces 11.1 Implicit Function Theorem Motivation. In many problems, objects or quantities of interest can only be described indirectly or implicitly. It is then

More information

Lessons in Estimation Theory for Signal Processing, Communications, and Control

Lessons in Estimation Theory for Signal Processing, Communications, and Control Lessons in Estimation Theory for Signal Processing, Communications, and Control Jerry M. Mendel Department of Electrical Engineering University of Southern California Los Angeles, California PRENTICE HALL

More information

A Matrix Theoretic Derivation of the Kalman Filter

A Matrix Theoretic Derivation of the Kalman Filter A Matrix Theoretic Derivation of the Kalman Filter 4 September 2008 Abstract This paper presents a matrix-theoretic derivation of the Kalman filter that is accessible to students with a strong grounding

More information

ESTIMATION THEORY. Chapter Estimation of Random Variables

ESTIMATION THEORY. Chapter Estimation of Random Variables Chapter ESTIMATION THEORY. Estimation of Random Variables Suppose X,Y,Y 2,...,Y n are random variables defined on the same probability space (Ω, S,P). We consider Y,...,Y n to be the observed random variables

More information

Appendix A : Introduction to Probability and stochastic processes

Appendix A : Introduction to Probability and stochastic processes A-1 Mathematical methods in communication July 5th, 2009 Appendix A : Introduction to Probability and stochastic processes Lecturer: Haim Permuter Scribe: Shai Shapira and Uri Livnat The probability of

More information

OPTIMAL ESTIMATION of DYNAMIC SYSTEMS

OPTIMAL ESTIMATION of DYNAMIC SYSTEMS CHAPMAN & HALL/CRC APPLIED MATHEMATICS -. AND NONLINEAR SCIENCE SERIES OPTIMAL ESTIMATION of DYNAMIC SYSTEMS John L Crassidis and John L. Junkins CHAPMAN & HALL/CRC A CRC Press Company Boca Raton London

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 89 Part II

More information

Topics in Probability and Statistics

Topics in Probability and Statistics Topics in Probability and tatistics A Fundamental Construction uppose {, P } is a sample space (with probability P), and suppose X : R is a random variable. The distribution of X is the probability P X

More information

UAV Navigation: Airborne Inertial SLAM

UAV Navigation: Airborne Inertial SLAM Introduction UAV Navigation: Airborne Inertial SLAM Jonghyuk Kim Faculty of Engineering and Information Technology Australian National University, Australia Salah Sukkarieh ARC Centre of Excellence in

More information

Chapter I: Fundamental Information Theory

Chapter I: Fundamental Information Theory ECE-S622/T62 Notes Chapter I: Fundamental Information Theory Ruifeng Zhang Dept. of Electrical & Computer Eng. Drexel University. Information Source Information is the outcome of some physical processes.

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

X t = a t + r t, (7.1)

X t = a t + r t, (7.1) Chapter 7 State Space Models 71 Introduction State Space models, developed over the past 10 20 years, are alternative models for time series They include both the ARIMA models of Chapters 3 6 and the Classical

More information

EKF, UKF. Pieter Abbeel UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics

EKF, UKF. Pieter Abbeel UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics EKF, UKF Pieter Abbeel UC Berkeley EECS Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics Kalman Filter Kalman Filter = special case of a Bayes filter with dynamics model and sensory

More information

EKF, UKF. Pieter Abbeel UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics

EKF, UKF. Pieter Abbeel UC Berkeley EECS. Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics EKF, UKF Pieter Abbeel UC Berkeley EECS Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics Kalman Filter Kalman Filter = special case of a Bayes filter with dynamics model and sensory

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models, two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent variable,

More information

FIR Filters for Stationary State Space Signal Models

FIR Filters for Stationary State Space Signal Models Proceedings of the 17th World Congress The International Federation of Automatic Control FIR Filters for Stationary State Space Signal Models Jung Hun Park Wook Hyun Kwon School of Electrical Engineering

More information

1 Introduction ISSN

1 Introduction ISSN Techset Composition Ltd, Salisbury Doc: {IEE}CTA/Articles/Pagination/CTA58454.3d www.ietdl.org Published in IET Control Theory and Applications Received on 15th January 2009 Revised on 18th May 2009 ISSN

More information

A Crash Course on Kalman Filtering

A Crash Course on Kalman Filtering A Crash Course on Kalman Filtering Dan Simon Cleveland State University Fall 2014 1 / 64 Outline Linear Systems Probability State Means and Covariances Least Squares Estimation The Kalman Filter Unknown

More information

Lecture Note 12: Kalman Filter

Lecture Note 12: Kalman Filter ECE 645: Estimation Theory Spring 2015 Instructor: Prof. Stanley H. Chan Lecture Note 12: Kalman Filter LaTeX prepared by Stylianos Chatzidakis) May 4, 2015 This lecture note is based on ECE 645Spring

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

STAT 111 Recitation 7

STAT 111 Recitation 7 STAT 111 Recitation 7 Xin Lu Tan xtan@wharton.upenn.edu October 25, 2013 1 / 13 Miscellaneous Please turn in homework 6. Please pick up homework 7 and the graded homework 5. Please check your grade and

More information

A new unscented Kalman filter with higher order moment-matching

A new unscented Kalman filter with higher order moment-matching A new unscented Kalman filter with higher order moment-matching KSENIA PONOMAREVA, PARESH DATE AND ZIDONG WANG Department of Mathematical Sciences, Brunel University, Uxbridge, UB8 3PH, UK. Abstract This

More information

Stochastic Analogues to Deterministic Optimizers

Stochastic Analogues to Deterministic Optimizers Stochastic Analogues to Deterministic Optimizers ISMP 2018 Bordeaux, France Vivak Patel Presented by: Mihai Anitescu July 6, 2018 1 Apology I apologize for not being here to give this talk myself. I injured

More information

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Financial Econometrics / 49

State-space Model. Eduardo Rossi University of Pavia. November Rossi State-space Model Financial Econometrics / 49 State-space Model Eduardo Rossi University of Pavia November 2013 Rossi State-space Model Financial Econometrics - 2013 1 / 49 Outline 1 Introduction 2 The Kalman filter 3 Forecast errors 4 State smoothing

More information

Chapter 2: simple regression model

Chapter 2: simple regression model Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.

More information

Link lecture - Lagrange Multipliers

Link lecture - Lagrange Multipliers Link lecture - Lagrange Multipliers Lagrange multipliers provide a method for finding a stationary point of a function, say f(x, y) when the variables are subject to constraints, say of the form g(x, y)

More information

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY

DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY DEEP LEARNING CHAPTER 3 PROBABILITY & INFORMATION THEORY OUTLINE 3.1 Why Probability? 3.2 Random Variables 3.3 Probability Distributions 3.4 Marginal Probability 3.5 Conditional Probability 3.6 The Chain

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

Gaussian Processes. 1 What problems can be solved by Gaussian Processes?

Gaussian Processes. 1 What problems can be solved by Gaussian Processes? Statistical Techniques in Robotics (16-831, F1) Lecture#19 (Wednesday November 16) Gaussian Processes Lecturer: Drew Bagnell Scribe:Yamuna Krishnamurthy 1 1 What problems can be solved by Gaussian Processes?

More information

Linear Dynamical Systems

Linear Dynamical Systems Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

Linear Model Under General Variance

Linear Model Under General Variance Linear Model Under General Variance We have a sample of T random variables y 1, y 2,, y T, satisfying the linear model Y = X β + e, where Y = (y 1,, y T )' is a (T 1) vector of random variables, X = (T

More information

The Kalman filter is arguably one of the most notable algorithms

The Kalman filter is arguably one of the most notable algorithms LECTURE E NOTES «Kalman Filtering with Newton s Method JEFFREY HUMPHERYS and JEREMY WEST The Kalman filter is arguably one of the most notable algorithms of the 0th century [1]. In this article, we derive

More information

Statistics 910, #15 1. Kalman Filter

Statistics 910, #15 1. Kalman Filter Statistics 910, #15 1 Overview 1. Summary of Kalman filter 2. Derivations 3. ARMA likelihoods 4. Recursions for the variance Kalman Filter Summary of Kalman filter Simplifications To make the derivations

More information

BACKGROUND NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2016 PROBABILITY A. STRANDLIE NTNU AT GJØVIK AND UNIVERSITY OF OSLO

BACKGROUND NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2016 PROBABILITY A. STRANDLIE NTNU AT GJØVIK AND UNIVERSITY OF OSLO ACKGROUND NOTES FYS 4550/FYS9550 - EXERIMENTAL HIGH ENERGY HYSICS AUTUMN 2016 ROAILITY A. STRANDLIE NTNU AT GJØVIK AND UNIVERSITY OF OSLO efore embarking on the concept of probability, we will first define

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 26. Estimation: Regression and Least Squares

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 26. Estimation: Regression and Least Squares CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 26 Estimation: Regression and Least Squares This note explains how to use observations to estimate unobserved random variables.

More information

Estimation techniques

Estimation techniques Estimation techniques March 2, 2006 Contents 1 Problem Statement 2 2 Bayesian Estimation Techniques 2 2.1 Minimum Mean Squared Error (MMSE) estimation........................ 2 2.1.1 General formulation......................................

More information