Optimal Estimation of Dynamic Systems


1 Probability Concepts in Optimal Estimation of Dynamic Systems

Kamesh Subbarao
Associate Professor
Department of Mechanical and Aerospace Engineering
The University of Texas at Arlington
Phone: (214)

2 Outline

Least squares estimation
- Linear batch estimation (linear least squares, weighted least squares, constrained least squares)
- Linear sequential estimation
- Nonlinear least squares estimation
- Advanced topics; examples

Probability concepts in least squares
- Introduction to probability, random processes & statistics
- Minimum variance estimation (with and without a priori estimates)
- Maximum likelihood estimation (MLE)
- Cramer-Rao inequality
- Bayesian estimation
- Advanced topics; examples

3 Basic Linear Algebra and Matrix Calculus: Vectors & Matrices

A quantity A ∈ R^{m×n} denotes a matrix with m rows and n columns, wherein each entry a_ij, located at the i-th row and j-th column, is a real number.
A quantity x ∈ R^{n×1} is a column matrix. Vectors can be considered a special case of matrices: row vectors when x ∈ R^{1×m} and column vectors when x ∈ R^{n×1}. Without loss of generality, in this course all vectors are taken to be column vectors.
A^T represents the transpose of the matrix A; its entries are a_ji. The transpose of a column vector is a row vector:
  if x = [x_1, x_2, ..., x_n]^T (a column), then x^T = [x_1 x_2 ... x_n].

4 Vectors

Let x = [x_1, x_2, ..., x_n]^T and y = [y_1, y_2, ..., y_n]^T.

Vector norm: a measure of the length of a vector,
  ||x|| = sqrt(x^T x) = [ Σ_{i=1}^n x_i^2 ]^{1/2}
If α is a scalar, ||αx|| = |α| ||x||, where |·| denotes the absolute value.

Unit vector: x̂ = x / ||x||. The caret will also be used later to denote the estimate of a variable (all variable embellishments are purely contextual).

Dot/scalar product: x^T y = y^T x = Σ_{i=1}^n x_i y_i. (Note: the dot product of a vector with itself gives the squared norm.) If the dot product is zero, the two vectors are orthogonal. A set of vectors is orthonormal when the pairwise dot products are zero and the norms of the vectors equal unity.

5 Vectors (continued)

Triangle inequality: ||x + y|| ≤ ||x|| + ||y||
Cauchy-Schwarz inequality: |x^T y| ≤ ||x|| ||y||, i.e. 0 ≤ |x^T y| / (||x|| ||y||) ≤ 1
Orthogonal projection: the orthogonal projection of y onto x is
  p = (x^T y / ||x||^2) x
Cross/vector product: z = x × y = [x×] y, where
  [x×] = [  0   -x_3   x_2 ]
         [ x_3    0   -x_1 ]
         [ -x_2  x_1    0  ]
Observe that [x×] is a skew-symmetric matrix; it is called the cross-product operator here.

6 Matrices

The system of m linear equations
  y_1 = a_11 x_1 + a_12 x_2 + ... + a_1n x_n
  y_2 = a_21 x_1 + a_22 x_2 + ... + a_2n x_n
  ...
  y_m = a_m1 x_1 + a_m2 x_2 + ... + a_mn x_n
can be written in matrix form as y = Ax, where y ∈ R^m, A ∈ R^{m×n} and x ∈ R^n. (Note: all components of x are mapped into y via the matrix A.)

The rank of A is the smaller of the number of linearly independent rows and the number of linearly independent columns. If m = n, the matrix A is square. The solution for x is uniquely determined if m = n = rank(A), under-determined if n > m = rank(A) (an infinity of solutions is possible), and over-determined if m > n = rank(A) (usually no exact solution exists). If the rank conditions for a unique solution are not satisfied, we can use the generalized inverse.

7 Matrix Addition, Subtraction and Multiplication; Matrix Inverse

For addition and subtraction, matrices must have the same dimension: C = A ± B implies c_ij = a_ij ± b_ij. Matrix addition and subtraction are commutative and associative.
Matrix multiplication requires conformable matrices: if C = AB, the number of columns of A must equal the number of rows of B, and c_ij = Σ_{k=1}^n a_ik b_kj. Matrix multiplication is associative but not commutative in general.
  (αA)^T = α A^T, where α is a scalar
  (A + B)^T = A^T + B^T
  (AB)^T = B^T A^T

Matrix inverse: for a square matrix A, the following statements are equivalent:
- A has linearly independent columns
- A has linearly independent rows
- the inverse satisfies A^{-1} A = A A^{-1} = I
A nonsingular matrix is one whose inverse exists (likewise A^T is then nonsingular):
  (A^{-1})^{-1} = A;   (A^T)^{-1} = (A^{-1})^T ≡ A^{-T}

8 Matrix Inverse and Trace

For two nonsingular matrices A and B: (AB)^{-1} = B^{-1} A^{-1}.
For an orthonormal matrix C, C^{-1} = C^T. The determinant of an orthonormal matrix can be shown to be ±1. An orthonormal matrix preserves the length of a vector: if C is orthonormal, ||Cx|| = ||x||. In general, for an orthogonal matrix D, D^T D = D D^T = det(D) I.

Sherman-Morrison lemma: for conformable matrices A and B,
  (I + AB)^{-1} = I - A (I + BA)^{-1} B
Matrix inversion lemma: for conformable matrices A, B, C and D,
  (A + BCD)^{-1} = A^{-1} - A^{-1} B (D A^{-1} B + C^{-1})^{-1} D A^{-1}

Trace of a matrix (square matrices only): another quantity used in estimation and control theory; it is the sum of the diagonal elements.
  Tr(A) = Σ_{i=1}^n a_ii
  Tr(αA) = α Tr(A);  Tr(A + B) = Tr(A) + Tr(B)
  Tr(AB) = Tr(BA);  Tr(x y^T) = x^T y;  Tr(A x y^T) = y^T A x = x^T A^T y
x y^T is also known as the outer product of the vectors x and y. In general, x y^T ≠ y x^T.
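As a quick numerical sanity check of the two lemmas above, the following sketch (assuming Python with NumPy, which is not part of the original slides) compares the left- and right-hand sides on randomly generated conformable matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 3

# Sherman-Morrison lemma: (I + AB)^-1 = I - A (I + BA)^-1 B, with A: n x p, B: p x n
A1 = rng.standard_normal((n, p))
B1 = rng.standard_normal((p, n))
lhs = np.linalg.inv(np.eye(n) + A1 @ B1)
rhs = np.eye(n) - A1 @ np.linalg.inv(np.eye(p) + B1 @ A1) @ B1
print(np.allclose(lhs, rhs))   # True

# Matrix inversion lemma: (A + BCD)^-1 = A^-1 - A^-1 B (D A^-1 B + C^-1)^-1 D A^-1
A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned square matrix
B = rng.standard_normal((n, p))
C = rng.standard_normal((p, p)) + p * np.eye(p)
D = rng.standard_normal((p, n))
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + B @ C @ D)
rhs = Ainv - Ainv @ B @ np.linalg.inv(D @ Ainv @ B + np.linalg.inv(C)) @ D @ Ainv
print(np.allclose(lhs, rhs))   # True
```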

9 Matrix Definiteness

Definiteness is often required when we look for sufficiency, stability and convergence tests for functions of multiple variables. A real, square matrix A is
- positive definite if x^T A x > 0 for all nonzero x
- positive semi-definite if x^T A x ≥ 0 for all nonzero x
- negative definite if x^T A x < 0 for all nonzero x
- negative semi-definite if x^T A x ≤ 0 for all nonzero x
- indefinite when no definiteness can be asserted
Notice that the definiteness of a matrix is inferred through a scalar measure; the scalar x^T A x is termed a quadratic form. Thus, for two real square matrices, writing A > B means that A - B is positive definite, i.e. x^T (A - B) x > 0.

10 Matrix Rank and Nullity; Eigenvalues and Eigenvectors

Matrix rank and nullity: the rank of a matrix is the dimension of its range, corresponding to the number of linearly independent rows or columns. An m×n matrix A is rank-deficient if its rank is less than min(m, n). Suppose the rank of an n×n matrix A is rank(A) = r. Then a set of (n - r) nonzero unit vectors x̂_i can be found such that
  A x̂_i = 0,  i = 1, 2, ..., n - r
(n - r) is the nullity, the maximum number of linearly independent null vectors of A.

Eigenvalues/eigenvectors of a matrix: A p = λ p, where λ is an eigenvalue and p is the corresponding nonzero (right) eigenvector. If the eigenvalues are distinct, the set of eigenvectors is linearly independent, and A can be diagonalized as Λ = P^{-1} A P, where Λ = diag[λ_1 λ_2 ... λ_n] and P = [p_1 p_2 ... p_n]. Note: if A is symmetric, then P is orthogonal.

11 QR Decomposition and Singular Value Decomposition

QR decomposition: useful in least squares and in the Square Root Information Filter (SRIF). The QR decomposition of an m×n matrix A is A = QR, where Q is an m×m orthogonal matrix and R is an upper triangular m×n matrix with all elements R_ij = 0 for i > j. If A has full column rank, the first n columns of Q form an orthonormal basis for the range of A.

Singular value decomposition: decomposes an m×n matrix A into a diagonal matrix and two orthogonal matrices, A = U S V^T, where U is an m×m unitary matrix, S is an m×n matrix with S_ij = 0 for i ≠ j, and V is an n×n unitary matrix. Let A = U S V^T also denote the reduced representation in which rows (n + 1) and higher of S (and the corresponding columns of U) are eliminated. The elements of S = diag[s_1 s_2 ... s_n] are termed the singular values of A, ordered here from smallest to largest. These values tell us how well one can invert a matrix: condition number = s_n / s_1, and large condition numbers indicate a nearly singular matrix.
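A minimal sketch of the SVD and condition-number computation, assuming Python with NumPy (note that NumPy returns the singular values largest first, the reverse of the ordering used on this slide; the example matrix is an arbitrary nearly rank-deficient choice):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.001]])      # nearly singular 3x3 matrix

U, s, Vt = np.linalg.svd(A)            # A = U @ diag(s) @ Vt
cond = s.max() / s.min()               # ratio of largest to smallest singular value
print("singular values:", s)
print("condition number:", cond, np.linalg.cond(A))   # the two agree
```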

12 LU and Cholesky Decompositions; Matrix Calculus

LU and Cholesky decompositions: the LU decomposition factors an n×n matrix A into a product of a lower triangular matrix L and an upper triangular matrix U, so that A = LU. For symmetric positive definite matrices, A = L L^T, wherein L is known as the matrix square root and the factorization is known as the Cholesky decomposition.

Matrix calculus: consider a scalar function f(x), where x is an n×1 vector. The Jacobian (gradient) of f(x) is the n×1 vector
  ∇_x f ≡ ∂f/∂x = [ ∂f/∂x_1  ∂f/∂x_2  ...  ∂f/∂x_n ]^T

13 Matrix Calculus (continued)

The Hessian of f(x) is the n×n matrix
  ∇²_x f ≡ ∂²f/(∂x ∂x^T) =
    [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   ...  ∂²f/∂x_1∂x_n ]
    [ ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      ...  ∂²f/∂x_2∂x_n ]
    [ ...                                               ]
    [ ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   ...  ∂²f/∂x_n²    ]
Note that the Hessian of a scalar function is a symmetric matrix.

If f(x) is an m×1 vector function and x is an n×1 vector, then the Jacobian matrix is the m×n matrix
  ∇_x f ≡ ∂f/∂x =
    [ ∂f_1/∂x_1  ∂f_1/∂x_2  ...  ∂f_1/∂x_n ]
    [ ∂f_2/∂x_1  ∂f_2/∂x_2  ...  ∂f_2/∂x_n ]
    [ ...                                   ]
    [ ∂f_m/∂x_1  ∂f_m/∂x_2  ...  ∂f_m/∂x_n ]

14 Matrix Calculus: Useful Derivatives

A list of derivatives involving some common products:
  ∂/∂x (Ax) = A
  ∂/∂A (x^T A y) = x y^T
  ∂/∂A (x^T A^T y) = y x^T
  ∂/∂x (x^T A^T x) = (A + A^T) x

A very comprehensive list of identities and other matrix manipulation tools can be found here:
  Matrix identities and manipulations: http://home.online.no/pjacklam/matlab/doc/mtt/index.html
  More free resources: jburkardt/m_src/m_src.html

15 Linear Least Squares Estimation

For a measurable quantity x, the following two equations hold:
  measured value = true value + measurement error:   x̃ = x + v
  measured value = estimated value + residual error: x̃ = x̂ + e
Note:
- The actual measurement error (v) and the true value (x) are never known in practice. The process/mechanism that physically generates this error is usually approximated by some known process (e.g. Gaussian noise with known variance). These statistical properties are utilized to weight the relative importance of the various measurements used in the estimation scheme.
- The residual error is known exactly and is easily computed once an estimated value has been found. The residual error drives the estimator.

16 Linear Batch Estimation

Consider a batch of measurements obtained at discrete instants of time:
  {(ỹ_1, t_1), (ỹ_2, t_2), (ỹ_3, t_3), ..., (ỹ_m, t_m)}
We wish to model these measurements via a mathematical model (REMEMBER: YOU ARE PROPOSING THE MODEL!):
  y(t) = Σ_{i=1}^n x_i h_i(t),   m ≥ n
where h_i(t) ∈ {h_1(t), h_2(t), ..., h_n(t)} is a set of independent, specified basis functions, and the x_i are constants whose values are to be determined. We seek the optimum x-values based upon a measure of how well the model predicts the measurements. Errors in the prediction are usually due to
- measurement errors
- incorrect x-values
- modelling errors, i.e. the proposed model was bad.

17 Linear Batch Estimation (continued)

Let's relate the measurements ỹ_j and the estimated outputs ŷ_j:
  ỹ_j ≡ ỹ(t_j) = Σ_{i=1}^n x_i h_i(t_j) + v_j,   j = 1, 2, ..., m
The estimated outputs are computed using the estimated values of x and the basis functions:
  ŷ_j ≡ ŷ(t_j) = Σ_{i=1}^n x̂_i h_i(t_j),   j = 1, 2, ..., m
What about v_j? Clearly ỹ_j = Σ_{i=1}^n x̂_i h_i(t_j) + e_j, where e_j = ỹ_j - ŷ_j is the residual error. We can represent the above compactly by combining the measurements at all time instants and stacking them as
  ỹ = H x̂ + e    (1)

18 Linear Batch Estimation (continued)

where
  ỹ = [ỹ_1 ỹ_2 ... ỹ_m]^T = measured y values
  e = [e_1 e_2 ... e_m]^T = residual errors
  x̂ = [x̂_1 x̂_2 ... x̂_n]^T = estimated x values
  H = [ h_1(t_1)  h_2(t_1)  ...  h_n(t_1) ]
      [ h_1(t_2)  h_2(t_2)  ...  h_n(t_2) ]
      [ ...                               ]
      [ h_1(t_m)  h_2(t_m)  ...  h_n(t_m) ]
Similarly, one can also develop the equations
  ỹ = H x + v    (2)
  ŷ = H x̂        (3)
Equations (1) and (2) are commonly referred to as the observation equations.

19 Linear Batch Estimation: Least Squares Solution

Gauss' least squares principle selects the optimum x̂ by minimizing the sum of squares of the residual errors,
  J = (1/2) e^T e = (1/2) (ỹ - H x̂)^T (ỹ - H x̂)
or
  J = J(x̂) = (1/2) ( ỹ^T ỹ - 2 ỹ^T H x̂ + x̂^T H^T H x̂ )
The multiplier 1/2 has a statistical significance (to be discussed later). Now use the matrix calculus differentiation rules to obtain the conditions for the minimum.
Necessary condition: ∇_x̂ J = H^T H x̂ - H^T ỹ = 0, i.e. x̂ = (H^T H)^{-1} H^T ỹ
Sufficient condition: ∇²_x̂ J = ∂²J/(∂x̂ ∂x̂^T) = H^T H > 0 (H^T H is positive definite)
The inverse of H^T H is required, so a good choice of the basis functions is important.
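A minimal sketch of the batch solution above, assuming Python with NumPy and a hypothetical quadratic-in-time model with basis functions h_1(t) = 1, h_2(t) = t, h_3(t) = t² (the parameter values and noise level are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 101)                  # measurement times
x_true = np.array([1.0, -2.0, 0.3])              # assumed true parameters
H = np.column_stack([np.ones_like(t), t, t**2])  # basis functions evaluated at t_j
y_meas = H @ x_true + 0.5 * rng.standard_normal(t.size)

# Normal-equation solution x_hat = (H^T H)^-1 H^T y
x_hat = np.linalg.solve(H.T @ H, H.T @ y_meas)
residual = y_meas - H @ x_hat                    # residual error e = y - H x_hat
print(x_hat, 0.5 * residual @ residual)          # estimates and cost J
```

In practice np.linalg.lstsq (QR-based) is preferred over forming H^T H explicitly, for the conditioning reasons noted on the QR/SVD slide.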

20 Linear Batch Estimation: Example 1

Example 1: scalar dynamical system ẏ = a y + b u, where u is an external input and a and b are constants. Consider the discrete-time equivalent for a constant sampling interval Δt:
  y_{k+1} = Φ y_k + Γ u_k
Note Φ = e^{a Δt} and Γ = ∫_0^{Δt} b e^{a τ} dτ = (b/a)(e^{a Δt} - 1).
Given measurements y(t_k) for an impulse input u_1 = 100, u_k = 0 for k ≥ 2, obtain estimates of Φ and Γ (see Example 1, sketched below).
- Try different impulse values and analyze the performance of the batch estimator.
- Try different noise standard deviations and compare performance.
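The slides reference an external "Example 1" script that is not reproduced here. The following is a minimal sketch of one plausible setup, assuming Python with NumPy; the values of a, b, Δt and the noise level are assumptions chosen only for illustration. Because y_{k+1} = Φ y_k + Γ u_k is linear in (Φ, Γ), the regressor matrix simply stacks [y_k, u_k].

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, dt = -1.0, 2.0, 0.1                        # assumed continuous-time parameters
Phi, Gamma = np.exp(a * dt), (b / a) * (np.exp(a * dt) - 1.0)

m = 100
u = np.zeros(m); u[0] = 100.0                    # impulse input u_1 = 100, u_k = 0 for k >= 2
y = np.zeros(m + 1)
for k in range(m):                               # propagate the discrete model
    y[k + 1] = Phi * y[k] + Gamma * u[k]
y_meas = y + 0.05 * rng.standard_normal(y.size)  # noisy measurements

H = np.column_stack([y_meas[:-1], u])            # regressors [y_k, u_k]
x_hat = np.linalg.lstsq(H, y_meas[1:], rcond=None)[0]
print("Phi, Gamma (true):     ", Phi, Gamma)
print("Phi, Gamma (estimated):", x_hat)
```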

21 Weighted Least Squares

If each measurement is made with a different precision, it is better to capture this by weighting the measurements accordingly. The choice of weights is non-unique, but it turns out that the inverse of the measurement-error variance is an intuitive choice.
  J = (1/2) e^T W e
Necessary condition: ∇_x̂ J = H^T W H x̂ - H^T W ỹ = 0, i.e. x̂ = (H^T W H)^{-1} H^T W ỹ
Sufficient condition: ∇²_x̂ J = ∂²J/(∂x̂ ∂x̂^T) = H^T W H > 0 (H^T W H is positive definite)
W is typically chosen to be a diagonal matrix. A subset of the w_ii is chosen much larger than the others to reflect the higher precision of that specific subset of measurements.
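A small sketch of the weighted solution, assuming Python with NumPy and a hypothetical straight-line model in which the first half of the measurements is much more precise than the second half (all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 50)
H = np.column_stack([np.ones_like(t), t])
x_true = np.array([2.0, -1.0])

sigma = np.where(t < 0.5, 0.05, 0.5)       # first half precise, second half noisy
y = H @ x_true + sigma * rng.standard_normal(t.size)

W = np.diag(1.0 / sigma**2)                # weights = inverse measurement variances
x_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)
x_ols = np.linalg.solve(H.T @ H, H.T @ y)  # unweighted estimate, for comparison
print("weighted:  ", x_wls)
print("unweighted:", x_ols)
```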

22 Constrained Least Squares

A limiting case of weighted least squares occurs when a perfect measurement is obtained: the corresponding weight w_ii → ∞, which can cause numerical difficulties. How, then, to impose equality constraints in estimation problems? Suppose the observations are partitioned into the sub-systems ỹ_1 and ỹ_2, and let ỹ_2 correspond to the perfect measurements:
  [ ỹ_1 ]   [ H_1 ]       [ e_1 ]
  [ ỹ_2 ] = [ H_2 ] x̂  +  [  0  ]    (4)
or
  ỹ_1 = H_1 x̂ + e_1    (5)
  ỹ_2 = H_2 x̂          (6)
Let ỹ_1 ∈ R^{m_1×1}, ỹ_2 ∈ R^{m_2×1}, x̂ ∈ R^{n×1}, with n ≥ m_2 and n ≤ m_1.

23 Constrained Least Squares (continued)

Notice that there is no residual error in equation (6); it has to be satisfied exactly. Thus, the cost function to be minimized becomes
  J = (1/2) e_1^T W_1 e_1 = (1/2) (ỹ_1 - H_1 x̂)^T W_1 (ỹ_1 - H_1 x̂)
subject to the equality constraint ỹ_2 - H_2 x̂ = 0. We use the method of Lagrange multipliers to solve this. We begin by augmenting the cost function with a vector of Lagrange multipliers λ:
  J_a = (1/2) (ỹ_1 - H_1 x̂)^T W_1 (ỹ_1 - H_1 x̂) + λ^T (ỹ_2 - H_2 x̂)
Necessary conditions:
  ∇_x̂ J_a = (H_1^T W_1 H_1) x̂ - H_1^T W_1 ỹ_1 - H_2^T λ = 0    (7)
  ∇_λ J_a = ỹ_2 - H_2 x̂ = 0  ⇒  ỹ_2 = H_2 x̂                    (8)

24 Constrained Least Squares: Solution

Solve for x̂ from equation (7):
  x̂ = (H_1^T W_1 H_1)^{-1} H_1^T W_1 ỹ_1 + (H_1^T W_1 H_1)^{-1} H_2^T λ
Now substitute this result into equation (8) and solve for λ:
  λ = [ H_2 (H_1^T W_1 H_1)^{-1} H_2^T ]^{-1} [ ỹ_2 - H_2 (H_1^T W_1 H_1)^{-1} H_1^T W_1 ỹ_1 ]
Substitute the value of λ back into equation (7) to finally solve for x̂. Let
  K = (H_1^T W_1 H_1)^{-1} H_2^T [ H_2 (H_1^T W_1 H_1)^{-1} H_2^T ]^{-1}
and
  x̄ = (H_1^T W_1 H_1)^{-1} H_1^T W_1 ỹ_1
Thus
  x̂ = x̄ + K (ỹ_2 - H_2 x̄)
Note: if H_2 is a square matrix, K = H_2^{-1}. COMFORTING!
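A minimal sketch of this constrained solution, assuming Python with NumPy; the particular H_1, H_2 and constraint (sum of the states fixed) are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m1 = 3, 20
x_true = np.array([1.0, 2.0, 3.0])

H1 = rng.standard_normal((m1, n))
y1 = H1 @ x_true + 0.1 * rng.standard_normal(m1)   # noisy measurements
W1 = np.eye(m1)

H2 = np.array([[1.0, 1.0, 1.0]])                   # "perfect" measurement: sum of states
y2 = H2 @ x_true                                   # must be satisfied exactly

P1 = np.linalg.inv(H1.T @ W1 @ H1)
x_bar = P1 @ H1.T @ W1 @ y1                        # unconstrained weighted LS estimate
K = P1 @ H2.T @ np.linalg.inv(H2 @ P1 @ H2.T)
x_hat = x_bar + K @ (y2 - H2 @ x_bar)              # constrained estimate
print(x_hat, H2 @ x_hat - y2)                      # constraint residual is ~0
```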

25 Linear Sequential Estimation

Consider the case where measurements become available sequentially in subsets, and estimates are determined immediately upon receipt of each subset of data. When a new data subset arrives, it is desirable to determine the new estimates based upon all previous measurements, including the current subset. Let us examine the case with two subsets for simplicity:
  ỹ_1 = H_1 x + v_1    (9)
  ỹ_2 = H_2 x + v_2    (10)
The least squares estimate based on the first measurement subset is
  x̂_1 = (H_1^T W_1 H_1)^{-1} H_1^T W_1 ỹ_1
To consider both ỹ_1 and ỹ_2 simultaneously, we merge the two into a single observation equation, ỹ = H x + v.

26 Linear Sequential Estimation (continued)

ỹ = H x + v, where
  ỹ = [ ỹ_1 ]    H = [ H_1 ]    v = [ v_1 ]
      [ ỹ_2 ]        [ H_2 ]        [ v_2 ]
Assume the merged weight matrix is block diagonal, so that
  W = [ W_1   0  ]
      [  0   W_2 ]
The optimal estimate based upon the first two measurement subsets is
  x̂_2 = (H^T W H)^{-1} H^T W ỹ
Since W is block diagonal,
  x̂_2 = [ H_1^T W_1 H_1 + H_2^T W_2 H_2 ]^{-1} ( H_1^T W_1 ỹ_1 + H_2^T W_2 ỹ_2 )

27 Linear Sequential Estimation: Kalman Update

Let P_1 ≡ [H_1^T W_1 H_1]^{-1} and P_2 ≡ [H_1^T W_1 H_1 + H_2^T W_2 H_2]^{-1}. Notice that P_2^{-1} = P_1^{-1} + H_2^T W_2 H_2. After some manipulation one can obtain
  x̂_1 = P_1 H_1^T W_1 ỹ_1
  x̂_2 = x̂_1 + K_2 (ỹ_2 - H_2 x̂_1)
  ...
  x̂_{k+1} = x̂_k + K_{k+1} (ỹ_{k+1} - H_{k+1} x̂_k)
with
  K_{k+1} = P_{k+1} H_{k+1}^T W_{k+1}
  P_{k+1}^{-1} = P_k^{-1} + H_{k+1}^T W_{k+1} H_{k+1}
Thus we modify the previous best estimate x̂_k by an additional correction to account for the information contained in the (k+1)-th measurement subset. This is the Kalman update equation for computing the improved estimate x̂_{k+1}; K_{k+1} is termed the Kalman gain matrix.
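A minimal sketch of this recursion, assuming Python with NumPy; the straight-line model and noise level are illustrative assumptions. Scalar measurements are processed one at a time, and the final sequential estimate is compared against the batch least squares solution.

```python
import numpy as np

rng = np.random.default_rng(5)
t = np.linspace(0.0, 1.0, 30)
H_all = np.column_stack([np.ones_like(t), t])        # full (batch) observation matrix
x_true = np.array([1.0, -0.5])
sigma = 0.1
y_all = H_all @ x_true + sigma * rng.standard_normal(t.size)
W = 1.0 / sigma**2                                   # scalar weight per measurement

# Initialize from the first two measurements (batch), then update sequentially
H0, y0 = H_all[:2], y_all[:2]
P = np.linalg.inv(H0.T @ (W * H0))
x_hat = P @ H0.T @ (W * y0)

for Hk, yk in zip(H_all[2:], y_all[2:]):
    Hk = Hk.reshape(1, -1)
    P = np.linalg.inv(np.linalg.inv(P) + Hk.T @ (W * Hk))  # P_{k+1}^{-1} = P_k^{-1} + H^T W H
    K = P @ Hk.T * W                                       # Kalman gain K_{k+1}
    x_hat = x_hat + (K @ (yk - Hk @ x_hat)).ravel()        # x_{k+1} = x_k + K (y - H x_k)

x_batch = np.linalg.solve(H_all.T @ H_all, H_all.T @ y_all)
print(x_hat, x_batch)                                      # essentially identical
```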

28 Nonlinear Least Squares

- Most real-world estimation problems are nonlinear; the linear versions of the estimation problem and the associated developments apply only to a subset of problems encountered in practice.
- Most nonlinear estimation problems can be accurately solved by a judiciously chosen successive approximation procedure.
- The most widely used successive approximation procedure is nonlinear least squares, also known as Gaussian least squares differential correction (an early application by Gauss in the 1800s to determine planetary orbits from telescope measurements of the line-of-sight angles to the planets).

29 Nonlinear Least Squares: Problem Setup

Assume m observable quantities modelled as
  y_j = f_j(x_1, x_2, ..., x_n);   j = 1, 2, ..., m;   m ≥ n
where the f_j(x_1, x_2, ..., x_n) are m arbitrary, independent functions of the unknown parameters x_i. We require that f(x) and at least its first partial derivatives be single-valued and continuous, i.e. that f be at least once differentiable.
Suppose a set of observed values ỹ_j of the variables y_j ∈ {y_1, y_2, ..., y_m} is available.
Measurement model: ỹ = f(x) + v

30 Nonlinear Least Squares: Notation

  ỹ = [ỹ_1 ỹ_2 ... ỹ_m]^T = measured y values
  f(x) = [f_1 f_2 ... f_m]^T = independent model functions
  x = [x_1 x_2 ... x_n]^T = true x values
  v = [v_1 v_2 ... v_m]^T = measurement errors
Estimated y-values: ŷ = f(x̂), with residual e = ỹ - ŷ, where
  ŷ = [ŷ_1 ŷ_2 ... ŷ_m]^T = estimated y values
  x̂ = [x̂_1 x̂_2 ... x̂_n]^T = estimated x values
  e = [e_1 e_2 ... e_m]^T = residual errors

31 Nonlinear Least Squares: Cost Function

Rewriting the measurement model: ỹ = f(x̂) + e. As before, we seek an estimate x̂ of x that minimizes
  J = (1/2) e^T W e = (1/2) [ỹ - f(x̂)]^T W [ỹ - f(x̂)]
W is an m×m weighting matrix used to weight the relative importance of each measurement.
Gauss' procedure: assume current estimates of the unknown x-values are available,
  x_c = [x_1c x_2c ... x_nc]^T

32 Nonlinear Least Squares: Linearization

Whatever the unknown objective x-values x̂ are, we assume they are related to the respective current estimates by an (also unknown) set of corrections Δx:
  x̂ = x_c + Δx
Linearize f(x̂) about x_c:
  f(x̂) ≈ f(x_c) + H Δx,   where   H = ∂f/∂x |_{x_c}

33 Nonlinear Least Squares: Linearized Residuals

The gradient matrix H is known as the Jacobian matrix. The measurement residual after the correction can now be linearly approximated as
  Δy ≡ ỹ - f(x̂) ≈ ỹ - f(x_c) - H Δx = Δy_c - H Δx
where the residual before the correction is Δy_c ≡ ỹ - f(x_c).
Objective: minimize the weighted sum of squares J.
Strategy: to determine approximate corrections Δx, select the particular corrections that lead to the minimum sum of squares of the linearly predicted residuals, J_p:
  J = (1/2) Δy^T W Δy ≈ J_p ≡ (1/2) (Δy_c - H Δx)^T W (Δy_c - H Δx)

34 Nonlinear Least Squares: Correction and Stopping Criterion

Following the minimization procedure as before, one obtains
  Δx = (H^T W H)^{-1} H^T W Δy_c
An initial guess x_c is essential to begin the procedure. A stopping condition with an accuracy-dependent tolerance is
  |δJ| = |J_i - J_{i-1}| / J_i < ε
where ε is some prescribed small value. If the condition is not satisfied, the update procedure is iterated with the new estimate.
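A minimal sketch of the differential-correction loop, assuming Python with NumPy and a hypothetical two-parameter exponential model y = x_1 exp(x_2 t) (the model, starting guess, and noise level are assumptions chosen to illustrate the iteration, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.linspace(0.0, 2.0, 50)
x_true = np.array([2.0, -1.5])
f = lambda x: x[0] * np.exp(x[1] * t)               # hypothetical nonlinear model
y = f(x_true) + 0.02 * rng.standard_normal(t.size)
W = np.eye(t.size)

x_c = np.array([1.0, -0.5])                          # initial guess
J_prev = np.inf
for it in range(50):
    dy_c = y - f(x_c)                                # residual before the correction
    # Jacobian H = df/dx evaluated at the current estimate x_c
    H = np.column_stack([np.exp(x_c[1] * t), x_c[0] * t * np.exp(x_c[1] * t)])
    dx = np.linalg.solve(H.T @ W @ H, H.T @ W @ dy_c)
    x_c = x_c + dx                                   # differential correction
    J = 0.5 * dy_c @ W @ dy_c
    if abs(J_prev - J) / J < 1e-8:                   # stopping criterion on J
        break
    J_prev = J
print(it, x_c)                                       # converges to near x_true
```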

35 Nonlinear Least Squares: Convergence Issues

This procedure of differential correction has been widely used on many problems. Convergence difficulties usually stem from one of the following sources:
- the initial x-estimate is too far from the minimizing x̂ (relative to the nonlinearity of the particular application)
- numerical difficulties in solving for the corrections: arithmetic errors corrupting the algorithm, or the H matrix having fewer than n linearly independent rows or columns (i.e. being rank deficient).
The initial-estimate convergence difficulty can be overcome by using the Levenberg-Marquardt algorithm, which combines the least squares differential correction process with a gradient search.

36 Nonlinear Least Squares - Algorithm (flowchart figure)

37 Nonlinear Least Squares - Example

Consider the earlier example with the model
  y_{k+1} = [ e^{a Δt} ] y_k + [ (b/a)(e^{a Δt} - 1) ] u_k
Suppose we wish to determine a and b directly. Clearly the parameters appear nonlinearly in the above equation, so we use the nonlinear least squares algorithm, with
  x = [a b]^T,   ỹ = [ỹ_2 ỹ_3 ... ỹ_101]^T,   f_k = [ e^{a Δt} ] y_k + [ (b/a)(e^{a Δt} - 1) ] u_k
The appropriate partials are given by
  ∂f_k/∂a = [ Δt e^{a Δt} ] y_k + [ -(b/a²)(e^{a Δt} - 1) + (b/a) Δt e^{a Δt} ] u_k
  ∂f_k/∂b = (1/a)(e^{a Δt} - 1) u_k

38 Nonlinear Least Squares - Example (continued)

The H matrix is then formed by stacking the partials for k = 1, 2, ..., 100:
  H = [ ∂f_k/∂a   ∂f_k/∂b ]
    = [ Δt e^{a Δt} y_k + ( -(b/a²)(e^{a Δt} - 1) + (b/a) Δt e^{a Δt} ) u_k ,   (1/a)(e^{a Δt} - 1) u_k ]
Using a starting guess x = [5 5]^T and a small stopping tolerance ε, the NLS algorithm converges in 5 to 6 iterations to estimates â and b̂; converting these to the discrete equivalents gives Φ̂ and Γ̂.
The example also illustrates why model choice is important in least squares estimation algorithms.

39 Probability Concepts

Consider a single throw of a true (fair) die. The probability of occurrence of each of the events 1, 2, 3, 4, 5 or 6 is exactly the same on a given throw. For a loaded or biased die, the probability of occurrence of certain events would be greater than others.
If a given discrete-valued experiment is conducted N times and N_j is the number of times that the j-th event x(j) occurs, then intuitively the probability of occurrence of x(j) can be written as
  p(x(j)) = lim_{N→∞} N_j / N
For example, for the throw of a single die, the probability of obtaining a value of 3 is p(3) = 1/6.
A discrete-valued random variable x is defined as a function having a finite number of possible values x(j), with the associated probability of x(j) occurring denoted by p(x(j)). More compactly, x(j) and p(x(j)) will henceforth be referred to as x and p(x). (Attention: context is important!)

40 Probability Concepts (continued)

Expanding this to the case of a single throw of two dice, we now have 36 outcomes. Clearly the probability of the sum of the two dice being 7 is the highest. When n ≥ 2 dice are thrown, it is difficult to enumerate all combinations; a generating function is helpful:
  f(x) = (x + x² + x³ + x⁴ + x⁵ + x⁶)^n
The coefficients of the powers of x are used to form the counts; the probability is then obtained by dividing the count by 6^n.
Consider another experiment involving 4 flips of a coin. The number of ways heads appears among the total of 16 (= 2⁴) outcomes is calculated as follows: the number of ways to obtain x heads in n flips is the number of combinations of n things taken x at a time,
  nCx = n! / ( x! (n - x)! )
For example, if n = 4 and x = 2, the number of ways equals 6, and the probability is 6/16 = 3/8.

41 Compound Events, Joint Probability, Independence

A compound event is defined as the occurrence of either x(j) or x(k). The probability of such an event is computed as
  p(x(j) ∪ x(k)) = p(x(j)) + p(x(k)) - p(x(j) ∩ x(k))
Note: x(j) ∪ x(k) denotes x(j) or x(k), and x(j) ∩ x(k) denotes x(j) and x(k). The probability of obtaining one event and another event is known as the joint probability. If x(j) and x(k) are mutually exclusive, then p(x(j) ∩ x(k)) = 0.
For discrete-valued events, p(x(j)) is also termed the probability mass function:
  0 ≤ p(x(j)) ≤ 1,   Σ_j p(x(j)) = 1
If events x(j) and x(k) are independent, then p(x(j) ∩ x(k)) = p(x(j)) p(x(k)).
The conditional probability of x(j) given x(k) is denoted p(x(j) | x(k)).

42 Conditional Probability and Bayes' Rule

Suppose event x(k) has occurred. Then x(j) occurs if and only if x(j) and x(k) occur, so the probability of x(j), given that we know x(k) occurred, should intuitively depend on p(x(j) ∩ x(k)):
  p(x(j) | x(k)) = p(x(j) ∩ x(k)) / p(x(k))
Similarly,
  p(x(k) | x(j)) = p(x(j) ∩ x(k)) / p(x(j))
Combining the two gives Bayes' rule:
  p(x(j) | x(k)) p(x(k)) = p(x(k) | x(j)) p(x(j))  ⇒  p(x(j) | x(k)) = p(x(k) | x(j)) p(x(j)) / p(x(k))
This rule can be used to show some interesting, counterintuitive results. For example, say 1 out of 1000 people has a rare disease. Tests are positive 99% of the time when the disease is present and 2% of the time when it is not. What is the probability that someone actually has the disease when the test is positive? (≈ 0.047)
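A short sketch of the rare-disease calculation from this slide, assuming Python (no additional libraries needed):

```python
# Bayes' rule for the rare-disease example on this slide
p_disease = 1.0 / 1000.0          # prior P(disease)
p_pos_given_d = 0.99              # P(test positive | disease)
p_pos_given_not_d = 0.02          # P(test positive | no disease)

p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1.0 - p_disease)
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(round(p_d_given_pos, 3))    # 0.047
```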

43 Random Variables and Statistics

The random variable x is usually described in terms of its moments. The first two moments of x are the mean
  μ ≡ Σ_j x(j) p(x(j))
and the variance
  σ² ≡ Σ_j (x(j) - μ)² p(x(j))
σ is the standard deviation of x. The expected value of a function f(x) of a discrete random variable x is defined as
  E{f(x)} = Σ_j f(x(j)) p(x(j))
Clearly, the mean and variance are the expected values of the functions f(x) = x and f(x) = (x - μ)², respectively. The expectation operator E{·} is a linear operator, i.e.
  E{a f(x) + b g(x)} = a E{f(x)} + b E{g(x)}

44 Gaussian Random Processes

The most widely used distribution for state estimation involves the Gaussian random process. The Gaussian or normal density function for x is
  p(x) = (1 / (σ √(2π))) exp[ -(x - μ)² / (2σ²) ]
with mean μ and variance σ². For the multi-dimensional case, a vector x with mean μ and covariance R has the probability density function
  p(x) = (1 / [det(2πR)]^{1/2}) exp[ -(1/2) (x - μ)^T R^{-1} (x - μ) ]
The mean and covariance suffice to define the distribution; thus, a simple notation for a Gaussian random variable is x ~ N(μ, R).
A stochastic process is simply a collection of random vectors defined on the same probability space. A zero-mean Gaussian white-noise process has the following properties:
  E{x(τ)} = 0,   E{x(τ) x^T(τ')} = R δ(τ' - τ)
where δ(τ' - τ) is the delta function.

45 Gaussian Random Processes (continued)

A process is said to be stationary if its random-variable statistics do not vary in time, i.e. the probability statistics at time τ have the same mean and covariance as the probability statistics at time τ'.

46 Minimum Variance Estimation (a priori state estimates absent)

For a measurable quantity x, assume the linear observation model
  ỹ = H x + v,   v ~ N(0, R)
We desire to estimate x as a linear combination of the measurements ỹ:
  x̂ = M ỹ + n
We seek the optimum choice of M and n. The minimum variance definition of the optimum M and n is that the variance of each estimate x̂_i about its respective true value is minimized:
  J_i = (1/2) E{(x̂_i - x_i)²},   i = 1, 2, ..., n
From the first two equations, assuming perfect measurements (no measurement error, i.e. ỹ = Hx), we note that x̂ = M H x + n. Thus, for perfect estimates, MH = I and n = 0.

47 Minimum Variance Estimation (a priori state estimates absent), continued

These are the constraints determined for the ideal estimation process, MH = I and n = 0. The desired estimator is then simply x̂ = M ỹ.
The estimator is derived as follows. The error covariance matrix of the unbiased estimator is P = E{(x̂ - x)(x̂ - x)^T}. We wish to determine the M that minimizes this covariance subject to the constraint MH = I. We form the generalized loss function to be minimized:
  J = (1/2) Tr[ E{(x̂ - x)(x̂ - x)^T} ] + Tr[ Λ (I - MH) ]
where Λ is a matrix of Lagrange multipliers. Solving this minimization, we obtain
  M = (H^T R^{-1} H)^{-1} H^T R^{-1}
Thus the desired estimator is
  x̂ = (H^T R^{-1} H)^{-1} H^T R^{-1} ỹ

48 Gauss-Markov Theorem

  x̂ = (H^T R^{-1} H)^{-1} H^T R^{-1} ỹ
The above result is also referred to as the Gauss-Markov theorem: the minimum variance estimator is identical to the least squares estimator, provided that the weight matrix is identified as the inverse of the observation error covariance, W = R^{-1}. Note that the sequential least squares solution takes on a similar form if R^{-1} has a block structure.

49 Minimum Variance Estimation (with a priori state estimates)

We now extend the above to allow a rigorous incorporation of a priori estimates x̂_a. Assume the linear observation model
  ỹ = H x + v,   v ~ N(0, R),   R = E{v v^T}
Suppose x is unknown (a random variable). The a priori state estimate is given as
  x̂_a = x + w
with associated (assumed known) a priori error covariance matrix cov{w} ≡ Q = E{w w^T}.
Assumption: E{w v^T} = 0, i.e. the measurement errors and the a priori estimate errors are uncorrelated.
We desire to estimate x as a linear combination of the measurements ỹ and the a priori estimate x̂_a, i.e.
  x̂ = M ỹ + N x̂_a + n
As before, we derive the constraints for perfect state estimates: MH + N = I and n = 0. Thus x̂ = M ỹ + N x̂_a.

50 Minimum Variance Estimation (with a priori state estimates), continued

We form the generalized loss function to be minimized:
  J = (1/2) Tr[ E{(x̂ - x)(x̂ - x)^T} ] + Tr[ Λ (I - MH - N) ]
where Λ is a matrix of Lagrange multipliers. Solving this minimization, we obtain
  M = (H^T R^{-1} H + Q^{-1})^{-1} H^T R^{-1}
and
  N = (H^T R^{-1} H + Q^{-1})^{-1} Q^{-1}
Thus
  x̂ = (H^T R^{-1} H + Q^{-1})^{-1} (H^T R^{-1} ỹ + Q^{-1} x̂_a)

51 Minimum Variance Estimation (with a priori state estimates): Limiting Cases

Consider the limiting cases of
  x̂ = (H^T R^{-1} H + Q^{-1})^{-1} (H^T R^{-1} ỹ + Q^{-1} x̂_a)
- A priori knowledge is very poor: Q → ∞, Q^{-1} → 0 with R finite. The estimator reduces to the standard minimum variance estimator.
- Lousy measurements: R → ∞, R^{-1} → 0 with Q finite. Then x̂ = x̂_a.
Notice the parallels with sequential least squares estimation. (A small numerical sketch follows.)
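A minimal sketch of the estimator with a priori information, assuming Python with NumPy; the model, prior, and covariances are hypothetical values. Scaling Q or R by a large factor reproduces the two limiting cases described above.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m = 2, 10
x_true = np.array([1.0, 2.0])

H = rng.standard_normal((m, n))
R = 0.1 * np.eye(m)                                  # measurement noise covariance
y = H @ x_true + rng.multivariate_normal(np.zeros(m), R)

x_a = x_true + np.array([0.3, -0.2])                 # a priori estimate (with error)
Q = 0.25 * np.eye(n)                                 # a priori error covariance

def estimate(Q, R):
    Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
    return np.linalg.solve(H.T @ Rinv @ H + Qinv, H.T @ Rinv @ y + Qinv @ x_a)

print(estimate(Q, R))                   # blends prior and measurements
print(estimate(1e9 * Q, R))             # poor prior  -> standard minimum variance estimate
print(estimate(Q, 1e9 * R))             # lousy data  -> approximately x_a
```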

52 Maximum Likelihood Estimation

This method was introduced by R. A. Fisher, a geneticist and statistician, in the 1920s. Maximum likelihood estimation yields estimates of the unknown quantities that maximize the probability (likelihood) of obtaining the observed set of data. Despite the fundamental differences from the minimum variance estimator, both approaches give exactly the same least squares estimates under the assumption of a zero-mean Gaussian measurement-error process.
Assume the following linear observation model with deterministic x:
  ỹ = H x + v
where v is a zero-mean Gaussian error process with covariance R. The mean μ and covariance of the measurements ỹ are
  μ = E{Hx} + E{v} = Hx
  cov{ỹ} = E{(ỹ - μ)(ỹ - μ)^T} = R

53 Maximum Likelihood Estimation (continued)

Thus ỹ ~ N(μ, R) with μ = Hx. We now use the multidimensional (multivariate) normal distribution for the likelihood function:
  L(ỹ; x) = (1 / ((2π)^{m/2} [det(R)]^{1/2})) exp[ -(1/2) (ỹ - Hx)^T R^{-1} (ỹ - Hx) ]
The log-likelihood function is
  ln L(ỹ; x) = -(m/2) ln(2π) - (1/2) ln[det(R)] - (1/2) (ỹ - Hx)^T R^{-1} (ỹ - Hx)
For the optimization, we can ignore the first two terms on the right-hand side as they do not depend on x. Note that maximizing the log-likelihood to obtain x̂ is equivalent to minimizing
  J(x̂) = (1/2) (ỹ - H x̂)^T R^{-1} (ỹ - H x̂)
Thus x̂ = (H^T R^{-1} H)^{-1} H^T R^{-1} ỹ (the same as the minimum variance estimate).

54 Cramer-Rao Inequality

This inequality can be used to give a lower bound on the expected errors between the estimated quantities and the true values, from the known properties of the measurement errors. Let f(ỹ; x) be the probability density function of the sample ỹ. The Cramer-Rao inequality for an unbiased estimate x̂ is
  P ≡ E{(x̂ - x)(x̂ - x)^T} ≥ F^{-1}
where the Fisher information matrix is
  F = E{ [∂ ln f(ỹ; x)/∂x] [∂ ln f(ỹ; x)/∂x]^T }
The matrix can also be computed using the Hessian matrix:
  F = -E{ ∂² ln f(ỹ; x) / (∂x ∂x^T) }

55 Cramer-Rao Inequality (continued)

For the Gauss-Markov theorem and the log-likelihood function, i.e. J(x), we compute
  F = ∂²J(x)/(∂x ∂x^T) = H^T R^{-1} H
Hence the Cramer-Rao inequality is
  P ≥ (H^T R^{-1} H)^{-1}
Also, x̂ - x = (H^T R^{-1} H)^{-1} H^T R^{-1} v, and using E{v v^T} = R gives the error covariance P = (H^T R^{-1} H)^{-1}. The equality is satisfied, so the least squares estimate from the Gauss-Markov theorem is the most efficient possible estimate.
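A minimal Monte Carlo sketch of this claim, assuming Python with NumPy and a hypothetical straight-line model: the sample covariance of the least squares errors should match the Cramer-Rao bound (H^T R^{-1} H)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(8)
t = np.linspace(0.0, 1.0, 20)
H = np.column_stack([np.ones_like(t), t])
x_true = np.array([1.0, -2.0])
sigma = 0.2
R_inv = np.eye(t.size) / sigma**2

P_crlb = np.linalg.inv(H.T @ R_inv @ H)              # Cramer-Rao lower bound

errors = []
for _ in range(20000):                               # Monte Carlo trials
    y = H @ x_true + sigma * rng.standard_normal(t.size)
    x_hat = np.linalg.solve(H.T @ R_inv @ H, H.T @ R_inv @ y)
    errors.append(x_hat - x_true)
P_sample = np.cov(np.array(errors).T)
print(P_crlb)
print(P_sample)                                      # close to the bound: equality attained
```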

56 Bayesian Estimation

Up until now we assumed the parameters to be estimated were unknown constants. In Bayesian estimation, we consider the parameters to be random variables with some a priori distribution. Bayesian estimation combines this a priori information with the measurements through a conditional density function of x given the measurements ỹ; this conditional density is known as the a posteriori distribution of x. Thus, Bayesian estimation requires the probability density functions of both the measurement noise and the unknown parameters. From Bayes' rule,
  f(x | ỹ) = f(ỹ | x) f(x) / f(ỹ)    (11)
Since ỹ is treated as known, f(ỹ) is just a normalization factor to ensure that f(x | ỹ) is a probability density function; thus f(ỹ) = ∫ f(ỹ | x) f(x) dx. If this integral does not exist, we let f(ỹ) = 1 so that f(x | ỹ) remains proper.

57 Maximum a Posteriori (MAP) Estimation

Maximum a posteriori (MAP) estimation finds the estimate of x that maximizes equation (11). Since f(ỹ) does not depend on x̂ explicitly, this is equivalent to maximizing f(ỹ | x̂) f(x̂). We again use the natural logarithm:
  J_MAP(x̂) = ln[f(ỹ | x̂)] + ln[f(x̂)]
The first term in the sum is the log-likelihood function, and the second term gives the a priori information on the to-be-determined parameters. Thus the MAP estimator maximizes
  J_MAP(x̂) = ln[L(ỹ; x̂)] + ln[f(x̂)]
Note: if the a priori distribution f(x̂) is uniform, MAP estimation is equivalent to maximum likelihood estimation.

58 MAP Estimation for the Gaussian Case

Let
  L(ỹ; x̂) = (1 / ((2π)^{m/2} [det(R)]^{1/2})) exp[ -(1/2) (ỹ - H x̂)^T R^{-1} (ỹ - H x̂) ]
  f(x̂) = (1 / ((2π)^{n/2} [det(Q)]^{1/2})) exp[ -(1/2) (x̂_a - x̂)^T Q^{-1} (x̂_a - x̂) ]
Carrying out the maximization as per MAP estimation gives the estimator
  x̂ = (H^T R^{-1} H + Q^{-1})^{-1} (H^T R^{-1} ỹ + Q^{-1} x̂_a)
which is the same result obtained through minimum variance estimation; the solution via MAP estimation is much simpler, though.
Another approach to Bayesian estimation is the minimum risk (MR) estimator. It has some practical difficulties; for example, convergence to the ML estimates for uniform a priori distributions is not guaranteed. For Gaussian cases, it is identical to the MAP estimator.
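A minimal numerical sketch of this equivalence, assuming Python with NumPy and a hypothetical scalar-constant estimation problem: a brute-force maximization of ln L(ỹ; x) + ln f(x) over a fine grid recovers the closed-form MAP / minimum-variance-with-prior estimate.

```python
import numpy as np

rng = np.random.default_rng(9)
m = 15
H = np.ones((m, 1))                                  # estimate a scalar constant
x_true = 3.0
R = 0.5 * np.eye(m)
y = (H * x_true).ravel() + np.sqrt(0.5) * rng.standard_normal(m)

x_a, Q = np.array([2.0]), np.array([[0.2]])          # Gaussian prior on x

# Closed-form MAP / minimum-variance-with-prior estimate
Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
x_map = np.linalg.solve(H.T @ Rinv @ H + Qinv, H.T @ Rinv @ y + Qinv @ x_a)

# Brute-force check: maximize ln L(y; x) + ln f(x) on a fine grid (constant terms dropped)
grid = np.linspace(0.0, 6.0, 6001)
logpost = [-0.5 * (y - H.ravel() * x) @ Rinv @ (y - H.ravel() * x)
           - 0.5 * (x_a[0] - x) ** 2 * Qinv[0, 0] for x in grid]
print(x_map, grid[int(np.argmax(logpost))])          # the two agree closely
```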


More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2 Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch

More information

Basic Elements of Linear Algebra

Basic Elements of Linear Algebra A Basic Review of Linear Algebra Nick West nickwest@stanfordedu September 16, 2010 Part I Basic Elements of Linear Algebra Although the subject of linear algebra is much broader than just vectors and matrices,

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Linear Algebra: Matrix Eigenvalue Problems

Linear Algebra: Matrix Eigenvalue Problems CHAPTER8 Linear Algebra: Matrix Eigenvalue Problems Chapter 8 p1 A matrix eigenvalue problem considers the vector equation (1) Ax = λx. 8.0 Linear Algebra: Matrix Eigenvalue Problems Here A is a given

More information

Computational Methods. Least Squares Approximation/Optimization

Computational Methods. Least Squares Approximation/Optimization Computational Methods Least Squares Approximation/Optimization Manfred Huber 2011 1 Least Squares Least squares methods are aimed at finding approximate solutions when no precise solution exists Find the

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Linear Algebra, Summer 2011, pt. 2

Linear Algebra, Summer 2011, pt. 2 Linear Algebra, Summer 2, pt. 2 June 8, 2 Contents Inverses. 2 Vector Spaces. 3 2. Examples of vector spaces..................... 3 2.2 The column space......................... 6 2.3 The null space...........................

More information

Linear Algebra and Matrices

Linear Algebra and Matrices Linear Algebra and Matrices 4 Overview In this chapter we studying true matrix operations, not element operations as was done in earlier chapters. Working with MAT- LAB functions should now be fairly routine.

More information

STAT 100C: Linear models

STAT 100C: Linear models STAT 100C: Linear models Arash A. Amini June 9, 2018 1 / 56 Table of Contents Multiple linear regression Linear model setup Estimation of β Geometric interpretation Estimation of σ 2 Hat matrix Gram matrix

More information

Math 302 Outcome Statements Winter 2013

Math 302 Outcome Statements Winter 2013 Math 302 Outcome Statements Winter 2013 1 Rectangular Space Coordinates; Vectors in the Three-Dimensional Space (a) Cartesian coordinates of a point (b) sphere (c) symmetry about a point, a line, and a

More information

Jim Lambers MAT 610 Summer Session Lecture 2 Notes

Jim Lambers MAT 610 Summer Session Lecture 2 Notes Jim Lambers MAT 610 Summer Session 2009-10 Lecture 2 Notes These notes correspond to Sections 2.2-2.4 in the text. Vector Norms Given vectors x and y of length one, which are simply scalars x and y, the

More information

Mathematical Foundations of Applied Statistics: Matrix Algebra

Mathematical Foundations of Applied Statistics: Matrix Algebra Mathematical Foundations of Applied Statistics: Matrix Algebra Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/105 Literature Seber, G.

More information

ECEN 615 Methods of Electric Power Systems Analysis Lecture 18: Least Squares, State Estimation

ECEN 615 Methods of Electric Power Systems Analysis Lecture 18: Least Squares, State Estimation ECEN 615 Methods of Electric Power Systems Analysis Lecture 18: Least Squares, State Estimation Prof. om Overbye Dept. of Electrical and Computer Engineering exas A&M University overbye@tamu.edu Announcements

More information

10.34 Numerical Methods Applied to Chemical Engineering Fall Quiz #1 Review

10.34 Numerical Methods Applied to Chemical Engineering Fall Quiz #1 Review 10.34 Numerical Methods Applied to Chemical Engineering Fall 2015 Quiz #1 Review Study guide based on notes developed by J.A. Paulson, modified by K. Severson Linear Algebra We ve covered three major topics

More information

Linear Algebra Highlights

Linear Algebra Highlights Linear Algebra Highlights Chapter 1 A linear equation in n variables is of the form a 1 x 1 + a 2 x 2 + + a n x n. We can have m equations in n variables, a system of linear equations, which we want to

More information

1 Matrices and vector spaces

1 Matrices and vector spaces Matrices and vector spaces. Which of the following statements about linear vector spaces are true? Where a statement is false, give a counter-example to demonstrate this. (a) Non-singular N N matrices

More information

NORMS ON SPACE OF MATRICES

NORMS ON SPACE OF MATRICES NORMS ON SPACE OF MATRICES. Operator Norms on Space of linear maps Let A be an n n real matrix and x 0 be a vector in R n. We would like to use the Picard iteration method to solve for the following system

More information

ME751 Advanced Computational Multibody Dynamics

ME751 Advanced Computational Multibody Dynamics ME751 Advanced Computational Multibody Dynamics Review: Elements of Linear Algebra & Calculus September 9, 2016 Dan Negrut University of Wisconsin-Madison Quote of the day If you can't convince them, confuse

More information

Lecture 15 Review of Matrix Theory III. Dr. Radhakant Padhi Asst. Professor Dept. of Aerospace Engineering Indian Institute of Science - Bangalore

Lecture 15 Review of Matrix Theory III. Dr. Radhakant Padhi Asst. Professor Dept. of Aerospace Engineering Indian Institute of Science - Bangalore Lecture 15 Review of Matrix Theory III Dr. Radhakant Padhi Asst. Professor Dept. of Aerospace Engineering Indian Institute of Science - Bangalore Matrix An m n matrix is a rectangular or square array of

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology

Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology Review (probability, linear algebra) CE-717 : Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Some slides have been adopted from Prof. H.R. Rabiee s and also Prof. R. Gutierrez-Osuna

More information

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices

Lecture 13: Simple Linear Regression in Matrix Format. 1 Expectations and Variances with Vectors and Matrices Lecture 3: Simple Linear Regression in Matrix Format To move beyond simple regression we need to use matrix algebra We ll start by re-expressing simple linear regression in matrix form Linear algebra is

More information

Lecture Summaries for Linear Algebra M51A

Lecture Summaries for Linear Algebra M51A These lecture summaries may also be viewed online by clicking the L icon at the top right of any lecture screen. Lecture Summaries for Linear Algebra M51A refers to the section in the textbook. Lecture

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information

RECURSIVE ESTIMATION AND KALMAN FILTERING

RECURSIVE ESTIMATION AND KALMAN FILTERING Chapter 3 RECURSIVE ESTIMATION AND KALMAN FILTERING 3. The Discrete Time Kalman Filter Consider the following estimation problem. Given the stochastic system with x k+ = Ax k + Gw k (3.) y k = Cx k + Hv

More information

Review of some mathematical tools

Review of some mathematical tools MATHEMATICAL FOUNDATIONS OF SIGNAL PROCESSING Fall 2016 Benjamín Béjar Haro, Mihailo Kolundžija, Reza Parhizkar, Adam Scholefield Teaching assistants: Golnoosh Elhami, Hanjie Pan Review of some mathematical

More information

1 Computing with constraints

1 Computing with constraints Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Review Packet 1 B 11 B 12 B 13 B = B 21 B 22 B 23 B 31 B 32 B 33 B 41 B 42 B 43

Review Packet 1 B 11 B 12 B 13 B = B 21 B 22 B 23 B 31 B 32 B 33 B 41 B 42 B 43 Review Packet. For each of the following, write the vector or matrix that is specified: a. e 3 R 4 b. D = diag{, 3, } c. e R 3 d. I. For each of the following matrices and vectors, give their dimension.

More information

Matrices and systems of linear equations

Matrices and systems of linear equations Matrices and systems of linear equations Samy Tindel Purdue University Differential equations and linear algebra - MA 262 Taken from Differential equations and linear algebra by Goode and Annin Samy T.

More information

Orthonormal Transformations and Least Squares

Orthonormal Transformations and Least Squares Orthonormal Transformations and Least Squares Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University of Oslo October 30, 2009 Applications of Qx with Q T Q = I 1. solving

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Convergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit

Convergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit Convergence of Square Root Ensemble Kalman Filters in the Large Ensemble Limit Evan Kwiatkowski, Jan Mandel University of Colorado Denver December 11, 2014 OUTLINE 2 Data Assimilation Bayesian Estimation

More information

1 Matrices and Systems of Linear Equations. a 1n a 2n

1 Matrices and Systems of Linear Equations. a 1n a 2n March 31, 2013 16-1 16. Systems of Linear Equations 1 Matrices and Systems of Linear Equations An m n matrix is an array A = (a ij ) of the form a 11 a 21 a m1 a 1n a 2n... a mn where each a ij is a real

More information

Linear Algebra. Matrices Operations. Consider, for example, a system of equations such as x + 2y z + 4w = 0, 3x 4y + 2z 6w = 0, x 3y 2z + w = 0.

Linear Algebra. Matrices Operations. Consider, for example, a system of equations such as x + 2y z + 4w = 0, 3x 4y + 2z 6w = 0, x 3y 2z + w = 0. Matrices Operations Linear Algebra Consider, for example, a system of equations such as x + 2y z + 4w = 0, 3x 4y + 2z 6w = 0, x 3y 2z + w = 0 The rectangular array 1 2 1 4 3 4 2 6 1 3 2 1 in which the

More information

COMP 558 lecture 18 Nov. 15, 2010

COMP 558 lecture 18 Nov. 15, 2010 Least squares We have seen several least squares problems thus far, and we will see more in the upcoming lectures. For this reason it is good to have a more general picture of these problems and how to

More information

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix Definition: Let L : V 1 V 2 be a linear operator. The null space N (L) of L is the subspace of V 1 defined by N (L) = {x

More information

Fundamentals of Engineering Analysis (650163)

Fundamentals of Engineering Analysis (650163) Philadelphia University Faculty of Engineering Communications and Electronics Engineering Fundamentals of Engineering Analysis (6563) Part Dr. Omar R Daoud Matrices: Introduction DEFINITION A matrix is

More information

Lecture 7. Econ August 18

Lecture 7. Econ August 18 Lecture 7 Econ 2001 2015 August 18 Lecture 7 Outline First, the theorem of the maximum, an amazing result about continuity in optimization problems. Then, we start linear algebra, mostly looking at familiar

More information

Solutions to Final Practice Problems Written by Victoria Kala Last updated 12/5/2015

Solutions to Final Practice Problems Written by Victoria Kala Last updated 12/5/2015 Solutions to Final Practice Problems Written by Victoria Kala vtkala@math.ucsb.edu Last updated /5/05 Answers This page contains answers only. See the following pages for detailed solutions. (. (a x. See

More information