Mathematical statistics: Estimation theory


EE 30: Networked estimation and control, Prof. Khan

I. BASIC STATISTICS

The sample space Ω is the set of all possible outcomes of a random experiment. A sample point is any member of Ω and is denoted by ω. An event is any subset of Ω. We denote by A a class of subsets of Ω to which we can assign probabilities. For technical mathematical reasons it may not be possible to assign a probability to every subset of Ω (an example is the Vitali set when Ω is R); however, A is always taken to be a sigma-algebra, which by definition is a non-empty class of events closed under countable unions, intersections, and complementation. A probability distribution (or measure) is a non-negative function P on A such that:

(i) P(Ω) = 1;
(ii) if A_1, A_2, ... are pairwise disjoint sets in A, then P(∪_i A_i) = Σ_i P(A_i).

The three objects (Ω, A, P) are collectively called the probability model. From now on we will assume Ω = R^n. Then the sigma-algebra A is the Borel sigma-algebra B^n on R^n. In short, we will work with the triple (R^n, B^n, P); however, we can follow similar arguments when the sample space is some well-defined subset of R^n and the sigma-field is adjusted accordingly.

Let B ⊆ Ω = R be measurable. A continuous random variable x is such that P(x ∈ B) has the form

P(x ∈ B) = ∫_B p(x) dx

for some integrable function p. Since P(x ∈ Ω) = 1, p must integrate to one, and since P(x ∈ B) ≥ 0, it can be shown that p is non-negative. A non-negative function that integrates to one is called a probability density function (pdf). We will assume that p(x) is a continuous function. A probability distribution on R is defined by the relation

P({x : x ∈ B}) = ∫_B p(x) dx.

Since we assume the pdf to be continuous, it can be verified that the distribution is also continuous; in fact, defined as above, P is absolutely continuous.
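As a quick numerical illustration of the defining relation P(x ∈ B) = ∫_B p(x) dx, the following minimal Python sketch compares the integral of a pdf over an event B with the empirical frequency of that event; the standard-normal density and the interval B = [−1, 0.5] are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Event B = [a, b] on the real line, density p = standard normal (assumed).
a, b = -1.0, 0.5

# P(x in B) computed from the density: the integral of p over B.
prob_from_pdf = norm.cdf(b) - norm.cdf(a)

# P(x in B) estimated empirically from samples of x.
x = rng.normal(size=200_000)
prob_empirical = np.mean((x >= a) & (x <= b))

print(prob_from_pdf, prob_empirical)  # the two agree to Monte Carlo accuracy
```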

In analogy with the univariate case, we say that two random variables are jointly continuous with joint density p(x, y) such that

P((x, y) ∈ B) = ∫∫_B p(x, y) dx dy

for some integrable function p. Independent random variables x_1 and x_2 are such that, for sets A, B ∈ B, the events {x_1 ∈ A} and {x_2 ∈ B} are independent.

Lemma 1. Two random variables x_1 and x_2 are independent if and only if either one of the following is true:

p(x_1, x_2) = p(x_1) p(x_2),   P(x_1, x_2) = P(x_1) P(x_2).

The conditional probability of x ∈ A given that x ∈ B has occurred is denoted by p(x ∈ A | x ∈ B) and is defined by

p(x ∈ A | x ∈ B) = p(x ∈ A, x ∈ B) / p(x ∈ B).

Clearly, conditioning on an impossible event (with zero probability) does not make sense, and hence the conditional probability is undefined when p(x ∈ B) = 0. (There are generalizations, but we will stick to this formalism.) If x_1 and x_2 are independent, we can easily see that p(x_1 ∈ A | x_2 ∈ B) = p(x_1 ∈ A).

The mth moment of a random variable is defined as

E[x^m] = ∫_Ω x^m p(x) dx.

The first moment is called the mean, whereas the second moment is almost a variance, with a minor adjustment. The variance of a random variable is defined as

var(x) = E[(x − E[x])²],

also sometimes referred to as the second central moment (assuming that the first moment is defined, i.e., finite). Similarly, we can define the mth central moment. Clearly, the variance and the second moment are the same for a zero-mean random variable. The covariance between two random variables is defined as

cov(x_1, x_2) = E[x_1 x_2] − E[x_1] E[x_2].

The above notion of continuous random variables can be easily extended to jointly continuous random vectors in R^n. In general, we then have p : R^n → [0, ∞).

A random vector x ∈ R^n has the mean µ = E[x], where each element of this vector is the expected value of the respective random variable. The covariance matrix Σ is defined as

Σ = E[(x − E[x])(x − E[x])ᵀ]
  = E[xxᵀ − x E[x]ᵀ − E[x] xᵀ + E[x] E[x]ᵀ]
  = E[xxᵀ] − E[x] E[x]ᵀ − E[x] E[x]ᵀ + E[x] E[x]ᵀ
  = E[xxᵀ] − E[x] E[x]ᵀ.

Clearly, the diagonal entries are the variances of the individual random variables and the off-diagonal entries are the cross-covariances. By definition, the covariance matrix is symmetric. Is it positive semi-definite?

Lemma 2. The covariance matrix Σ is positive semi-definite.

Proof. Consider some v ∈ R^n:

vᵀ Σ v = vᵀ E[(x − µ)(x − µ)ᵀ] v = E[vᵀ(x − µ)(x − µ)ᵀ v] = E[(vᵀ(x − µ))²] = E[y²] ≥ 0,

for y = vᵀ(x − µ); this is because the second moment is always non-negative (think about the mean of a non-negative random variable).

Two random variables are said to be uncorrelated if their covariance is zero. Recall cov(x_1, x_2) = E[x_1 x_2] − E[x_1] E[x_2]; two random variables with E[x_1 x_2] = 0 (i.e., orthogonal random variables) are uncorrelated only if they are zero-mean. Notice the relation between independent and uncorrelated random variables: in general, being uncorrelated is much weaker than being independent. However, for normal random variables the two are the same. Why?

II. JOINTLY CONTINUOUS RANDOM VARIABLES

Two random variables x and y are jointly continuous with joint density p(x, y) such that P((x, y) ∈ B) = ∫∫_B p(x, y) dx dy for some integrable function p. The marginal density of x is given by p(x) = ∫ p(x, y) dy, and similarly for the marginal density of y.
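Returning to Lemma 2, the following sketch estimates a covariance matrix from samples (the particular µ and Σ below are assumed for illustration) and verifies that the estimate is symmetric, has non-negative eigenvalues, and that vᵀΣv equals the second moment of vᵀ(x − µ).

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw samples of a random vector x in R^3 with an assumed mean and covariance.
Sigma_true = np.array([[ 2.0, 0.6, -0.3],
                       [ 0.6, 1.0,  0.2],
                       [-0.3, 0.2,  1.5]])
mu_true = np.array([1.0, -2.0, 0.5])
x = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)

# Sample covariance: E[(x - mu)(x - mu)^T] estimated by averaging outer products.
mu_hat = x.mean(axis=0)
Sigma_hat = (x - mu_hat).T @ (x - mu_hat) / x.shape[0]

# Symmetric by construction; eigenvalues are non-negative (Lemma 2).
print(np.allclose(Sigma_hat, Sigma_hat.T))
print(np.linalg.eigvalsh(Sigma_hat))          # all >= 0

# v' Sigma v = E[(v'(x - mu))^2] >= 0 for any v.
v = rng.normal(size=3)
print(v @ Sigma_hat @ v, np.mean(((x - mu_hat) @ v) ** 2))
```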

III. NORMAL RANDOM VECTORS

A random vector is said to be distributed according to a multivariate normal distribution when its probability density function is given by

p(x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) ),   x ∈ R^n.

The density function is parameterized by the vector µ and the matrix Σ. It can be shown that

E[x] = µ,   E[(x − E[x])(x − E[x])ᵀ] = Σ,

i.e., the constants µ and Σ are the mean and covariance matrix of the random vector x. The multivariate normal is arguably the most used and most well-behaved density function.

Let us focus on the bivariate normal. Consider x and y to be jointly continuous random variables with joint normal density

p(x, y) = (1/(2π |Σ|^{1/2})) exp( −(1/2) [x − µ_x, y − µ_y] Σ⁻¹ [x − µ_x, y − µ_y]ᵀ ).

The matrix Σ is the covariance matrix of the density. Let

Σ = [ σ_x²  σ_xy ; σ_xy  σ_y² ].

Then

Σ⁻¹ = (1/|Σ|) [ σ_y²  −σ_xy ; −σ_xy  σ_x² ],   |Σ| = σ_x² σ_y² − σ_xy².

The exponent in the bivariate pdf becomes

[x − µ_x, y − µ_y] Σ⁻¹ [x − µ_x, y − µ_y]ᵀ
  = ( σ_y² (x − µ_x)² − 2σ_xy (x − µ_x)(y − µ_y) + σ_x² (y − µ_y)² ) / (σ_x² σ_y² − σ_xy²)
  = ( u² − 2 (σ_xy/(σ_x σ_y)) uv + v² ) / ( 1 − σ_xy²/(σ_x² σ_y²) ),

where u = (x − µ_x)/σ_x and v = (y − µ_y)/σ_y.
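As a sanity check on the multivariate normal density formula, the sketch below evaluates it directly and compares against scipy.stats.multivariate_normal; the particular µ, Σ, and evaluation point are assumed for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Multivariate normal density evaluated directly from the formula
# p(x) = (2*pi)^(-n/2) |Sigma|^(-1/2) exp(-(x-mu)' Sigma^{-1} (x-mu) / 2).
def mvn_pdf(x, mu, Sigma):
    n = mu.size
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)        # (x-mu)' Sigma^{-1} (x-mu)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([0.5, 0.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # same value
```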

Let ρ = σ_xy/(σ_x σ_y); then

1 − ρ² = 1 − σ_xy²/(σ_x² σ_y²) = (σ_x² σ_y² − σ_xy²)/(σ_x² σ_y²),

and the exponent is

( u² − 2ρuv + v² ) / (1 − ρ²).

In short, the bivariate normal density can be written as

p(x, y) = (1/(2π σ_x σ_y √(1 − ρ²))) exp( −( u² − 2ρuv + v² ) / (2(1 − ρ²)) ).

What happens when ρ = 0? ρ = 0 ⇔ σ_xy = 0, i.e., x and y are uncorrelated. We can easily see that when ρ = 0, p(x, y) factors into p(x) and p(y), justifying that uncorrelatedness in jointly normal variables implies independence.

Lemma 3. Two jointly normal random variables are independent if and only if they are uncorrelated.

Lemma 4. The normal distribution is completely characterized by its first two moments.

Remark 1. If x and y are jointly normal, then the marginals of x and y are also normal.

Remark 2. If x and y are jointly normal, then the conditional density of x | y (and similarly y | x) is also normal.

Remark 3. If x and y are independent normal random variables distributed as N(µ_x, σ_x²) and N(µ_y, σ_y²) respectively, then x + y is also normal. In particular, x + y ∼ N(µ_x + µ_y, σ_x² + σ_y²).

Remark 4. If x and y are independent random variables with densities p_x and p_y, then the density of z = x + y is given by the convolution of p_x and p_y.

Proof of Remark 1. The marginal density of x is

p(x) = ∫ p(x, y) dy = ∫ (1/(2π σ_x σ_y √(1 − ρ²))) exp( −( u² − 2ρuv + v² ) / (2(1 − ρ²)) ) dy.

For simplicity, consider µ_x = µ_y = 0 and σ_x = σ_y = 1, so that ρ = σ_xy. Then

p(x) = (1/(2π √(1 − σ_xy²))) ∫ exp( −( x² − 2σ_xy x y + y² ) / (2(1 − σ_xy²)) ) dy,

and, completing the square in y,

x² − 2σ_xy x y + y² = (y − σ_xy x)² + (1 − σ_xy²) x²,

so the integral becomes

∫ exp( −( x² − 2σ_xy x y + y² ) / (2(1 − σ_xy²)) ) dy
  = exp(−x²/2) ∫ exp( −(y − σ_xy x)² / (2(1 − σ_xy²)) ) dy
  = exp(−x²/2) √(2π(1 − σ_xy²)),

because the integrand is (up to the normalizing constant) the pdf of a normal random variable with mean σ_xy x and variance 1 − σ_xy². Now back to p(x):

p(x) = (1/(2π √(1 − σ_xy²))) exp(−x²/2) √(2π(1 − σ_xy²)) = (1/√(2π)) exp(−x²/2),

which is the pdf of a N(0, 1) random variable. Remarkably, the marginal is independent of the covariance σ_xy between x and y.

Proof of Remark 2. Again, for simplicity, we consider standard normal variables (zero mean and variance one), so ρ = σ_xy. Then

p(x | y) = p(x, y)/p(y)
  = [ (1/(2π√(1 − σ_xy²))) exp( −( x² − 2σ_xy x y + y² ) / (2(1 − σ_xy²)) ) ] / [ (1/√(2π)) exp(−y²/2) ]
  = (1/(√(2π)√(1 − σ_xy²))) exp( −( x² − 2σ_xy x y + y² − (1 − σ_xy²) y² ) / (2(1 − σ_xy²)) )
  = (1/(√(2π)√(1 − σ_xy²))) exp( −( x² − 2σ_xy x y + σ_xy² y² ) / (2(1 − σ_xy²)) )
  = (1/(√(2π)√(1 − σ_xy²))) exp( −( x − σ_xy y )² / (2(1 − σ_xy²)) ),

which is normal with mean σ_xy y and variance 1 − σ_xy². In general, if x ∼ N(µ_x, σ_x²) and y ∼ N(µ_y, σ_y²) are jointly normal with correlation coefficient ρ, then

x | y ∼ N( µ_x + ρ (σ_x/σ_y)(y − µ_y),  σ_x²(1 − ρ²) ).   (Exercise)

Again, if x and y are uncorrelated, then ρ = 0 and x | y ∼ N(µ_x, σ_x²). In other words, p(x | y) = p(x), implying independence between x and y.

Proof of Remark 3. The formal proof is left as an exercise.

Proof of Remark 4. To prove this, let us first consider the characteristic function of a random variable. The characteristic function of a scalar random variable x is defined as

ϕ_x(ν) = E[e^{jνx}].

By definition, E[e^{jνx}] = ∫ e^{jνx} p(x) dx, which is the Fourier transform F of the pdf p(x). Similarly, we have the inversion formula

p(x) = (1/(2π)) ∫ e^{−jνx} ϕ_x(ν) dν.

Clearly, the characteristic function is unique for every pdf: saying that a random variable has a pdf p is equivalent to saying that it has a characteristic function Fp. Now, for z = x + y,

p_z(z) = (1/(2π)) ∫ e^{−jνz} ϕ_z(ν) dν
  = (1/(2π)) ∫ e^{−jνz} ϕ_x(ν) ϕ_y(ν) dν
  = (1/(2π)) ∫ e^{−jνz} ϕ_x(ν) [ ∫ e^{jνy} p_y(y) dy ] dν
  = ∫ [ (1/(2π)) ∫ e^{−jν(z − y)} ϕ_x(ν) dν ] p_y(y) dy
  = ∫ p_x(z − y) p_y(y) dy.

Exercise: the second equality requires showing that ϕ_z(ν) = ϕ_x(ν) ϕ_y(ν) when z = x + y and x, y are independent.
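Remarks 3 and 4 can also be checked numerically: the sketch below discretizes the convolution p_z = p_x * p_y on a grid and compares it with the N(µ_x + µ_y, σ_x² + σ_y²) density; the means and variances used are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

# z = x + y with independent x ~ N(mu_x, s_x^2), y ~ N(mu_y, s_y^2) (values assumed).
mu_x, s_x = 1.0, 0.8
mu_y, s_y = -0.5, 1.3

dt = 0.01
t = np.arange(-12, 12, dt)
p_x = norm.pdf(t, mu_x, s_x)
p_y = norm.pdf(t, mu_y, s_y)

# Discrete approximation of the convolution integral  p_z(z) = \int p_x(z - y) p_y(y) dy.
p_z = np.convolve(p_x, p_y) * dt
z = 2 * t[0] + dt * np.arange(p_z.size)   # grid on which the convolution lives

# Remark 3: the result should be the N(mu_x + mu_y, s_x^2 + s_y^2) density.
p_z_closed_form = norm.pdf(z, mu_x + mu_y, np.sqrt(s_x**2 + s_y**2))
print(np.max(np.abs(p_z - p_z_closed_form)))   # small discretization error
```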

IV. MAXIMUM-LIKELIHOOD ESTIMATORS

"The maximum likelihood principle is deceptively simple." — Louis L. Scharf

Let x denote a random variable whose pdf p_θ(x) is parameterized by the unknown parameter θ. For example, consider Fig. 1, which shows two densities, one with θ = θ_1 and the other with θ = θ_2.

[Fig. 1. Typical density functions, one for each value of θ, with the observed x marked.]

Suppose that x is observed. Based on the prior model p_θ(x) (Fig. 1), we can say that x is more probably observed when θ = θ_1 than when θ = θ_2. More generally, there may be a unique value of θ for which x is more probably observed than for any other. We call this value of θ the maximum likelihood estimate and denote it by θ̂:

θ̂ = arg max_θ p_θ(x).

The function l(θ; x) = p_θ(x) is called the likelihood function, and its logarithm

L(θ; x) = ln p_θ(x)   (3)

is called the log-likelihood function. When L(θ; x) is continuously differentiable in θ, the ML estimate may be determined by differentiating the log-likelihood function. The ML estimate is then the root of the ML equation

(∂/∂θ) L(θ; x) = (∂/∂θ) ln p_θ(x) = 0.   (4)

We will assume that there is only one value of θ for which the derivative is 0.

Example 1. What is the ML estimate of θ when we observe y = θ + r, where r ∼ N(µ, σ²)?

First we compute L(θ; y) = ln p_θ(y). We note that when y = θ + r, y is a normal random variable, N(θ + µ, σ²). We have

(∂/∂θ) ln p_θ(y) = (∂/∂θ) { −ln √(2πσ²) − (y − µ − θ)²/(2σ²) } = (y − µ − θ)/σ²,

which implies that

(y − µ − θ̂)/σ² = 0  ⟹  θ̂ = y − µ.

Example 2. What is the ML estimate of θ when we observe y = θ1_n + r, where r ∼ N(µ1_n, σ²I)? In other words, the noise vector r consists of i.i.d. (independent, identically distributed) random variables, each distributed as N(µ, σ²). From our observation model, we note that y ∼ N((θ + µ)1_n, σ²I), and y is also a collection of independent rvs. Hence the joint density of y is given by the product of the marginals:

ln p_θ(y) = Σ_{i=1}^n [ −ln √(2πσ²) − (y_i − µ − θ)²/(2σ²) ] = −n ln √(2πσ²) − Σ_{i=1}^n (y_i − µ − θ)²/(2σ²).

This leads to

(∂/∂θ) ln p_θ(y) = Σ_{i=1}^n (y_i − µ − θ)/σ² = 0  ⟹  Σ_{i=1}^n (y_i − µ) = nθ̂.

Finally, we obtain

θ̂ = (1/n) Σ_{i=1}^n (y_i − µ).

A very important point: so far θ is deterministic but unknown, whereas θ̂ is a random variable.
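The estimator of Example 2 is easy to simulate; in the minimal sketch below, the true θ, µ, σ, and n are assumed values chosen only to illustrate the computation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Example 2, simulated: observe y = theta*1_n + r with r ~ N(mu*1_n, sigma^2 I).
theta_true = 3.0                 # deterministic but unknown to the estimator
mu, sigma, n = 0.5, 2.0, 50
y = theta_true + rng.normal(mu, sigma, size=n)

# ML estimate: theta_hat = (1/n) * sum_i (y_i - mu).
theta_hat = np.mean(y - mu)
print(theta_hat)                 # close to theta_true, error of order sigma/sqrt(n)
```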

Example 3. What is the mean of θ̂?

E[θ̂] = E[ (1/n) Σ_{i=1}^n (y_i − µ) ] = (1/n) Σ_{i=1}^n E[y_i − µ] = (1/n) Σ_{i=1}^n (µ + θ − µ) = θ.

An estimator with the property that E[θ̂] = θ is said to be unbiased.

Example 4. What is the variance of θ̂? The following Eq. (5) requires an adjustment for the non-zero mean of the y_i's; for now, assume µ = 0 and θ = 0, so that the y_i (and hence θ̂) are zero-mean and the variance equals the second moment. Then

E[θ̂²] = (1/n²) E[ ( Σ_{i=1}^n y_i )² ] = (1/n²) Σ_{i=1}^n E[y_i²],   (5)

because all the cross-terms yield E[y_i y_j], i ≠ j, which is 0 since the y_i are i.i.d. (and zero-mean). It is straightforward to show that, in general,

var(θ̂) = E[(θ̂ − E[θ̂])²] = σ²/n.

When n = 1, we are reduced to the first example; the estimate in the first example is also unbiased and has variance σ². As we add more observations (n > 1), the variance (uncertainty) in the estimate scales down by n, and goes to 0 as n → ∞. In this case, the ML estimate has the property that its variance goes to 0 as n → ∞.

Example 5. What is the distribution (density) of θ̂? (This can be easily generalized.)
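Examples 3-5 can be verified by Monte Carlo: repeating the experiment many times, the sample mean, variance, and histogram of θ̂ should match θ, σ²/n, and a normal shape, respectively. The numbers below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

theta_true, mu, sigma, n = 3.0, 0.5, 2.0, 50
trials = 20_000

# Repeat the experiment many times and look at the sampling distribution of theta_hat.
y = theta_true + rng.normal(mu, sigma, size=(trials, n))
theta_hat = np.mean(y - mu, axis=1)

print(theta_hat.mean())          # ~ theta_true     (Example 3: unbiased)
print(theta_hat.var())           # ~ sigma^2 / n    (Example 4)
print(sigma**2 / n)
# Example 5: theta_hat is a linear combination of Gaussians, hence
# theta_hat ~ N(theta, sigma^2/n); a histogram of theta_hat confirms this.
```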

Example 6. What is the ML estimate of θ ∈ R^n when we observe y = Aθ + r, where r ∈ R^m, r ∼ N(µ, Σ)? Clearly, A ∈ R^{m×n}. The noise density is

p(r) = (2π)^{−m/2} |Σ|^{−1/2} exp( −(1/2)(r − µ)ᵀ Σ⁻¹ (r − µ) ).   (6)

From the observation model, y − Aθ is distributed as N(µ, Σ). We have the log-likelihood function

ln p_θ(y) = −ln( (2π)^{m/2} |Σ|^{1/2} ) − (1/2)(y − Aθ − µ)ᵀ Σ⁻¹ (y − Aθ − µ),

which leads to

(∂/∂θ) ln p_θ(y) = (y − Aθ − µ)ᵀ Σ⁻¹ A = 0
⟹ Aᵀ Σ⁻¹ (y − Aθ̂ − µ) = 0
⟹ Aᵀ Σ⁻¹ y − Aᵀ Σ⁻¹ µ = Aᵀ Σ⁻¹ A θ̂
⟹ θ̂ = (Aᵀ Σ⁻¹ A)⁻¹ ( Aᵀ Σ⁻¹ y − Aᵀ Σ⁻¹ µ ).

Recall the matrix-calculus identity

(∂/∂x) (Ax + b)ᵀ C (Dx + e) = (Dx + e)ᵀ Cᵀ A + (Ax + b)ᵀ C D.

(i) Is the above estimate unbiased? (ii) What is the covariance of the above estimate? (iii) What is the distribution of the above estimate?
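A hedged sketch of Example 6: simulate y = Aθ + r with an assumed A, µ, and Σ, and apply the closed-form estimator (np.linalg.solve is used in place of an explicit inverse of AᵀΣ⁻¹A).

```python
import numpy as np

rng = np.random.default_rng(4)

# Example 6, simulated: y = A*theta + r with r ~ N(mu, Sigma), A of size m x n (assumed).
m, n = 20, 3
A = rng.normal(size=(m, n))
theta_true = np.array([1.0, -2.0, 0.5])
mu = 0.3 * np.ones(m)
Sigma = 0.5 * np.eye(m) + 0.1 * np.ones((m, m))     # an assumed noise covariance
r = rng.multivariate_normal(mu, Sigma)
y = A @ theta_true + r

# theta_hat = (A' Sigma^{-1} A)^{-1} (A' Sigma^{-1} y - A' Sigma^{-1} mu)
Si = np.linalg.inv(Sigma)
theta_hat = np.linalg.solve(A.T @ Si @ A, A.T @ Si @ (y - mu))
print(theta_hat)    # close to theta_true; its covariance is (A' Sigma^{-1} A)^{-1}
```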

V. BAYESIAN ESTIMATORS

Mother Nature conducts a random experiment that generates a parameter θ from a probability density function p(θ). This parameter θ then encodes (or parameterizes) the conditional (or measurement) density f(x|θ). A random experiment generates a measurement x from f(x|θ). The problem is to estimate θ from x. We denote the estimate by θ̂(x). The Bayesian setup consists of the following notions.

Loss function: The quality of the estimate θ̂(x) is measured by a real-valued loss function. Some examples are:

Quadratic loss function: L(θ, θ̂(x)) = [θ − θ̂(x)]ᵀ[θ − θ̂(x)].
Binary (0-1) loss function: L(θ, θ̂(x)) = 0 if θ̂(x) = θ, and 1 otherwise.

Risk: The risk can be defined as the average loss over the density f(x|θ). The risk basically addresses what the average loss (or risk) associated with the estimate θ̂(x) is. Mathematically,

R(θ, θ̂) = E_x[ L(θ, θ̂(x)) ] = ∫ L(θ, θ̂(x)) f(x|θ) dx.

The notation E_x indicates that the expectation is over the distribution of the random measurement x, with θ and θ̂ fixed.

Bayes risk: The Bayes risk is the risk averaged over the prior distribution on θ:

R(p, θ̂) = E_θ[ R(θ, θ̂) ] = ∫ R(θ, θ̂) p(θ) dθ = ∫∫ L(θ, θ̂(x)) f(x|θ) p(θ) dx dθ = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ.

Bayes risk estimator: The Bayes risk estimator minimizes the Bayes risk:

θ̂_B = arg min_{θ̂} R(p, θ̂),

i.e., the value of θ̂ that minimizes the Bayes risk.

The Bayes risk estimator is a rule for mapping the observations x into estimates θ̂_B(x). It depends on the conditional distribution of the measurements and the prior distribution of the parameter. When this prior is not known, the mini-max principle may be used.

Mini-max estimator: Suppose Mother Nature (M) does not like the experimentalist (E) and tries to maximize the average risk for any choice of θ̂:

max_p R(p, θ̂).

We can turn this into a game between M and E by allowing E to observe the resulting average risk and permitting him/her to choose a decision rule that minimizes this max average risk:

min_{θ̂} max_p R(p, θ̂).

The estimator that does this is called the mini-max estimator θ̂:

θ̂ = arg min_{θ̂} max_p R(p, θ̂).

There are other variants of this setup, and they lead to some fundamental questions in game theory.

VI. COMPUTING BAYES RISK ESTIMATORS

Recall that the Bayes risk is given by

R(p, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ,

where f(x, θ) = f(x|θ) p(θ). From Bayes rule we have f(x, θ) = f(θ|x) f(x), where f(θ|x) is the posterior density of θ given x and f(x) is the marginal density of x:

f(θ|x) = f(x, θ)/f(x) = f(x|θ) p(θ)/f(x),   f(x) = ∫ f(x, θ) dθ = ∫ f(x|θ) p(θ) dθ.

There is an important physical interpretation of the first formula. The prior density is mapped to the posterior density by the ratio of the conditional measurement density to the marginal density,

f(θ|x) = ( f(x|θ)/f(x) ) p(θ),

i.e., the data x is used to map the prior into the posterior. The Bayes risk estimator is thus

θ̂_B(x) = arg min_{θ̂} ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ
  = arg min_{θ̂} ∫∫ L(θ, θ̂(x)) f(θ|x) f(x) dx dθ
  = arg min_{θ̂} ∫ [ ∫ L(θ, θ̂(x)) f(θ|x) dθ ] f(x) dx
  = arg min_{θ̂} ∫ L(θ, θ̂(x)) f(θ|x) dθ   (conditional Bayes risk),

since the marginal density f(x) is non-negative. The result says that the Bayes risk estimator is the estimator that minimizes the conditional risk; the conditional risk is the loss averaged over the conditional distribution of θ given x. Now, to compute a particular estimator, we need to consider some typical loss functions.

Quadratic loss function: When the loss function is quadratic, L(θ, θ̂(x)) = [θ − θ̂(x)]ᵀ[θ − θ̂(x)], we may write the conditional Bayes risk as

∫ L(θ, θ̂(x)) f(θ|x) dθ = ∫ [θ − θ̂(x)]ᵀ[θ − θ̂(x)] f(θ|x) dθ.

The gradient of the conditional risk w.r.t. θ̂ is

(∂/∂θ̂) ∫ [θ − θ̂(x)]ᵀ[θ − θ̂(x)] f(θ|x) dθ = ∫ (∂/∂θ̂) [θ − θ̂(x)]ᵀ[θ − θ̂(x)] f(θ|x) dθ = −2 ∫ [θ − θ̂(x)] f(θ|x) dθ,

and the second derivative is

(∂²/∂θ̂ ∂θ̂ᵀ) ∫ [θ − θ̂(x)]ᵀ[θ − θ̂(x)] f(θ|x) dθ = 2I > 0.

Setting the gradient to zero, the Bayes risk estimator becomes

∫ [θ − θ̂(x)] f(θ|x) dθ = 0  ⟹  ∫ θ f(θ|x) dθ = ∫ θ̂(x) f(θ|x) dθ = θ̂(x)
⟹  θ̂_B(x) = ∫ θ f(θ|x) dθ = E[θ | x].

We say that the Bayes risk estimator under the quadratic loss is the conditional mean of the posterior, i.e., of θ given x. In a nutshell, Bayes estimation under quadratic loss comes down to the computation of the mean of the conditional density f(θ|x). Nonlinear filtering is a generic term for this calculation, because generally the result is a nonlinear function of the measurement x.

Uniform loss function: Assume that the loss function is

L(θ, θ̂(x)) = 0 if |θ − θ̂(x)| ≤ ε, and 1 if |θ − θ̂(x)| > ε,

where ε > 0. Based on this loss function, the conditional Bayes risk is given by

∫ L(θ, θ̂(x)) f(θ|x) dθ = E[L(θ, θ̂(x)) | x] = 1 · P(|θ − θ̂(x)| > ε | x) + 0 · P(|θ − θ̂(x)| ≤ ε | x)
  = 1 − P(|θ − θ̂(x)| ≤ ε | x) = 1 − ∫_{θ̂(x)−ε}^{θ̂(x)+ε} f(θ|x) dθ.

The above is minimized when the negative term is maximized:

θ̂(x) = arg max_{θ̂} ∫_{θ̂−ε}^{θ̂+ε} f(θ|x) dθ.

In the limit ε → 0, the above becomes

θ̂(x) = arg max_θ f(θ|x),

which is the MAP (maximum a posteriori) estimator.
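Both estimators can be computed numerically for a scalar problem by evaluating the posterior on a grid. The sketch below is an illustration only; the exponential prior and Gaussian likelihood are assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import norm, expon

# Grid-based posterior: prior p(theta), measurement density f(x|theta), posterior f(theta|x).
theta = np.linspace(0.0, 10.0, 4001)                 # grid over the parameter
prior = expon.pdf(theta)                             # p(theta): exponential(1) prior (assumed)
x = 2.3                                              # the observed measurement (assumed)
likelihood = norm.pdf(x, loc=theta, scale=1.0)       # f(x|theta): x = theta + N(0,1) noise (assumed)

posterior = prior * likelihood                       # f(theta|x) up to the marginal f(x)
posterior /= np.trapz(posterior, theta)              # divide by f(x) so it integrates to one

theta_quadratic = np.trapz(theta * posterior, theta) # quadratic loss -> posterior mean
theta_map = theta[np.argmax(posterior)]              # uniform loss, eps -> 0 -> MAP
print(theta_quadratic, theta_map)
```

Because the posterior here is not symmetric, the two estimates differ, illustrating that the choice of loss function matters.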

Example. A radioactive source emits n radioactive particles and an imperfect Geiger counter records k ≤ n of them. Our problem is to estimate n from the measurement k. We assume that n is drawn from a Poisson distribution with known parameter λ:

P[n] = e^{−λ} λⁿ/n!,   n ≥ 0.

The Poisson distribution characterizes the rate of emission of a process in a given interval of time or space: it is likely to see a large n when the expected number of occurrences λ is high, and a small n when the expected number of occurrences λ is small. We can show that E[n] = λ and E[(n − E[n])²] = λ.

The number of recorded counts follows a binomial distribution:

P[k | n] = (n choose k) p^k (1 − p)^{n−k},   0 ≤ k ≤ n,   E[k | n] = np,   var(k | n) = np(1 − p).

The binomial distribution is the sum of i.i.d. Bernoulli trials: suppose an rv is 1 with probability p and 0 with probability 1 − p; then the binomial distribution characterizes the total number of 1's we may observe over n trials.

In order to proceed with the Bayesian analysis, we need to compute the posterior distribution of n | k:

P[n | k] = P[n, k]/P[k],

which requires the joint and the marginals. We have

P[n, k] = P[k | n] P[n] = (n choose k) p^k (1 − p)^{n−k} e^{−λ} λⁿ/n!,   0 ≤ k ≤ n,   n ≥ 0.

The marginal of k is

P[k] = Σ_{n=k}^∞ (n choose k) p^k (1 − p)^{n−k} e^{−λ} λⁿ/n!
  = ((λp)^k e^{−λ}/k!) Σ_{n=k}^∞ (λ(1 − p))^{n−k}/(n − k)!
  = ((λp)^k e^{−λ}/k!) e^{λ(1−p)}
  = ((λp)^k/k!) e^{−λp},

which is Poisson with rate λp. Now the posterior is

P[n | k] = P[n, k]/P[k]
  = [ (n!/(k!(n − k)!)) p^k (1 − p)^{n−k} e^{−λ} λⁿ/n! ] / [ ((λp)^k/k!) e^{−λp} ]
  = ( (λ(1 − p))^{n−k} / (n − k)! ) e^{−λ(1−p)},   n ≥ k,

which is similar to a Poisson but, instead of starting from 0, n starts from k. This has been called the Poisson distribution with displacement k. The conditional mean and variance are

E[n | k] = λ(1 − p) + k   (Exercise),   E[(n − E[n | k])² | k] = λ(1 − p)   (Exercise).

When the loss function is quadratic, the optimal Bayes estimator is the conditional mean, and thus

n̂_B = E[n | k] = λ(1 − p) + k.

The Bayes estimate is k when p = 1, independent of the expected number of occurrences λ; since our measurement model is Bernoulli, we can show that P[k = n | n] = 1 when p = 1. Similarly, when p = 0, i.e., we see no observations almost surely, the Bayes estimate is λ, which is the expected number of occurrences. For 0 < p < 1, the Bayes estimate optimally combines the two extremes. We can also think of λ(1 − p) as the expected number of missed counts; in this sense, the Bayes estimate applies a correction to include the missed counts.

One can easily show that E[n̂_B] = λ = E[n] (Exercise), i.e., the estimate is unbiased. However, it is not conditionally unbiased, i.e.,

E[n̂_B | n] = E[k | n] + λ(1 − p) = np + λ(1 − p) ≠ n.

The mean squared error in the estimator is

E[(n − n̂_B)²] = λ(1 − p)   (Exercise).
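The Geiger-counter example can be checked by simulation: draw (n, k) pairs, apply n̂_B = λ(1 − p) + k, and compare the empirical averages with the claimed properties. The values of λ and p below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Geiger-counter example, simulated: n ~ Poisson(lambda), k | n ~ Binomial(n, p).
lam, p, trials = 8.0, 0.6, 500_000
n = rng.poisson(lam, size=trials)
k = rng.binomial(n, p)

n_hat = lam * (1 - p) + k                        # Bayes estimate under quadratic loss

print(np.mean(n_hat), lam)                       # unbiased on average: E[n_hat] = lambda
print(np.mean((n - n_hat) ** 2), lam * (1 - p))  # mean squared error = lambda*(1 - p)

# Conditional mean E[n | k] for one value of k, estimated empirically:
k0 = 5
print(np.mean(n[k == k0]), lam * (1 - p) + k0)
```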

VII. MULTIVARIATE NORMAL

Let x and y be jointly distributed according to the normal distribution:

[x; y] ∼ N( [0; 0], [R_xx  R_xy; R_yx  R_yy] ).

Recall that the marginals are also normal, i.e.,

x ∼ N(0, R_xx),   y ∼ N(0, R_yy),

where R_xx = E[xxᵀ] and so on. It can be shown that

y | x ∼ N( R_yx R_xx⁻¹ x,  R_yy − R_yx R_xx⁻¹ R_xy ),
x | y ∼ N( R_xy R_yy⁻¹ y,  R_xx − R_xy R_yy⁻¹ R_yx ).

Hence, the optimal Bayes estimate under quadratic loss is the mean of the posterior, i.e.,

x̂_B = R_xy R_yy⁻¹ y.

We can think of this as Mother Nature generating x from the p(x) = N(0, R_xx) distribution and Father Nature generating a measurement from f(y|x), which is also normal. What function relating y to x will result in the above f(y|x)? Recalling that the sum of two normal random variables is also normal, note that

y = Hx + r,   with H = R_yx R_xx⁻¹ and r ∼ N(0, Q) statistically independent from x,

will result in the above f(y|x). In other words, we can generate the jointly normal x and y process described above by two statistically independent normal random vectors, x ∼ N(0, R_xx) and r ∼ N(0, Q), and by relating y and x as above. While generating this signal-plus-measurement model (i.e., x being a signal and y = Hx + r being the measurement), we define one new matrix Q; R_yx and R_yy are directly given by R_xx and Q. Clearly R_xy = R_yxᵀ. Show that R_yy = R_yx R_xx⁻¹ R_xy + Q. In short, one can generate a jointly normal random process by two independent normal processes and a linear map.

VIII. LINEAR STATISTICAL MODEL: SIGNAL PLUS NOISE

Let x ∼ N(µ_x, R_xx) and r ∼ N(0, R_rr) be uncorrelated, with measurements y = Hx + r. Then

µ_y = E[y] = Hµ_x,   E[y | x] = Hx,

and

E[yyᵀ] = E[(Hx + r)(Hx + r)ᵀ] = E[Hxxᵀ Hᵀ] + E[rrᵀ] = H(R_xx + µ_x µ_xᵀ)Hᵀ + R_rr,

so

E[(y − µ_y)(y − µ_y)ᵀ] = E[yyᵀ] − µ_y µ_yᵀ = H R_xx Hᵀ + H µ_x µ_xᵀ Hᵀ + R_rr − H µ_x µ_xᵀ Hᵀ = H R_xx Hᵀ + R_rr.

So the marginal is y ∼ N(Hµ_x, H R_xx Hᵀ + R_rr). To find the joint, we need

E[(x − µ_x)(y − µ_y)ᵀ] = E[(x − µ_x)(Hx + r − Hµ_x)ᵀ] = E[(x − µ_x)(x − µ_x)ᵀ]Hᵀ = R_xx Hᵀ.

Finally, x and y are jointly normal:

[x; y] ∼ N( [µ_x; µ_y], [R_xx  R_xx Hᵀ; H R_xx  H R_xx Hᵀ + R_rr] ).

Recall that

x | y ∼ N( µ_x + R_xx Hᵀ (H R_xx Hᵀ + R_rr)⁻¹ (y − µ_y),  R_xx − R_xx Hᵀ (H R_xx Hᵀ + R_rr)⁻¹ H R_xx ),

and the optimal Bayes estimate under quadratic loss is

x̂_B = E[x | y] = µ_x + R_xx Hᵀ (H R_xx Hᵀ + R_rr)⁻¹ (y − µ_y).
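A minimal sketch of the signal-plus-noise estimate x̂_B for assumed model matrices; the dimensions and numerical values below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(6)

# Signal-plus-noise model: x ~ N(mu_x, R_xx), r ~ N(0, R_rr), y = H x + r (values assumed).
nx, ny = 3, 2
mu_x = np.array([1.0, 0.0, -1.0])
R_xx = np.diag([1.0, 2.0, 0.5])
H = rng.normal(size=(ny, nx))
R_rr = 0.2 * np.eye(ny)

x = rng.multivariate_normal(mu_x, R_xx)
y = H @ x + rng.multivariate_normal(np.zeros(ny), R_rr)

# x_B = mu_x + R_xx H' (H R_xx H' + R_rr)^{-1} (y - H mu_x)
S = H @ R_xx @ H.T + R_rr
x_B = mu_x + R_xx @ H.T @ np.linalg.solve(S, y - H @ mu_x)
print(x, x_B)
```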

Example. Let x be a scalar random variable distributed as p(x) = N(µ_x, σ_x²). Let y be a measurement such that y = cx + r, where c ∈ R, r ∼ N(0, σ_r²), and E[xr] = 0. To find the Bayes estimate under quadratic loss (subsumed from here on), one needs the posterior density f(x|y). First find the joint between x and y:

µ_y = E[y] = E[cx + r] = cE[x] = cµ_x,
E[(x − µ_x)(y − µ_y)] = E[(x − µ_x)(cx + r − cµ_x)] = cE[(x − µ_x)²] = cσ_x²,
E[y²] = E[c²x² + 2cxr + r²] = c²(σ_x² + µ_x²) + σ_r²,
E[(y − µ_y)²] = c²(σ_x² + µ_x²) + σ_r² − c²µ_x² = c²σ_x² + σ_r².

Hence

[x; y] ∼ N( [µ_x; cµ_x], [σ_x²  cσ_x²; cσ_x²  c²σ_x² + σ_r²] ),

and the Bayes estimate of x is

x̂_B = µ_x + ( cσ_x²/(c²σ_x² + σ_r²) ) (y − µ_y) = µ_x + ( cσ_x²/(c²σ_x² + σ_r²) ) (y − cµ_x),

with conditional variance σ_x² − c²σ_x⁴/(c²σ_x² + σ_r²).

The estimator is a linear function of the prior information on x (i.e., µ_x) and of y − cµ_x. What is y − cµ_x? This is often termed the innovation. One may think of cµ_x as the predicted observation ŷ. Essentially, from our model we predict what observation we should see (something we already knew about the observation), and then subtract it from the actual observation y:

Innovation = Observed − (what can be predicted in the observed),

hence the term innovation, i.e., what is the new knowledge provided by the observation? On the innovations, note that

E[y − cµ_x] = 0,
E[(y − cµ_x)²] = E[y²] − 2cµ_x E[y] + c²µ_x² = c²σ_x² + c²µ_x² + σ_r² − 2c²µ_x² + c²µ_x² = c²σ_x² + σ_r².

What happens when we have a perfect observation, i.e., σ_r² = 0 (provided c ≠ 0)?

lim_{σ_r²→0} x̂_B = µ_x + ( cσ_x²/(c²σ_x²) ) (y − cµ_x) = µ_x + (1/c)(y − cµ_x) = y/c.

What happens when x is not random, i.e., P(x = µ_x) = 1?

lim_{σ_x²→0} x̂_B = lim_{σ_x²→0} [ µ_x + ( cσ_x²/(c²σ_x² + σ_r²) ) (y − cµ_x) ] = lim_{σ_x²→0} [ µ_x + ( c/(c² + σ_r²/σ_x²) ) (y − cµ_x) ] = µ_x.

What happens in between? Rewriting the estimate as a convex combination,

x̂_B = ( σ_r²/(c²σ_x² + σ_r²) ) µ_x + ( c²σ_x²/(c²σ_x² + σ_r²) ) (y/c)
    = ( 1/(c²σ_x²/σ_r² + 1) ) µ_x + ( (c²σ_x²/σ_r²)/(c²σ_x²/σ_r² + 1) ) (y/c),

we obtain

lim_{c²σ_x²/σ_r² → 0} x̂_B = µ_x,   lim_{c²σ_x²/σ_r² → ∞} x̂_B = y/c.

The term c²σ_x²/σ_r² can be thought of as a signal-to-noise ratio (SNR), a very important object in signal processing. In short, several intuitive scenarios can be sketched using the above relations.

A. Analysis

Consider the following signal-plus-noise model: y = Hx + n, where x ∼ N(0, R_xx) and n ∼ N(0, R_nn) are statistically independent. The Bayes estimate under quadratic loss is the conditional mean of x | y:

x̂_B = R_xx Hᵀ (H R_xx Hᵀ + R_nn)⁻¹ y = G y,

with conditional covariance

P = R_xx − R_xx Hᵀ (H R_xx Hᵀ + R_nn)⁻¹ H R_xx = R_xx − G H R_xx = ( R_xx⁻¹ + Hᵀ R_nn⁻¹ H )⁻¹,

where the last equality comes from the matrix inversion lemma. We have

P⁻¹ P = P⁻¹ ( R_xx − G H R_xx ) = I
⟹ ( R_xx⁻¹ + Hᵀ R_nn⁻¹ H ) R_xx − P⁻¹ G H R_xx = I
⟹ I + Hᵀ R_nn⁻¹ H R_xx − P⁻¹ G H R_xx = I
⟹ P⁻¹ G H R_xx = Hᵀ R_nn⁻¹ H R_xx
⟹ G = P Hᵀ R_nn⁻¹.

Hence, the estimator can be re-written as

x̂_B = P Hᵀ R_nn⁻¹ y,   P = ( R_xx⁻¹ + Hᵀ R_nn⁻¹ H )⁻¹.
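The equivalence of the two forms of the gain and covariance is easy to check numerically for randomly generated (assumed) R_xx, R_nn, and H, as the sketch below shows.

```python
import numpy as np

rng = np.random.default_rng(7)

# Check that the two forms of P and G coincide (matrix inversion lemma), for assumed matrices.
nx, ny = 4, 3
A = rng.normal(size=(nx, nx)); R_xx = A @ A.T + nx * np.eye(nx)   # positive definite
B = rng.normal(size=(ny, ny)); R_nn = B @ B.T + ny * np.eye(ny)
H = rng.normal(size=(ny, nx))

G = R_xx @ H.T @ np.linalg.inv(H @ R_xx @ H.T + R_nn)
P_direct = R_xx - G @ H @ R_xx
P_info = np.linalg.inv(np.linalg.inv(R_xx) + H.T @ np.linalg.inv(R_nn) @ H)

print(np.allclose(P_direct, P_info))                          # covariance: both forms agree
print(np.allclose(G, P_info @ H.T @ np.linalg.inv(R_nn)))     # gain: G = P H' R_nn^{-1}
```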

IX. SEQUENTIAL BAYES

The results of the previous section may be used to derive recursive estimates of the random vector x when the measurement vector y^t = [y_0, y_1, ..., y_t]ᵀ increases in dimension with time. The basic idea is to write

y^t = H_t x + n^t,   i.e.,   [y^{t−1}; y_t] = [H_{t−1}; c_tᵀ] x + [n^{t−1}; n_t],

i.e., the tth measurement can be written as y_t = c_tᵀ x + n_t, where x ∼ N(0, R_xx) and n_t ∼ N(0, r_tt) are statistically independent. Assume the noise to be i.i.d. Gaussian, n^t ∼ N(0, R_t); then R_t is diagonal with the elements r_tt on the diagonal. This means that

R_t = [R_{t−1}  0; 0  r_tt].

The joint distribution of x and y^t is

[x; y^t] ∼ N( [0; 0], [R_xx  R_xx H_tᵀ; H_t R_xx  H_t R_xx H_tᵀ + R_t] ).

The posterior is

x | y^t ∼ N( x̂_t, P_t ),   x̂_t = P_t H_tᵀ R_t⁻¹ y^t,   P_t⁻¹ = R_xx⁻¹ + H_tᵀ R_t⁻¹ H_t.

X. RECURSIVE BAYES

The dimensions of H_t, R_t, and y^t increase in time, whereas they are fixed for x̂_t and P_t. How can we make the estimate equations recursive?

P_t⁻¹ = R_xx⁻¹ + [H_{t−1}ᵀ  c_t] [R_{t−1}  0; 0  r_tt]⁻¹ [H_{t−1}; c_tᵀ]
     = R_xx⁻¹ + H_{t−1}ᵀ R_{t−1}⁻¹ H_{t−1} + c_t r_tt⁻¹ c_tᵀ
     = P_{t−1}⁻¹ + c_t r_tt⁻¹ c_tᵀ.

Moving on:

P_t⁻¹ x̂_t = H_tᵀ R_t⁻¹ y^t = [H_{t−1}ᵀ  c_t] [R_{t−1}⁻¹  0; 0  r_tt⁻¹] [y^{t−1}; y_t]
  = H_{t−1}ᵀ R_{t−1}⁻¹ y^{t−1} + c_t r_tt⁻¹ y_t
  = P_{t−1}⁻¹ x̂_{t−1} + c_t r_tt⁻¹ y_t
  = ( P_t⁻¹ − c_t r_tt⁻¹ c_tᵀ ) x̂_{t−1} + c_t r_tt⁻¹ y_t.

With these recursions,

x̂_t = P_t ( P_t⁻¹ − c_t r_tt⁻¹ c_tᵀ ) x̂_{t−1} + P_t c_t r_tt⁻¹ y_t
    = x̂_{t−1} + P_t c_t r_tt⁻¹ ( y_t − c_tᵀ x̂_{t−1} ).

The final filtering equations are:

x̂_t = x̂_{t−1} + k_t ( y_t − c_tᵀ x̂_{t−1} ),
k_t = P_t c_t r_tt⁻¹,
P_t⁻¹ = P_{t−1}⁻¹ + c_t r_tt⁻¹ c_tᵀ.

Notice the term y_t − c_tᵀ x̂_{t−1}: here ŷ_t = c_tᵀ x̂_{t−1} is the predicted observation, and y_t − ŷ_t is the innovation. There is another key property of the innovations: for t' > t,

E[ (y_{t'} − ŷ_{t'})(y_t − ŷ_t) ] = 0   (Exercise; hint: E[(y_{t'} − ŷ_{t'})(y_t − ŷ_t)] = E[ E[(y_{t'} − ŷ_{t'})(y_t − ŷ_t) | y^t] ]),

i.e., the innovation sequence is uncorrelated over time, as opposed to the observation sequence. Furthermore, it can be shown that there is a correspondence between the observation sequence and the innovation sequence. From this,

x̂_t = E[x | y^t] = E[x | y^t − ŷ^t].
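The filtering equations can be run directly. The sketch below processes scalar measurements of a static x ∈ R² (the dimensions, noise variance, and measurement vectors c_t are assumed for illustration) and confirms that the recursion reproduces the batch sequential-Bayes estimate.

```python
import numpy as np

rng = np.random.default_rng(8)

# Recursive Bayes for a static x in R^2 observed through scalar measurements
# y_t = c_t' x + n_t (model values below are assumptions for illustration).
nx, T = 2, 30
r_tt = 0.5                               # measurement noise variance, same at every t
R_xx = np.diag([2.0, 1.0])
x = rng.multivariate_normal(np.zeros(nx), R_xx)

x_hat = np.zeros(nx)                     # prior mean
P = R_xx.copy()                          # prior covariance
C, Y = [], []
for t in range(T):
    c = rng.normal(size=nx)              # c_t
    y = c @ x + rng.normal(0.0, np.sqrt(r_tt))

    # Covariance update in information form:  P_t^{-1} = P_{t-1}^{-1} + c c' / r_tt
    P = np.linalg.inv(np.linalg.inv(P) + np.outer(c, c) / r_tt)
    k = P @ c / r_tt                     # gain k_t = P_t c_t / r_tt
    x_hat = x_hat + k * (y - c @ x_hat)  # innovation update

    C.append(c); Y.append(y)

# Batch (sequential-Bayes) solution on the stacked measurements, for comparison.
H, yv = np.array(C), np.array(Y)
P_batch = np.linalg.inv(np.linalg.inv(R_xx) + H.T @ H / r_tt)
x_batch = P_batch @ H.T @ yv / r_tt
print(np.allclose(x_hat, x_batch), x_hat, x)
```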
