Lecture 13 and 14: Bayesian estimation theory


Spring 2012 - EE 194 Networked estimation and control (Prof. Khan)
March 26, 2012

I. BAYESIAN ESTIMATORS

Mother Nature conducts a random experiment that generates a parameter $\theta$ from a probability density function $p(\theta)$. This parameter $\theta$ then codes (or parameterizes) the conditional (or measurement) density $f(x \mid \theta)$. A random experiment generates a measurement $x$ from $f(x \mid \theta)$. The problem is to estimate $\theta$ from $x$. We denote the estimate by $\hat{\theta}(x)$. The Bayesian setup consists of the following notions.

Loss function: The quality of the estimate $\hat{\theta}(x)$ is measured by a real-valued loss function. Some examples are:

Quadratic loss function: $L(\theta, \hat{\theta}(x)) = [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]$.

Binary (0-1) loss function: $L(\theta, \hat{\theta}(x)) = 0$ if $\hat{\theta}(x) = \theta$, and $1$ otherwise.

Risk: The risk is defined as the loss averaged over the density $f(x \mid \theta)$; it addresses the question of what the average loss (or risk) associated with the estimate $\hat{\theta}(x)$ is. Mathematically,
$$R(\theta, \hat{\theta}) = E_x\big[L(\theta, \hat{\theta}(x))\big] = \int L(\theta, \hat{\theta}(x))\, f(x \mid \theta)\, dx.$$
The notation $E_x$ indicates that the expectation is over the distribution of the random measurement $x$ (with $\theta$ fixed).

Bayes risk: The Bayes risk is the risk averaged over the prior distribution on $\theta$:
$$R(p, \hat{\theta}) = E_\theta\big[R(\theta, \hat{\theta})\big] = \int R(\theta, \hat{\theta})\, p(\theta)\, d\theta = \int\!\!\int L(\theta, \hat{\theta}(x))\, \underbrace{f(x \mid \theta)\, p(\theta)}_{f(x,\theta)}\, dx\, d\theta.$$

Bayes risk estimator: The Bayes risk estimator minimizes the Bayes risk:
$$\hat{\theta}_B = \arg\min_{\hat{\theta}} R(p, \hat{\theta}),$$
i.e., the $\hat{\theta}$ that minimizes the Bayes risk.
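To make these definitions concrete, here is a minimal numerical sketch (Python, not part of the original notes) that approximates the risk and the Bayes risk on a grid. The scalar model $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 1)$, the grid limits, and the estimator $\hat{\theta}(x) = x/2$ are all assumptions chosen for illustration.

    # Sketch: approximate R(theta, delta) and the Bayes risk R(p, delta) on a grid,
    # for an assumed scalar model theta ~ N(0,1), x | theta ~ N(theta,1),
    # quadratic loss, and the (assumed) estimator delta(x) = x/2.
    import numpy as np
    from scipy.stats import norm

    theta_grid = np.linspace(-6, 6, 601)   # grid over the parameter
    x_grid = np.linspace(-10, 10, 1001)    # grid over the measurement
    dth = theta_grid[1] - theta_grid[0]
    dx = x_grid[1] - x_grid[0]

    prior = norm.pdf(theta_grid, 0, 1)     # p(theta)
    delta = lambda x: x / 2.0              # an arbitrary estimator to evaluate

    # Risk R(theta, delta) = E_x[ L(theta, delta(x)) ] with theta fixed
    def risk(theta):
        fx = norm.pdf(x_grid, theta, 1)            # f(x | theta)
        loss = (theta - delta(x_grid)) ** 2        # quadratic loss
        return np.sum(loss * fx) * dx

    R_theta = np.array([risk(th) for th in theta_grid])

    # Bayes risk: the risk averaged over the prior
    bayes_risk = np.sum(R_theta * prior) * dth
    print(bayes_risk)   # ~ 0.5; for this model delta(x) = x/2 is in fact the posterior mean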

The Bayes risk estimator is a rule for mapping the observations $x$ into estimates $\hat{\theta}_B(x)$. It depends on the conditional distribution of the measurements and on the prior distribution of the parameter. When this prior is not known, the mini-max principle may be used.

Mini-max estimator: Suppose an experimentalist (E) chooses an estimator $\hat{\theta}$ and Mother Nature (M) is allowed to choose her prior after the experimentalist has made his/her choice. If Mother Nature does not like the experimentalist, she will try to maximize the average risk for any choice $\hat{\theta}$:
$$\max_{p} R(p, \hat{\theta}).$$
We can turn this into a game between M and E by allowing E to observe the resulting average risk and permitting him/her to choose a decision rule that minimizes this maximum average risk:
$$\min_{\hat{\theta}} \max_{p} R(p, \hat{\theta}).$$
The estimator that achieves this is called the mini-max estimator $\hat{\theta}_{mm}$:
$$\hat{\theta}_{mm} = \arg\min_{\hat{\theta}} \max_{p} R(p, \hat{\theta}).$$
There are other variants of this setup, and it leads to very fundamental questions in game theory.

II. COMPUTING BAYES RISK ESTIMATORS

Recall that the Bayes risk is given by
$$R(p, \hat{\theta}) = \int\!\!\int L(\theta, \hat{\theta}(x))\, f(x, \theta)\, dx\, d\theta,$$
where $f(x, \theta) = f(x \mid \theta)\, p(\theta)$. From Bayes' rule we have $f(x, \theta) = f(\theta \mid x)\, f(x)$, where $f(\theta \mid x)$ is the posterior density of $\theta$ given $x$ and $f(x)$ is the marginal density of $x$:
$$f(\theta \mid x) = \frac{f(x \mid \theta)}{f(x)}\, p(\theta), \qquad f(x) = \int f(x, \theta)\, d\theta = \int f(x \mid \theta)\, p(\theta)\, d\theta.$$
There is an important physical interpretation of the first formula. The prior density is mapped to the posterior density by the ratio of the conditional measurement density to the marginal density,
$$p(\theta) \;\xrightarrow{\;x\;}\; f(\theta \mid x),$$
i.e., the data $x$ is used to map the prior into the posterior.
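The prior-to-posterior map can be carried out numerically on a grid. The following sketch (an added illustration, not from the original notes) assumes, purely for the example, the model $\theta \sim N(0,1)$, $x \mid \theta \sim N(\theta, 0.5^2)$, and an observed value $x = 1.3$.

    # Sketch of the prior-to-posterior map on a grid, under an assumed model.
    import numpy as np
    from scipy.stats import norm

    theta = np.linspace(-5, 5, 2001)
    dth = theta[1] - theta[0]

    prior = norm.pdf(theta, 0, 1)              # p(theta)
    x_obs = 1.3                                # assumed observed measurement
    likelihood = norm.pdf(x_obs, theta, 0.5)   # f(x | theta), viewed as a function of theta

    marginal = np.sum(likelihood * prior) * dth    # f(x) = integral of f(x|theta) p(theta) dtheta
    posterior = likelihood * prior / marginal      # f(theta | x)

    print(np.sum(posterior) * dth)   # ~ 1: the posterior integrates to one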

The Bayes risk estimator is thus
$$\hat{\theta}_B(x) = \arg\min_{\hat{\theta}} \int\!\!\int L(\theta, \hat{\theta}(x))\, f(x, \theta)\, dx\, d\theta
 = \arg\min_{\hat{\theta}} \int\!\!\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, f(x)\, dx\, d\theta
 = \arg\min_{\hat{\theta}} \int \left( \int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta \right) f(x)\, dx,$$
so, for each $x$,
$$\hat{\theta}_B(x) = \arg\min_{\hat{\theta}(x)} \underbrace{\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta}_{\text{conditional Bayes risk}},$$
since the marginal density $f(x)$ is non-negative. The result says that the Bayes risk estimator is the estimator that minimizes the conditional risk, where the conditional risk is the loss averaged over the conditional distribution of $\theta$ given $x$. To compute a particular estimator we need to consider some typical loss functions.

Quadratic loss function: When the loss function is quadratic,
$$L(\theta, \hat{\theta}(x)) = [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)],$$
we may write the conditional Bayes risk as
$$\int L(\theta, \hat{\theta}(x))\, f(\theta \mid x)\, d\theta = \int [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta.$$
The gradient of this risk with respect to $\hat{\theta}(x)$ is
$$\frac{\partial}{\partial \hat{\theta}(x)} \int [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta
 = \int \frac{\partial}{\partial \hat{\theta}(x)} \left( [\theta - \hat{\theta}(x)]^T [\theta - \hat{\theta}(x)] \right) f(\theta \mid x)\, d\theta
 = -2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta,$$
and the second derivative (the Hessian) is
$$\frac{\partial}{\partial \hat{\theta}(x)^T} \left( -2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta \right) = 2I > 0,$$
so the stationary point is a minimum. Setting the gradient to zero,
$$2 \int [\theta - \hat{\theta}(x)]\, f(\theta \mid x)\, d\theta = 0
 \;\Longrightarrow\; \int \theta\, f(\theta \mid x)\, d\theta = \hat{\theta}(x) \int f(\theta \mid x)\, d\theta,$$
and therefore
$$\hat{\theta}_B(x) = \int \theta\, f(\theta \mid x)\, d\theta = E(\theta \mid x).$$

We say that the Bayes risk estimator under the quadratic loss function is the conditional mean of $\theta$ given $x$. In a nutshell, Bayes estimation under quadratic loss comes down to computing the mean of the conditional density $f(\theta \mid x)$. Nonlinear filtering is a generic term for this calculation, because the result is generally a nonlinear function of the measurement $x$.

Uniform loss function: Assume that the loss function is
$$L(\theta, \hat{\theta}(x)) = \begin{cases} 0, & |\theta - \hat{\theta}(x)| \le \varepsilon, \\ 1, & |\theta - \hat{\theta}(x)| > \varepsilon, \end{cases}$$
where $\varepsilon > 0$. Under this loss function, the expected posterior loss becomes
$$E\big[L(\theta, \hat{\theta}(x)) \mid x\big] = 1 \cdot P(|\theta - \hat{\theta}(x)| > \varepsilon) + 0 \cdot P(|\theta - \hat{\theta}(x)| \le \varepsilon)
 = 1 - P(|\theta - \hat{\theta}(x)| \le \varepsilon)
 = 1 - \int_{\hat{\theta}(x) - \varepsilon}^{\hat{\theta}(x) + \varepsilon} f(\theta \mid x)\, d\theta.$$
The above is minimized when the subtracted term is maximized:
$$\hat{\theta}(x) = \arg\max_{\hat{\theta}(x)} \int_{\hat{\theta}(x) - \varepsilon}^{\hat{\theta}(x) + \varepsilon} f(\theta \mid x)\, d\theta.$$
In the limit as $\varepsilon \to 0$, the above becomes
$$\lim_{\varepsilon \to 0} \hat{\theta}(x) = \arg\max_{\theta} f(\theta \mid x),$$
which is the MAP estimator.
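The two estimators can be read off directly from a gridded posterior: the quadratic-loss estimate is the posterior mean, and the uniform-loss (as $\varepsilon \to 0$) estimate is the posterior mode. In the sketch below (an added illustration), the skewed posterior, taken to be a Gamma density, is an assumption made so that the mean and the MAP visibly differ.

    # Posterior mean (quadratic loss) vs. MAP (uniform loss, eps -> 0) from a gridded posterior.
    import numpy as np
    from scipy.stats import gamma

    theta = np.linspace(1e-3, 20, 4000)
    dth = theta[1] - theta[0]
    posterior = gamma.pdf(theta, a=3.0, scale=1.5)   # assumed stand-in for f(theta | x)
    posterior /= np.sum(posterior) * dth             # normalize on the grid

    post_mean = np.sum(theta * posterior) * dth      # Bayes estimate under quadratic loss
    theta_map = theta[np.argmax(posterior)]          # MAP estimate

    print(post_mean, theta_map)   # mean ~ 4.5, mode ~ 3.0 for this skewed example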

Lecture 14: Wednesday

Example 1: A radioactive source emits $n$ radioactive particles and an imperfect Geiger counter records $k \le n$ of them. Our problem is to estimate $n$ from the measurement $k$. We assume that $n$ is drawn from a Poisson distribution with known parameter $\lambda$:
$$P[n] = e^{-\lambda}\, \frac{\lambda^n}{n!}, \qquad n \ge 0.$$
The Poisson distribution characterizes the number of emissions of a process in a given interval of time (or space). It is likely to produce a large $n$ when the expected number of occurrences $\lambda$ is high, and a small $n$ when $\lambda$ is small. We can show that $E[n] = \lambda$ and $E[(n - E[n])^2] = \lambda$.

The number of recorded counts follows a binomial distribution:
$$P[k \mid n] = \binom{n}{k} p^k (1-p)^{n-k}, \qquad 0 \le k \le n,$$
with $E[k \mid n] = np$ and $\mathrm{var}[k \mid n] = np(1-p)$. The binomial distribution is the distribution of a sum of i.i.d. Bernoulli trials: suppose a random variable is $1$ with probability $p$ and $0$ with probability $1-p$; then the binomial distribution characterizes the total number of $1$'s we may observe over $n$ trials.

To proceed with the Bayesian analysis, we need to compute the posterior distribution of $n$ given $k$:
$$P[n \mid k] = \frac{P[n, k]}{P[k]},$$
which requires the joint distribution and the marginals. We have
$$P[n, k] = P[k \mid n]\, P[n] = \binom{n}{k} p^k (1-p)^{n-k}\, e^{-\lambda}\, \frac{\lambda^n}{n!}, \qquad 0 \le k \le n, \quad n \ge 0.$$
The marginal of $k$ is
$$P[k] = \sum_{n=k}^{\infty} \binom{n}{k} p^k (1-p)^{n-k}\, e^{-\lambda}\, \frac{\lambda^n}{n!}
 = \sum_{n=k}^{\infty} e^{-\lambda}\, \frac{(\lambda p)^k\, (\lambda(1-p))^{n-k}}{k!\,(n-k)!}
 = \frac{(\lambda p)^k e^{-\lambda}}{k!} \sum_{n=k}^{\infty} \frac{(\lambda(1-p))^{n-k}}{(n-k)!}
 = \frac{(\lambda p)^k}{k!}\, e^{-\lambda + \lambda - \lambda p}
 = \frac{(\lambda p)^k}{k!}\, e^{-\lambda p},$$

which is Poisson with rate $\lambda p$. Now the posterior is
$$P[n \mid k] = \frac{P[n, k]}{P[k]}
 = \frac{\dfrac{n!}{k!(n-k)!}\, p^k (1-p)^{n-k}\, e^{-\lambda}\, \dfrac{\lambda^n}{n!}}{e^{-\lambda p}\, \dfrac{(\lambda p)^k}{k!}}
 = \frac{1}{(n-k)!}\, (\lambda(1-p))^{n-k}\, e^{-\lambda(1-p)}, \qquad n \ge k,$$
which is similar to a Poisson distribution, except that $n$ starts from $k$ instead of $0$. This has been called the Poisson distribution with displacement $k$. The conditional mean is
$$E[n \mid k] = \sum_{n=k}^{\infty} n\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}
 = \sum_{n=k}^{\infty} (n-k+k)\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}$$
$$\qquad = \sum_{n=k}^{\infty} (n-k)\, \frac{(\lambda(1-p))^{n-k}}{(n-k)!}\, e^{-\lambda(1-p)}
 + k\, e^{-\lambda(1-p)} \sum_{n=k}^{\infty} \frac{(\lambda(1-p))^{n-k}}{(n-k)!}
 = \lambda(1-p) + k,$$
and the conditional variance is
$$E\big[(n - E[n \mid k])^2 \mid k\big] = \lambda(1-p) \qquad \text{(Exercise).}$$

When the loss function is quadratic, the optimal Bayes estimator is the conditional mean, and thus
$$\hat{n}_B = E[n \mid k] = \lambda(1-p) + k.$$
The Bayes estimate is $k$ when $p = 1$, independent of the expected number of occurrences $\lambda$; since our measurement model is Bernoulli, we can show that $P(k = n \mid n) = 1$ when $p = 1$. Similarly, when $p = 0$, i.e., we see no observations almost surely, the Bayes estimate is $\lambda$, which is the expected number of occurrences. For $0 < p < 1$, the Bayes estimate optimally combines the two extremes. We can also think of $\lambda(1-p)$ as the expected number of missed counts; in this sense, the Bayes estimate applies a correction to include the missed counts.

One can easily show that $E[\hat{n}_B] = \lambda = E[n]$, i.e., the estimate is unbiased. However, it is not conditionally unbiased, i.e.,
$$E[\hat{n}_B \mid n] = E[k \mid n] + \lambda(1-p) = np + \lambda(1-p) \ne n.$$
The mean squared error of the estimator is
$$E[(n - \hat{n}_B)^2] = E_k\!\left( E[(n - \hat{n}_B)^2 \mid k] \right) = E_k\!\left( E[(n - E(n \mid k))^2 \mid k] \right) = E_k\big(\lambda(1-p)\big) = \lambda(1-p).$$
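A quick Monte Carlo check of the closed-form answer is straightforward (this sketch is an added illustration; the particular values of $\lambda$, $p$, and the queried $k$ are assumptions): simulate the two-stage experiment and compare the empirical conditional mean of $n$ against $\lambda(1-p) + k$.

    # Monte Carlo sketch of Example 1: n ~ Poisson(lambda), k | n ~ Binomial(n, p).
    import numpy as np

    rng = np.random.default_rng(0)
    lam, p, k_query = 20.0, 0.6, 10       # assumed values for illustration
    N = 500_000

    n = rng.poisson(lam, size=N)          # Mother Nature draws n
    k = rng.binomial(n, p)                # the Geiger counter records k of the n particles

    empirical = n[k == k_query].mean()    # ~ E[n | k = k_query]
    closed_form = lam * (1 - p) + k_query # Bayes estimate under quadratic loss

    print(empirical, closed_form)         # both ~ 18 here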

III. MULTIVARIATE NORMAL

Let $x$ and $y$ be jointly distributed according to the normal distribution:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{bmatrix} \right).$$
Recall that the marginals are also normal, i.e.,
$$x \sim N(0, R_{xx}), \qquad y \sim N(0, R_{yy}),$$
where $R_{xx} = E[xx^T]$ and so on. It can be shown that
$$y \mid x \sim N\big(R_{yx} R_{xx}^{-1} x,\; R_{yy} - R_{yx} R_{xx}^{-1} R_{xy}\big), \qquad
x \mid y \sim N\big(R_{xy} R_{yy}^{-1} y,\; R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}\big).$$
Hence the optimal Bayes estimate of $x$ under quadratic loss is the mean of the posterior, i.e.,
$$\hat{x}_B = R_{xy} R_{yy}^{-1} y.$$
We can think of this as Mother Nature generating $x$ from $p(x) = N(0, R_{xx})$ and Father Nature generating a measurement from $f(y \mid x)$, which is also normal. What function relating $y$ to $x$ will result in the above $f(y \mid x)$? Recalling that the sum of two normal random variables is also normal, note that
$$y = Hx + r, \qquad H = R_{yx} R_{xx}^{-1}, \qquad r \sim N(0, Q) \text{ statistically independent of } x,$$
will result in the above $f(y \mid x)$. In other words, we can generate the jointly normal $x$ and $y$ process described above from two statistically independent normal random vectors $x \sim N(0, R_{xx})$ and $r \sim N(0, Q)$ by relating $y$ and $x$ as above. When generating this signal plus measurement model, i.e., $x$ being a signal and $y = Hx + r$ being the measurement, we define only one new matrix, $R_{yx}$, and $R_{yy}$ is directly given by $R_{xx}$ and $Q$. Clearly $R_{xy} = R_{yx}^T$. Show that $R_{yy} = R_{yx} R_{xx}^{-1} R_{xy} + Q$. In short, one can generate a jointly normal random process from two independent normal processes and a linear map.
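The construction can be checked by simulation. The sketch below (an added illustration) assumes a particular $2$-dimensional $R_{xx}$, a $1$-dimensional measurement correlation $R_{yx}$, and noise covariance $Q$; it draws $x$ and $r$ independently, forms $y = Hx + r$ with $H = R_{yx} R_{xx}^{-1}$, and compares the sample covariance of $y$ with $R_{yy} = R_{yx} R_{xx}^{-1} R_{xy} + Q$.

    # Sketch of the jointly normal construction y = H x + r.
    import numpy as np

    rng = np.random.default_rng(1)
    R_xx = np.array([[2.0, 0.5], [0.5, 1.0]])   # assumed signal covariance
    R_yx = np.array([[1.0, 0.3]])               # assumed 1-D measurement correlation
    Q = np.array([[0.4]])                       # assumed noise covariance

    H = R_yx @ np.linalg.inv(R_xx)
    R_yy = R_yx @ np.linalg.inv(R_xx) @ R_yx.T + Q

    N = 200_000
    x = rng.multivariate_normal(np.zeros(2), R_xx, size=N)
    r = rng.multivariate_normal(np.zeros(1), Q, size=N)
    y = x @ H.T + r

    print(np.cov(y.T), R_yy[0, 0])   # sample covariance of y vs. R_yy

    # Bayes estimate of x under quadratic loss for one measurement y[0]:
    R_xy = R_yx.T
    x_B = R_xy @ np.linalg.inv(R_yy) @ y[0]
    print(x_B)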

IV. LINEAR STATISTICAL MODEL

Consider the following signal plus noise model:
$$y = Hx + n,$$
where $x \sim N(0, R_{xx})$ and $n \sim N(0, R_{nn})$ are statistically independent. The correlation between $x$ and $y$ is
$$R_{yx} = E[yx^T] = E[(Hx + n)x^T] = H R_{xx},$$
and the covariance of $y$ is
$$R_{yy} = E[yy^T] = H R_{xx} H^T + R_{nn}.$$
Thus $x$ and $y$ are jointly normal:
$$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xx} H^T \\ H R_{xx} & H R_{xx} H^T + R_{nn} \end{bmatrix} \right).$$
Clearly, the Bayes estimate under quadratic loss is the conditional mean of $x \mid y$:
$$\hat{x}_B = \underbrace{R_{xx} H^T (H R_{xx} H^T + R_{nn})^{-1}}_{G}\, y,$$
with conditional covariance
$$P = R_{xx} - \underbrace{R_{xx} H^T (H R_{xx} H^T + R_{nn})^{-1}}_{G}\, H R_{xx}.$$
From the matrix inversion lemma, note that
$$P = \big(R_{xx}^{-1} + H^T R_{nn}^{-1} H\big)^{-1}, \qquad \text{i.e.,} \qquad P^{-1} = R_{xx}^{-1} + H^T R_{nn}^{-1} H.$$
Then
$$G H R_{xx} = R_{xx} - P = P\big(P^{-1} R_{xx} - I\big) = P\big((R_{xx}^{-1} + H^T R_{nn}^{-1} H) R_{xx} - I\big) = P\big(I + H^T R_{nn}^{-1} H R_{xx} - I\big) = P H^T R_{nn}^{-1} H R_{xx},$$
and therefore
$$G = P H^T R_{nn}^{-1}.$$
Hence the estimator can be re-written as
$$\hat{x}_B = P H^T R_{nn}^{-1} y, \qquad P = \big(R_{xx}^{-1} + H^T R_{nn}^{-1} H\big)^{-1}.$$
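The equivalence of the two forms of the gain and covariance is easy to confirm numerically. In the sketch below (an added illustration), the dimensions and the randomly generated symmetric positive definite $R_{xx}$, $R_{nn}$, and matrix $H$ are assumptions chosen only to exercise the identities.

    # Check the two equivalent forms of the gain G and covariance P in y = Hx + n.
    import numpy as np

    rng = np.random.default_rng(2)
    nx, ny = 3, 4
    A = rng.standard_normal((nx, nx)); R_xx = A @ A.T + nx * np.eye(nx)   # SPD prior covariance
    B = rng.standard_normal((ny, ny)); R_nn = B @ B.T + ny * np.eye(ny)   # SPD noise covariance
    H = rng.standard_normal((ny, nx))

    # Form 1: G = R_xx H^T (H R_xx H^T + R_nn)^{-1}, P = R_xx - G H R_xx
    G1 = R_xx @ H.T @ np.linalg.inv(H @ R_xx @ H.T + R_nn)
    P1 = R_xx - G1 @ H @ R_xx

    # Form 2 (matrix inversion lemma): P = (R_xx^{-1} + H^T R_nn^{-1} H)^{-1}, G = P H^T R_nn^{-1}
    P2 = np.linalg.inv(np.linalg.inv(R_xx) + H.T @ np.linalg.inv(R_nn) @ H)
    G2 = P2 @ H.T @ np.linalg.inv(R_nn)

    print(np.allclose(P1, P2), np.allclose(G1, G2))   # both True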

V. SEQUENTIAL BAYES

The results of the previous section may be used to derive recursive estimates of the random vector $x$ when the measurement vector $y_t = [y_0, y_1, \ldots, y_t]^T$ increases in dimension with time. The basic idea is to write
$$y_t = H_t x + n_t, \qquad
\begin{bmatrix} y_{t-1} \\ y_t \end{bmatrix} = \begin{bmatrix} H_{t-1} \\ c_t^T \end{bmatrix} x + \begin{bmatrix} n_{t-1} \\ n_t \end{bmatrix},$$
i.e., the $k$th measurement can be written as
$$y_k = c_k^T x + n_k,$$
where $x \sim N(0, R_{xx})$ and $n_t \sim N(0, R_t)$ are statistically independent, with $R_t$ diagonal with elements $r_{tt}$ on the diagonal and $R_{00} = r_{00}$. This means that
$$R_t^{-1} = \begin{bmatrix} R_{t-1}^{-1} & 0 \\ 0^T & r_{tt}^{-1} \end{bmatrix}.$$
The joint distribution of $x$ and $y_t$ is
$$\begin{bmatrix} x \\ y_t \end{bmatrix} \sim N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} R_{xx} & R_{xx} H_t^T \\ H_t R_{xx} & H_t R_{xx} H_t^T + R_t \end{bmatrix} \right).$$
The posterior is
$$x \mid y_t \sim N(\hat{x}_t, P_t), \qquad \hat{x}_t = P_t H_t^T R_t^{-1} y_t, \qquad P_t^{-1} = R_{xx}^{-1} + H_t^T R_t^{-1} H_t.$$
The dimensions of $H_t$, $R_t$, and $y_t$ increase with time, whereas the dimensions of $\hat{x}_t$ and $P_t$ are fixed. How can we make the estimate equations recursive?
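To see the issue concretely, the following sketch (an added illustration; the dimensions, $R_{xx}$, the rows $c_t$, and the variances $r_{tt}$ are all assumed) recomputes the batch formulas above at every time step, stacking the growing $H_t$, $y_t$, and $R_t$ each time. The estimate $\hat{x}_t$ and covariance $P_t$ stay fixed in size, but the batch computation grows with $t$; finding a recursive form is exactly the question posed above.

    # Batch evaluation of x_t = P_t H_t^T R_t^{-1} y_t with P_t^{-1} = R_xx^{-1} + H_t^T R_t^{-1} H_t
    # as the measurement record grows.
    import numpy as np

    rng = np.random.default_rng(3)
    nx, T = 2, 5
    R_xx = np.eye(nx)
    x = rng.multivariate_normal(np.zeros(nx), R_xx)    # the fixed random vector to estimate

    H_rows, y_list, r_list = [], [], []
    for t in range(T):
        c_t = rng.standard_normal(nx)                  # assumed measurement row c_t^T
        r_tt = 0.5                                     # assumed scalar noise variance
        y_t = c_t @ x + rng.normal(0, np.sqrt(r_tt))   # y_t = c_t^T x + n_t

        H_rows.append(c_t); y_list.append(y_t); r_list.append(r_tt)
        H_t = np.vstack(H_rows)                        # grows with t
        y_vec = np.array(y_list)                       # grows with t
        R_t_inv = np.diag(1.0 / np.array(r_list))      # R_t is diagonal

        P_t = np.linalg.inv(np.linalg.inv(R_xx) + H_t.T @ R_t_inv @ H_t)
        x_t = P_t @ H_t.T @ R_t_inv @ y_vec            # fixed dimension, recomputed in batch
        print(t, x_t)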