Technion Israel Institute of Technology, Department of Electrical Engineering
Estimation and Identification in Dynamical Systems (048825)
Lecture Notes, Fall 2009, Prof. N. Shimkin

2 Statistical Estimation: Basic Concepts

2.1 Probability

We briefly recall some basic notions and notations from probability theory that will be required in this chapter.

The Probability Space: The basic object in probability theory is the probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space (with sample points $\omega \in \Omega$), $\mathcal{F}$ is the sigma-field of possible events $B \in \mathcal{F}$, and $P$ is a probability measure, giving the probability $P(B)$ of each possible event.

A (vector-valued) Random Variable (RV) is a mapping $x : \Omega \to \mathbb{R}^n$. $x$ is also required to be measurable on $(\Omega, \mathcal{F})$, in the sense that $x^{-1}(A) \in \mathcal{F}$ for any open (or Borel) set $A$ in $\mathbb{R}^n$.

In this course we shall not explicitly define the underlying probability space, but rather define the probability distributions of the RVs of interest.
Distribution and Density: For an RV $x : \Omega \to \mathbb{R}^n$, the (cumulative) probability distribution function (cdf) is defined as
$$F_x(x) = P(x \le x) = P\{\omega : x(\omega) \le x\}, \qquad x \in \mathbb{R}^n.$$
The probability density function (pdf), if it exists, is given by
$$p_x(x) = \frac{\partial^n F_x(x)}{\partial x_1 \cdots \partial x_n}.$$
The RVs $(x_1, \dots, x_K)$ are independent if
$$F_{x_1, \dots, x_K}(x_1, \dots, x_K) = \prod_{k=1}^{K} F_{x_k}(x_k)$$
(and similarly for their densities).

Moments: The expected value (or mean) of $x$:
$$\bar{x} = E(x) = \int_{\mathbb{R}^n} x \, dF_x(x).$$
More generally, for a real function $g$ on $\mathbb{R}^n$,
$$E(g(x)) = \int_{\mathbb{R}^n} g(x) \, dF_x(x).$$
The covariance matrices:
$$\mathrm{cov}(x) = E\{(x - \bar{x})(x - \bar{x})^T\},$$
$$\mathrm{cov}(x_1, x_2) = E\{(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)^T\}.$$
When $x$ is scalar, then $\mathrm{cov}(x)$ is simply its variance. The RVs $x_1$ and $x_2$ are uncorrelated if $\mathrm{cov}(x_1, x_2) = 0$.
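As a quick numerical illustration (our addition, not part of the original notes), these moments can be estimated from samples; the two-dimensional model below is invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw N samples of a 2-dimensional RV x with dependent components.
N = 100_000
x1 = rng.normal(size=N)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=N)   # correlated with x1
x = np.stack([x1, x2], axis=1)                   # shape (N, 2)

mean = x.mean(axis=0)                # sample estimate of E(x)
cov = np.cov(x, rowvar=False)        # sample estimate of cov(x)

print("mean:", mean)                 # close to [0, 0]
print("cov:\n", cov)                 # off-diagonal close to cov(x1, x2) = 0.5
```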
Gaussian RVs: A (non-degenerate) Gaussian RV on $\mathbb{R}^n$ has the density
$$f_x(x) = \frac{1}{(2\pi)^{n/2} \det(\Sigma)^{1/2}} \, e^{-\frac{1}{2}(x - m)^T \Sigma^{-1} (x - m)}.$$
It follows that $m = E(x)$, $\Sigma = \mathrm{cov}(x)$. We denote $x \sim N(m, \Sigma)$.

$x_1$ and $x_2$ are jointly Gaussian if the random vector $(x_1; x_2)$ is Gaussian. It holds that:
1. $x$ Gaussian $\Rightarrow$ all linear combinations $\sum_i a_i x_i$ are Gaussian.
2. $x$ Gaussian $\Rightarrow$ $y = Ax$ is Gaussian.
3. $x_1, x_2$ jointly Gaussian and uncorrelated $\Rightarrow$ $x_1, x_2$ are independent.

Conditioning: For two events $A, B$, with $P(B) > 0$, define:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
The conditional distribution of $x$ given $y$:
$$F_{x|y}(x \mid y) = P(x \le x \mid y = y) := \lim_{\epsilon \to 0} P(x \le x \mid y - \epsilon < y < y + \epsilon).$$
The conditional density:
$$p_{x|y}(x \mid y) = \frac{\partial^n}{\partial x_1 \cdots \partial x_n} F_{x|y}(x \mid y) = \frac{p_{x,y}(x, y)}{p_y(y)}.$$
In the following we simply write $p(x \mid y)$ etc. when no confusion arises.
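Returning to the Gaussian density at the top of this subsection, it translates directly into code. A minimal sketch; `gauss_pdf` is our own helper name and the example numbers are invented:

```python
import numpy as np

def gauss_pdf(x, m, Sigma):
    """Density of N(m, Sigma) at point x, following the formula above."""
    n = len(m)
    d = x - m
    quad = d @ np.linalg.solve(Sigma, d)          # (x-m)^T Sigma^{-1} (x-m)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

m = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gauss_pdf(np.array([1.0, -1.0]), m, Sigma))  # density at the mean
```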
Conditional Expectation:
$$E(x \mid y = y) = \int_{\mathbb{R}^n} x \, p(x \mid y) \, dx.$$
Obviously, this is a function of $y$: $E(x \mid y = y) = g(y)$. Therefore, $E(x \mid y) = g(y)$ is an RV, and a function of $y$.

Basic properties:
- Smoothing: $E(E(x \mid y)) = E(x)$.
- Orthogonality principle: $E([x - E(x \mid y)] \, h(y)) = 0$ for every scalar function $h$.
- $E(x \mid y) = E(x)$ if $x$ and $y$ are independent.

Bayes' Rule:
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(y \mid x) p(x)}{\int p(y \mid x) p(x) \, dx}.$$
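The smoothing and orthogonality properties above are easy to check by Monte Carlo. In this sketch (our own construction, not from the notes) we pick a model where $E(x \mid y)$ is known in closed form: $y \sim N(0, 1)$ and $x = y^2 + w$ with $w$ independent of $y$, so $E(x \mid y) = y^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
y = rng.normal(size=N)
w = rng.normal(size=N)
x = y**2 + w                           # by construction, E(x|y) = y^2

g = y**2                               # g(y) = E(x|y)
print(x.mean(), g.mean())              # smoothing: E(E(x|y)) = E(x), both near 1
print(np.mean((x - g) * np.sin(y)))    # orthogonality: near 0 for h(y) = sin(y)
```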
2.2 The Estimation Problem

The basic estimation problem is: compute an estimate $\hat{x}$ for an unknown quantity $x \in X = \mathbb{R}^n$, based on the measurements $y = (y_1, \dots, y_m) \in \mathbb{R}^m$.

Obviously, we need a model that relates $y$ to $x$. For example,
$$y = h(x) + v,$$
where $h$ is a known function, and $v$ a noise (or error) vector.

An estimator $\hat{x}$ for $x$ is a function $\hat{x} : y \mapsto \hat{x}(y)$. The value of $\hat{x}(y)$ at a specific observed value $y$ is an estimate of $x$.

Under different statistical assumptions, we have the following major solution concepts:

(i) Deterministic framework: Here we simply look for $x$ that minimizes the error in $y \approx h(x)$. The most common criterion is the square norm:
$$\min_x \|y - h(x)\|^2 = \min_x \sum_{i=1}^{m} |y_i - h_i(x)|^2.$$
This is the well-known (non-linear) least-squares (LS) problem; see the sketch below.
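A concrete instance of the nonlinear LS problem, sketched with `scipy.optimize.least_squares`; the exponential-decay model $h$ and all numbers here are invented for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)

# Invented model: h(x)_i = x0 * exp(-x1 * t_i), observed with additive noise.
t = np.linspace(0, 4, 30)
x_true = np.array([2.0, 0.7])
y = x_true[0] * np.exp(-x_true[1] * t) + 0.05 * rng.normal(size=t.size)

def residuals(x):
    # y - h(x): least_squares minimizes the squared norm of this vector.
    return y - x[0] * np.exp(-x[1] * t)

sol = least_squares(residuals, x0=np.array([1.0, 1.0]))
print(sol.x)   # should be close to x_true
```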
(ii) Non-Bayesian framework: Assume that $y$ is a random function of $x$. For example, $y = h(x) + v$, with $v$ an RV. More generally, we are given, for each fixed $x$, the pdf $p(y \mid x)$ (i.e., $y \sim p(\cdot \mid x)$). No statistical assumptions are made on $x$. The main solution concept here is the MLE.

(iii) Bayesian framework: Here we assume that both $y$ and $x$ are RVs with known joint statistics. The main solution concepts here are the MAP estimator and the optimal (MMSE) estimator.

A problem related to estimation is the regression problem: given measurements $(x_k, y_k)_{k=1}^{N}$, find a function $h$ that gives the best fit $y_k \approx h(x_k)$. Here $h$ is the regressor, or regression function. We shall not consider this problem directly in this course.
2.3 The Bayes Framework

In the Bayesian setting, we are given:
(i) $p_x(x)$: the prior distribution of $x$.
(ii) $p_{y|x}(y \mid x)$: the conditional distribution of $y$ given $x = x$.

Note that $p(y \mid x)$ is often specified through an equation such as $y = h(x, v)$ or $y = h(x) + v$, with $v$ an RV, but this is immaterial for the theory.

We can now compute the posterior probability of $x$:
$$p(x \mid y) = \frac{p(y \mid x) p(x)}{\int p(y \mid x) p(x) \, dx}.$$

Given $p(x \mid y)$, what would be a reasonable choice for $\hat{x}$? The two common choices are:

(i) The mean of $x$ according to $p(x \mid y)$:
$$\hat{x}(y) = E(x \mid y) = \int x \, p(x \mid y) \, dx.$$

(ii) The most likely value of $x$ according to $p(x \mid y)$:
$$\hat{x}(y) = \arg\max_x p(x \mid y).$$

The first leads to the MMSE estimator, the second to the MAP estimator.
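For scalar $x$, the posterior and both point estimates can be computed numerically on a grid. A minimal sketch, assuming an invented Gaussian prior and measurement model:

```python
import numpy as np

# Invented scalar model: prior x ~ N(0,1); measurement y = x + v, v ~ N(0, 0.5^2).
xs = np.linspace(-5.0, 5.0, 2001)
dx = xs[1] - xs[0]
y_obs = 1.3                                      # observed value of y

prior = np.exp(-0.5 * xs**2)                     # p(x), up to a constant
lik = np.exp(-0.5 * ((y_obs - xs) / 0.5)**2)     # p(y|x), up to a constant

post = prior * lik
post /= post.sum() * dx                          # normalized posterior p(x|y)

x_mean = (xs * post).sum() * dx                  # posterior mean (MMSE estimate)
x_mode = xs[np.argmax(post)]                     # posterior mode (MAP estimate)
print(x_mean, x_mode)   # both near 1.3 * 0.8 = 1.04 for this Gaussian model
```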
2.4 The MMSE Estimator

The Mean Square Error (MSE) of an estimator $\hat{x}$ is given by
$$\mathrm{MSE}(\hat{x}) = E(\|x - \hat{x}(y)\|^2).$$
The Minimal Mean Square Error (MMSE) estimator, $\hat{x}_{\mathrm{MMSE}}$, is the one that minimizes the MSE.

Theorem: $\hat{x}_{\mathrm{MMSE}}(y) = E(x \mid y = y)$.

Remarks:
1. Recall that the conditional expectation $E(x \mid y)$ satisfies the orthogonality principle (see above). This gives an easy proof of the theorem.
2. The MMSE estimator is unbiased: $E(\hat{x}_{\mathrm{MMSE}}(y)) = E(x)$.
3. The posterior MSE is defined (for every $y$) as
$$\mathrm{MSE}(\hat{x} \mid y) = E(\|x - \hat{x}(y)\|^2 \mid y = y),$$
with minimal value $\mathrm{MMSE}(y)$. Note that
$$\mathrm{MSE}(\hat{x}) = E\big(E(\|x - \hat{x}(y)\|^2 \mid y)\big) = \int_y \mathrm{MSE}(\hat{x} \mid y) \, p(y) \, dy.$$
Since $\mathrm{MSE}(\hat{x} \mid y)$ can be minimized for each $y$ separately, it follows that minimizing the MSE is equivalent to minimizing the posterior MSE for every $y$.

Some shortcomings of the MMSE estimator are:
- It is hard to compute (except for special cases).
- It may be inappropriate for multi-modal distributions.
- It requires the prior $p(x)$, which may not be available.

Example: The Gaussian Case. Let $x$ and $y$ be jointly Gaussian RVs with means $E(x) = m_x$, $E(y) = m_y$, and covariance matrix
$$\mathrm{cov}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}.$$
By direct calculation, the posterior distribution $p_{x|y=y}$ is Gaussian, with mean
$$m_{x|y} = m_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - m_y),$$
and covariance
$$\Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}.$$
(If $\Sigma_{yy}^{-1}$ does not exist, it may be replaced by the pseudo-inverse.) Note that the posterior covariance $\Sigma_{x|y}$ does not depend on the actual value $y$ of $y$!

It follows immediately that for the Gaussian case,
$$\hat{x}_{\mathrm{MMSE}}(y) \equiv E(x \mid y = y) = m_{x|y},$$
and the associated posterior MMSE equals
$$\mathrm{MMSE}(y) = E(\|x - \hat{x}_{\mathrm{MMSE}}(y)\|^2 \mid y = y) = \mathrm{trace}(\Sigma_{x|y}).$$
Note that here $\hat{x}_{\mathrm{MMSE}}$ is a linear function of $y$. Also, the posterior MMSE does not depend on $y$.
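These formulas translate directly into code. A minimal sketch; the function name and the example numbers are our own:

```python
import numpy as np

def gaussian_conditional(m_x, m_y, S_xx, S_xy, S_yy, y):
    """Posterior mean and covariance of x given y, for jointly Gaussian (x, y)."""
    K = S_xy @ np.linalg.inv(S_yy)        # use pinv(S_yy) if S_yy is singular
    m_post = m_x + K @ (y - m_y)          # m_x + Sigma_xy Sigma_yy^{-1} (y - m_y)
    S_post = S_xx - K @ S_xy.T            # Sigma_xx - Sigma_xy Sigma_yy^{-1} Sigma_yx
    return m_post, S_post

# Example: scalar x and y, treated as 1-vectors.
m_post, S_post = gaussian_conditional(
    m_x=np.array([0.0]), m_y=np.array([0.0]),
    S_xx=np.array([[1.0]]), S_xy=np.array([[0.8]]), S_yy=np.array([[2.0]]),
    y=np.array([1.0]))
print(m_post, S_post)   # 0.8/2 * 1 = 0.4 and 1 - 0.8^2/2 = 0.68
```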
2.5 The Linear MMSE Estimator

When the MMSE estimator is too complicated we may settle for the best linear estimator. Thus, we look for $\hat{x}$ of the form
$$\hat{x}(y) = Ay + b$$
that minimizes
$$\mathrm{MSE}(\hat{x}) = E(\|x - \hat{x}(y)\|^2).$$
The solution may be easily obtained by differentiation, and has exactly the same form as the MMSE estimator for the Gaussian case:
$$\hat{x}_L(y) = m_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - m_y).$$

Notes:
- The LMMSE estimator depends only on the first and second order statistics of $x$ and $y$.
- The LMMSE estimator does not minimize the posterior MSE, namely $\mathrm{MSE}(\hat{x} \mid y)$. This holds only in the Gaussian case, where the LMMSE and MMSE estimators coincide.
- The orthogonality principle here is:
$$E\big((x - \hat{x}_L(y)) \, L(y)^T\big) = 0,$$
for every linear function $L(y) = Ay + b$ of $y$.
- The LMMSE estimator is unbiased: $E(\hat{x}_L(y)) = E(x)$.
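Since only first- and second-order statistics are needed, the LMMSE estimator can be built from sample moments even for a non-Gaussian pair, where it generally differs from the optimal MMSE estimator. A sketch with an invented scalar model:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Non-Gaussian pair: x uniform, y a nonlinear function of x plus noise.
x = rng.uniform(-1, 1, size=N)
y = x**3 + 0.1 * rng.normal(size=N)

m_x, m_y = x.mean(), y.mean()
S_xy = np.mean((x - m_x) * (y - m_y))
S_yy = np.var(y)

A = S_xy / S_yy                       # scalar case of Sigma_xy Sigma_yy^{-1}
x_hat = m_x + A * (y - m_y)           # LMMSE estimate for each sample

# Orthogonality check: the error is uncorrelated with any linear function of y.
print(np.mean((x - x_hat) * y))       # near 0
print(np.mean((x - x_hat)**2))        # linear MMSE (not the optimal MMSE here)
```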
2.6 The MAP Estimator

Still in the Bayesian setting, the MAP (Maximum A-Posteriori) estimator is defined as
$$\hat{x}_{\mathrm{MAP}}(y) = \arg\max_x p(x \mid y).$$
Noting that
$$p(x \mid y) = \frac{p(x, y)}{p(y)} = \frac{p(x) \, p(y \mid x)}{p(y)},$$
we obtain the equivalent characterizations:
$$\hat{x}_{\mathrm{MAP}}(y) = \arg\max_x p(x, y) = \arg\max_x p(x) \, p(y \mid x).$$

Motivation: Find the value of $x$ which has the highest probability according to the posterior $p(x \mid y)$.

Example: In the Gaussian case, with $p(x \mid y) \sim N(m_{x|y}, \Sigma_{x|y})$, we have:
$$\hat{x}_{\mathrm{MAP}}(y) = \arg\max_x p(x \mid y) = m_{x|y} \equiv E(x \mid y = y).$$
Hence, $\hat{x}_{\mathrm{MAP}} \equiv \hat{x}_{\mathrm{MMSE}}$ for this case.
2.7 The ML Estimator

The MLE is defined in a non-Bayesian setting: no prior $p(x)$ is given; in fact, $x$ need not be random. The distribution $p(y \mid x)$ of $y$ given $x$ is given as before. The MLE is defined by:
$$\hat{x}_{\mathrm{ML}}(y) = \arg\max_{x \in X} p(y \mid x).$$
It is convenient to define the likelihood function $L_y(x) = p(y \mid x)$ and the log-likelihood function $\Lambda_y(x) = \log L_y(x)$, and then we have
$$\hat{x}_{\mathrm{ML}}(y) = \arg\max_{x \in X} L_y(x) \equiv \arg\max_{x \in X} \Lambda_y(x).$$

Notes:
- Often $x$ is denoted as $\theta$ in this context.
- Motivation: the MLE is the value of $x$ that makes the observed $y$ most likely. This justification is merely heuristic!
- Compared with the MAP estimator, $\hat{x}_{\mathrm{MAP}}(y) = \arg\max_x p(x) p(y \mid x)$, we see that the MLE lacks the weighting of $p(y \mid x)$ by the prior $p(x)$.
- The power of the MLE lies in its simplicity and its good asymptotic behavior.
Example 1: $y$ is exponentially distributed with rate $x > 0$, namely $x = E(y)^{-1}$. Thus:
$$F(y \mid x) = (1 - e^{-xy}) \, 1\{y \ge 0\},$$
$$p(y \mid x) = x e^{-xy} \, 1\{y \ge 0\},$$
$$\hat{x}_{\mathrm{ML}}(y) = \arg\max_{x \ge 0} \, x e^{-xy}.$$
Setting $\frac{d}{dx}(x e^{-xy}) = 0$ gives $x = y^{-1}$, hence $\hat{x}_{\mathrm{ML}}(y) = y^{-1}$.

Example 2 (Gaussian case):
$$y = Hx + v \qquad (y \in \mathbb{R}^m, \; x \in \mathbb{R}^n), \qquad v \sim N(0, R_v),$$
$$L_y(x) = p(y \mid x) = \frac{1}{c} \, e^{-\frac{1}{2}(y - Hx)^T R_v^{-1} (y - Hx)},$$
$$\log L_y(x) = -\log c - \tfrac{1}{2}(y - Hx)^T R_v^{-1} (y - Hx),$$
$$\hat{x}_{\mathrm{ML}} = \arg\min_x \, (y - Hx)^T R_v^{-1} (y - Hx).$$
This is a (weighted) LS problem! By differentiation,
$$H^T R_v^{-1} (y - Hx) = 0 \quad \Longrightarrow \quad \hat{x}_{\mathrm{ML}} = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} y$$
(assuming that $H^T R_v^{-1} H$ is invertible; in particular, $m \ge n$).
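Example 2 in code: a sketch of the weighted-LS solution, with $H$, $R_v$, and the data invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(4)

n, m = 2, 50
H = np.column_stack([np.ones(m), np.arange(m, dtype=float)])   # m x n, full rank
x_true = np.array([1.0, 0.3])
R_v = np.diag(rng.uniform(0.5, 2.0, size=m))                   # known noise covariance
v = rng.multivariate_normal(np.zeros(m), R_v)
y = H @ x_true + v

# ML / weighted LS: x_hat = (H^T R_v^{-1} H)^{-1} H^T R_v^{-1} y
W = np.linalg.inv(R_v)
x_hat = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)
print(x_hat)   # close to x_true
```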
2.8 Bias and Covariance

Since the measurement $y$ is random, the estimate $\hat{x} = \hat{x}(y)$ is a random variable, and we can consider its mean and covariance. The conditional mean of $\hat{x}$ is given by
$$\hat{m}(x) = E(\hat{x} \mid x) \equiv E(\hat{x} \mid x = x) = \int \hat{x}(y) \, p(y \mid x) \, dy.$$
The bias of $\hat{x}$ is defined as
$$b(x) = E(\hat{x} \mid x) - x.$$
The estimator $\hat{x}$ is (conditionally) unbiased if $b(x) = 0$ for every $x \in X$. The covariance matrix of $\hat{x}$ is
$$\mathrm{cov}(\hat{x} \mid x) = E\big((\hat{x} - E(\hat{x} \mid x))(\hat{x} - E(\hat{x} \mid x))^T \mid x = x\big).$$

In the scalar case, it follows by orthogonality that
$$\mathrm{MSE}(\hat{x} \mid x) \equiv E((x - \hat{x})^2 \mid x) = E\big((x - E(\hat{x} \mid x) + E(\hat{x} \mid x) - \hat{x})^2 \mid x\big) = \mathrm{cov}(\hat{x} \mid x) + b(x)^2.$$
Thus, if $\hat{x}$ is conditionally unbiased, $\mathrm{MSE}(\hat{x} \mid x) = \mathrm{cov}(\hat{x} \mid x)$. Similarly, if $x$ is vector-valued, then
$$\mathrm{MSE}(\hat{x} \mid x) = \mathrm{trace}(\mathrm{cov}(\hat{x} \mid x)) + \|b(x)\|^2.$$

In the Bayesian case, we say that $\hat{x}$ is unbiased if $E(\hat{x}(y)) = E(x)$. Note that the first expectation is over both $x$ and $y$.
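The decomposition $\mathrm{MSE} = \mathrm{cov} + b^2$ is easy to verify by simulation. A sketch using a deliberately biased (shrinkage) estimator of a scalar parameter; all the choices below are ours:

```python
import numpy as np

rng = np.random.default_rng(5)
x = 2.0                                      # true (nonrandom) parameter
n, trials = 10, 200_000

y = x + rng.normal(size=(trials, n))         # y_i = x + v_i, v_i ~ N(0, 1)
x_hat = 0.8 * y.mean(axis=1)                 # shrinkage estimator: biased

bias = x_hat.mean() - x                      # near 0.8*x - x = -0.4
cov = x_hat.var()                            # near 0.8^2 / n = 0.064
mse = np.mean((x_hat - x)**2)
print(mse, cov + bias**2)                    # the two should agree
```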
2.9 The Cramér-Rao Lower Bound (CRLB)

The CRLB gives a lower bound on the MSE of any (unbiased) estimator. For illustration, we mention here the non-Bayesian version, with a scalar parameter $x$.

Assume that $\hat{x}$ is conditionally unbiased, namely $E_x(\hat{x}(y)) = x$ (we use here $E_x(\cdot)$ for $E(\cdot \mid x = x)$). Then
$$\mathrm{MSE}(\hat{x} \mid x) = E_x\{(\hat{x}(y) - x)^2\} \ge J(x)^{-1},$$
where $J$ is the Fisher information:
$$J(x) = -E_x\left\{\frac{\partial^2 \ln p(y \mid x)}{\partial x^2}\right\} = E_x\left\{\left(\frac{\partial \ln p(y \mid x)}{\partial x}\right)^2\right\}.$$
An (unbiased) estimator that meets the above CRLB is said to be efficient.
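For the exponential example of Section 2.7, $\ln p(y \mid x) = \ln x - xy$, so the score is $1/x - y$ and $J(x) = 1/x^2$. The sketch below (our own check, not in the notes) compares the Monte Carlo estimate of $E_x\{(\partial \ln p / \partial x)^2\}$ with $1/x^2$:

```python
import numpy as np

rng = np.random.default_rng(6)
x = 1.5                                      # true rate
y = rng.exponential(scale=1 / x, size=1_000_000)

# Score: d/dx ln p(y|x) = d/dx (ln x - x*y) = 1/x - y
J_mc = np.mean((1 / x - y)**2)               # E[(score)^2]
print(J_mc, 1 / x**2)                        # both near 1/x^2, so the CRLB is x^2
```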
2.10 Asymptotic Properties of the MLE

Suppose $x$ is estimated based on multiple i.i.d. samples: $y = y^n = (y_1, \dots, y_n)$, with
$$p(y^n \mid x) = \prod_{i=1}^{n} p_0(y_i \mid x).$$
For each $n \ge 1$, let $\hat{x}_n$ denote an estimator based on $y^n$; for example, $\hat{x}_n = \hat{x}_n^{\mathrm{ML}}$. We consider the asymptotic properties of $\{\hat{x}_n\}$ as $n \to \infty$.

Definitions: The (non-Bayesian) estimator sequence $\{\hat{x}_n\}$ is termed:
- Consistent if: $\lim_{n \to \infty} \hat{x}_n(y^n) = x$ (w.p. 1).
- Asymptotically unbiased if: $\lim_{n \to \infty} E_x(\hat{x}_n(y^n)) = x$.
- Asymptotically efficient if it satisfies the CRLB as $n \to \infty$, in the sense that:
$$\lim_{n \to \infty} J_n(x) \, \mathrm{MSE}(\hat{x}_n \mid x) = 1.$$
Here $\mathrm{MSE}(\hat{x}_n \mid x) = E_x((\hat{x}_n(y^n) - x)^2)$, and $J_n$ is the Fisher information for $y^n$. For i.i.d. observations, $J_n = n J_1$.

The ML estimator $\hat{x}_n^{\mathrm{ML}}$ is both asymptotically unbiased and asymptotically efficient (under mild technical conditions); a small Monte Carlo illustration follows.
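A Monte Carlo sketch of these properties (our own) for the exponential-rate MLE $\hat{x}_n = n / \sum_i y_i$: asymptotic efficiency predicts $n J_1(x) \, \mathrm{MSE}(\hat{x}_n \mid x) \to 1$, and the mean of $\hat{x}_n$ should approach the true $x$:

```python
import numpy as np

rng = np.random.default_rng(7)
x = 1.5                                       # true rate of the exponential
J1 = 1 / x**2                                 # per-sample Fisher information

for n in [10, 100, 1000]:
    y = rng.exponential(scale=1 / x, size=(5000, n))
    x_hat = 1 / y.mean(axis=1)                # MLE of the rate: n / sum(y_i)
    mse = np.mean((x_hat - x)**2)
    print(n, x_hat.mean(), n * J1 * mse)      # mean -> x, product -> 1
```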