ECE531 Lecture 8: Non-Random Parameter Estimation
D. Richard Brown III
Worcester Polytechnic Institute
19-March-2009
Introduction

Recall the two basic classes of estimation problems:
- Known prior π(θ): Bayesian estimation.
- Unknown prior: non-random parameter estimation.

In non-random parameter estimation problems, we can still compute the risk of the estimator ˆθ(y) when the true parameter is θ:

    R_θ(ˆθ) := E_θ[C_θ(ˆθ(Y))] = ∫_Y C_θ(ˆθ(y)) p_Y(y ; θ) dy

where E_θ denotes expectation parameterized by θ and C : Λ × Λ → R is the cost of the parameter estimate ˆθ ∈ Λ given the true parameter θ ∈ Λ.

We cannot, however, compute any sort of average risk r(ˆθ) = E[R_Θ(ˆθ)], since we have no distribution on the random parameter Θ.
Uniformly Most Powerful Estimators?

We would like to find a uniformly most powerful estimator ˆθ(y) that minimizes the conditional risk R_θ(ˆθ) for all θ ∈ Λ.

Example: Consider two distinct points θ_0 ≠ θ_1 in Λ. Suppose we have two estimators ˆθ_a(y) ≡ θ_0 for all y ∈ Y and ˆθ_b(y) ≡ θ_1 for all y ∈ Y. Note that both of these estimators ignore the observation and always return the same estimate. For any of the cost functions we have considered, we know that

    C_{θ_0}(ˆθ_a(y)) = 0 and C_{θ_1}(ˆθ_b(y)) = 0.

For the MMSE or MMAE estimators, we also know that

    C_{θ_1}(ˆθ_a(y)) > 0 and C_{θ_0}(ˆθ_b(y)) > 0.

It should be clear that a uniformly most powerful estimator is not going to exist in most cases of interest.
Some Options

1. We could restrict our attention to the sorts of problems that do admit a uniformly most powerful estimator.
2. We could try to find locally most powerful estimators.
3. We could assume a prior π(θ), e.g. perhaps some sort of least-favorable prior, and solve the problem in the Bayes framework.
4. We could keep the problem non-random but place restrictions on the class of estimators that we are willing to consider.
Option 4: Consider Only Unbiased Estimators

A reasonable restriction on the class of estimators that we are willing to consider is the class of unbiased estimators.

Definition. An estimator ˆθ(y) is unbiased if

    E_θ[ˆθ(Y)] = θ for all θ ∈ Λ.

Remarks:
- This class excludes trivial estimators like ˆθ(y) ≡ θ_0.
- Under the squared-error cost assignment, the parameterized risk of estimators in this class is

    R_θ(ˆθ) = E_θ[ ‖θ − ˆθ(Y)‖²₂ ] = Σ_i E_θ[ (ˆθ_i(Y) − θ_i)² ] = Σ_i var_θ[ ˆθ_i(Y) ]

The goal: find an unbiased estimator with minimum variance.
Minimum Variance Unbiased Estimators

Definition. A minimum-variance unbiased (MVU) estimator ˆθ_mvu(y) is an unbiased estimator satisfying

    ˆθ_mvu(y) = arg min_{ˆθ(y) ∈ Ω} R_θ(ˆθ(y)) for all θ ∈ Λ

where Ω is the set of all unbiased estimators.

Remarks:
- Finding an MVU estimator is still a multi-objective optimization problem.
- MVU estimators do not always exist.
- The class of problems in which MVU estimators do exist, however, is much larger than that of uniformly most powerful estimators.
Example: Estimating a Constant in White Gaussian Noise

Suppose we have random observations given by

    Y_k = θ + W_k,  k = 0, ..., n−1

where W_k ~ i.i.d. N(0, σ²). The unknown parameter θ can take on any value on the real line and we have no prior pdf. Suppose we generate estimates with the sample mean:

    ˆθ(y) = (1/n) Σ_{k=0}^{n−1} y_k

Is this estimator unbiased? Yes (easy to check). Is this estimator MVU? The variance of the estimator can be calculated as var_θ[ˆθ(Y)] = σ²/n. But answering the question of whether this estimator is MVU will require more work.
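The claims on this slide are easy to verify numerically. A minimal Monte Carlo sketch (the particular values of θ, σ, and n below are our own choices, not from the lecture):

```python
import numpy as np

# Monte Carlo check of the sample-mean estimator for a constant in WGN:
# it should be unbiased with variance sigma^2 / n.
rng = np.random.default_rng(0)
theta, sigma, n, trials = 2.0, 1.5, 10, 200_000

y = theta + sigma * rng.standard_normal((trials, n))  # Y_k = theta + W_k
theta_hat = y.mean(axis=1)                            # sample-mean estimator

print(theta_hat.mean())  # close to theta       (unbiased)
print(theta_hat.var())   # close to sigma**2/n  (variance sigma^2 / n)
```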
A Generalized Model for Estimation Problems

[Block diagram: parameter space Λ (θ) → probabilistic model p_Y(y ; θ) → observation space Y (y); an estimator maps y to ĝ(y); a sufficient statistic T(y) feeds the RBLS estimator g̃[T(y)].]
A Procedure for Finding MVU Estimators

To find an MVU estimator for g(θ), we can follow a three-step procedure:
1. Find a complete sufficient statistic T for the family of pdfs {p_Y(y ; θ) ; θ ∈ Λ} parameterized by θ.
2. Find any unbiased estimator ĝ(y) of g(θ).
3. Compute g̃[T(y)] = E_θ[ĝ(Y) | T(Y) = T(y)].

The Rao-Blackwell-Lehmann-Sheffe theorem says that g̃[T(y)] will be an MVU estimator of g(θ).
Sufficiency and Minimal Sufficiency

Definition. A statistic T, a function of the observation y ∈ Y, is sufficient for the family of pdfs {p_Y(y ; θ) ; θ ∈ Λ} if the distribution of the random observation conditioned on T(Y), i.e. p_Y(y | T(Y) = t ; θ), does not depend on θ for all θ ∈ Λ and all t.

Intuitively, a sufficient statistic summarizes the information contained in the observation about the unknown parameter. Knowing T(y) is as good as knowing the full observation y when we wish to estimate θ or g(θ).

Definition. A statistic T is said to be minimal sufficient for the family of pdfs {p_Y(y ; θ) ; θ ∈ Λ} if it is a function of every other sufficient statistic for this family of pdfs.

Intuitively, a minimal sufficient statistic is the most concise summary of the observation. These can be hard to find and don't always exist.
Example: Sufficiency (part 1)

Suppose θ ∈ R and we get a vector observation y ∈ R^n distributed as

    p_Y(y ; θ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} (y_k − θ)² } = (2πσ²)^{−n/2} exp{ −‖y − 1θ‖²₂ / (2σ²) }

Is T(y) = y = [y_0, ..., y_{n−1}]ᵀ sufficient? Intuitively, it should be. To gain some experience with the definition, however, let's check. What happens to p_Y(y ; θ) when we condition on an observation Y = [y_0, ..., y_{n−1}]ᵀ?
Example: Sufficiency (part 2)

Is T(y) = ȳ = (1/n) Σ_{k=0}^{n−1} y_k sufficient? Note that Ȳ := (1/n) Σ_{k=0}^{n−1} Y_k is a random variable distributed as N(θ, σ²/n). We can apply the definition directly...

    p_Y(y | ȳ ; θ) = p_Ȳ(ȳ | y ; θ) p_Y(y ; θ) / p_Ȳ(ȳ ; θ)          (Bayes)
                   = δ(ȳ − (1/n) Σ_k y_k) p_Y(y ; θ) / p_Ȳ(ȳ ; θ)
                   = c δ(ȳ − (1/n) Σ_k y_k) exp{ −(Σ_k y_k² − n ȳ²) / (2σ²) }

where δ(·) is the Dirac delta function and c does not depend on θ. Hence T(y) = (1/n) Σ_{k=0}^{n−1} y_k is a sufficient statistic.
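The definition can also be probed empirically: conditioned on Ȳ landing near a fixed value t, the distribution of the observation should not depend on θ. A rough sketch (window width, sample sizes, and the two θ values are our own illustrative choices):

```python
import numpy as np

# Empirical check of sufficiency: condition on Ybar ~ t and compare the
# conditional distribution of Y_0 under two different values of theta.
rng = np.random.default_rng(0)
n, sigma, t, eps, trials = 4, 1.0, 0.5, 0.02, 2_000_000

def conditional_stats(theta):
    y = theta + sigma * rng.standard_normal((trials, n))
    keep = np.abs(y.mean(axis=1) - t) < eps   # crude conditioning on Ybar = t
    return y[keep, 0].mean(), y[keep, 0].std()

m0, s0 = conditional_stats(0.0)
m1, s1 = conditional_stats(1.0)
print(m0, m1)  # conditional means agree (and are ~ t), independent of theta
print(s0, s1)  # conditional spreads agree, independent of theta
```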
Neyman-Fisher Factorization Theorem

Guessing and checking sufficient statistics isn't very satisfying. We need a procedure for finding sufficient statistics.

Theorem (Fisher 1920, Neyman 1935). A statistic T is sufficient for θ if and only if there exist functions g_θ and h such that the pdf of the observation can be factored as

    p_Y(y ; θ) = g_θ(T(y)) h(y)

for all y ∈ Y and all θ ∈ Λ.

The Poor textbook gives the proof for the case when Y is discrete. A general proof can be found in Lehmann 1986.
Example: Neyman-Fisher Factorization Theorem

Suppose θ ∈ R and

    p_Y(y ; θ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} (y_k − θ)² }.

Let T(y) = (1/n) Σ_{k=0}^{n−1} y_k = ȳ. We already know this is a sufficient statistic, but let's try the factorization:

    p_Y(y ; θ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} (y_k² − 2θ y_k + θ²) }
               = exp{ −(n/(2σ²)) (θ² − 2θȳ) } · (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} y_k² }
               =          g_θ(T(y))          ·                    h(y)
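The factorization above can be checked numerically: for an arbitrary observation y and several values of θ, the density should equal g_θ(ȳ)·h(y) exactly. A small sketch (variable names and parameter values are ours):

```python
import numpy as np

# Numerical check of the Neyman-Fisher factorization for the Gaussian
# example: p_Y(y; theta) == g_theta(ybar) * h(y) for every theta.
rng = np.random.default_rng(3)
sigma, n = 1.3, 5
y = rng.standard_normal(n)
ybar = y.mean()
h = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(-np.sum(y**2) / (2 * sigma**2))

for theta in (-1.0, 0.4, 2.5):
    p = (2 * np.pi * sigma**2) ** (-n / 2) * np.exp(
        -np.sum((y - theta) ** 2) / (2 * sigma**2))
    g = np.exp(-n * (theta**2 - 2 * theta * ybar) / (2 * sigma**2))
    print(np.isclose(p, g * h))  # True for every theta
```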
Flashback: Sufficient Statistic for Simple Binary HT

Suppose we have a simple binary hypothesis testing problem:

    p_Y(y ; θ) = p_0(y) if θ = 0,  p_1(y) if θ = 1.

Let

    T(y) = p_1(y) / p_0(y),  g_θ(T(y)) = θ T(y) + (1 − θ),  h(y) = p_0(y).

Then it is easy to show that p_Y(y ; θ) = g_θ(T(y)) h(y). Hence the likelihood ratio L(y) = p_1(y)/p_0(y) is a sufficient statistic for simple binary hypothesis testing problems.
Completeness

Definition. The family of pdfs {p_Y(y ; θ) ; θ ∈ Λ} is said to be complete if the condition E_θ[f(Y)] = 0 for all θ ∈ Λ implies that Prob_θ[f(Y) = 0] = 1 for all θ ∈ Λ. Note that f : Y → R can be any function.

To get some intuition, consider the case where Y = {y_0, ..., y_{L−1}} is a finite set. Then

    E_θ[f(Y)] = Σ_{l=0}^{L−1} f(y_l) Prob_θ(Y = y_l) = fᵀ P_θ

where f := [f(y_0), ..., f(y_{L−1})]ᵀ and P_θ := [Prob_θ(Y = y_0), ..., Prob_θ(Y = y_{L−1})]ᵀ. For a fixed θ it is certainly possible to find a non-zero f such that E_θ[f(Y)] = 0. But we have to satisfy this condition for all θ ∈ Λ, i.e. we need a vector f that is orthogonal to all members of the family of vectors {P_θ ; θ ∈ Λ}. If the only such vector is f(y_0) = ··· = f(y_{L−1}) = 0, then the family {P_θ ; θ ∈ Λ} is complete.
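For a concrete finite-alphabet instance of this orthogonality picture, consider Y ~ Binomial(2, θ) with Y ∈ {0, 1, 2} (an example of our own, not from the lecture). The family is complete exactly when no non-zero f is orthogonal to every P_θ, which we can check as a rank condition on a matrix stacking P_θ over a grid of θ values:

```python
import numpy as np

# Finite-alphabet completeness check: P_theta for Binomial(2, theta) is
# [(1-theta)^2, 2 theta (1-theta), theta^2].  If the stacked matrix has
# full column rank, the only f orthogonal to every P_theta is f = 0.
thetas = np.linspace(0.1, 0.9, 50)
P = np.stack([(1 - thetas)**2, 2 * thetas * (1 - thetas), thetas**2], axis=1)

print(np.linalg.matrix_rank(P))  # 3: full rank, so the family is complete
```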
Complete Sufficient Statistics

Definition. Suppose that T is a sufficient statistic for the family of pdfs {p_Y(y ; θ) ; θ ∈ Λ}. Let p_Z(z ; θ) denote the distribution of Z = T(Y) when the parameter is θ. If the family of pdfs {p_Z(z ; θ) ; θ ∈ Λ} is complete, then T is said to be a complete sufficient statistic for the family {p_Y(y ; θ) ; θ ∈ Λ}.

[Block diagram: parameter space Λ (θ) → probabilistic model p_Y(y ; θ) → observation space Y; the sufficient statistic T maps Y to Z.]
Example: Complete Sufficient Statistic

Suppose θ ∈ R and

    p_Y(y ; θ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} (y_k − θ)² }.

Let T(y) = (1/n) Σ_{k=0}^{n−1} y_k. We know this is a sufficient statistic, but is it complete? We know the distribution of T(Y) = Ȳ is N(θ, σ²/n). We require E_θ[f(Ȳ)] = 0 for all θ ∈ Λ, i.e.

    s(θ) = ∫ f(x) √(n/(2πσ²)) exp{ −n(x − θ)² / (2σ²) } dx = 0 for all θ ∈ Λ

Is there any non-zero f : R → R that can do this? Suppose θ = 0. Lots of functions will work, e.g. f(x) = x, f(x) = sin(x), etc. But we need s(θ) = 0 for all θ ∈ Λ.
Example: Complete Sufficient Statistic (continued)

    s(θ) = ∫ f(x) √(n/(2πσ²)) exp{ −n(x − θ)² / (2σ²) } dx = 0 for all θ ∈ Λ

Dropping the positive constant, this is equivalent to

    ∫ f(x) exp{ −n(θ − x)² / (2σ²) } dx = 0 for all θ ∈ Λ

But this is just the convolution of f(x) with a Gaussian pulse. Recall that convolution in the time domain is multiplication in the frequency domain. Hence, if S(ω) is the Fourier transform of s(θ), then

    S(ω) = F(ω) G(ω) = 0 for all ω

where G(ω) is the Fourier transform of the Gaussian pulse. Note that G(ω) is itself Gaussian and therefore positive for all ω. Hence, the only way to force S(ω) ≡ 0 is to have F(ω) ≡ 0. The only solution to E_θ[f(Ȳ)] = 0 for all θ ∈ Λ is therefore f(x) ≡ 0 for all x and, consequently, T(y) is a complete sufficient statistic.
Example: Incomplete Sufficient Statistic

Suppose θ ∈ R and you would like to estimate θ from a scalar observation Y = θ + W where W ~ U[−1/2, 1/2]. An obvious sufficient statistic is T(y) = y. But is it complete? Since T(Y) = Y, we require E_θ[f(Y)] = 0 for all θ ∈ Λ, i.e.

    s(θ) = ∫_{θ−1/2}^{θ+1/2} f(x) dx = 0 for all θ ∈ Λ

Is there any non-zero f : R → R that can do this? How about f(x) = sin(2πx)? This definitely forces s(θ) = 0 for all θ ∈ Λ. We just need to confirm that Prob_θ[f(Y) = 0] < 1 for at least one θ ∈ Λ. Since we found a non-zero f(x) that forces E_θ[f(Y)] = 0 for all θ ∈ Λ, we can say that T(y) = y is not complete.
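The counterexample is easy to confirm by simulation: with f(x) = sin(2πx), a Monte Carlo estimate of E_θ[f(Y)] comes out near zero for every θ tried, even though f(Y) is clearly not zero with probability one. A sketch (the sample sizes and θ values are our own choices):

```python
import numpy as np

# Monte Carlo check that f(x) = sin(2*pi*x) has zero mean under
# Y = theta + U[-1/2, 1/2] for several values of theta.
rng = np.random.default_rng(1)
means = []
for theta in (0.0, 0.3, 1.7, -2.2):
    y = theta + rng.uniform(-0.5, 0.5, size=1_000_000)
    means.append(np.mean(np.sin(2 * np.pi * y)))
print(np.round(means, 3))  # all ~ 0, for every theta
```

Yet f(Y) itself is nonzero almost surely (e.g. E_θ|f(Y)| is far from zero), so T(y) = y fails completeness.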
Completeness Theorem for Exponential Families

Theorem. Suppose Y = R^n, Λ ⊆ R^m, and

    p_Y(y ; θ) = a(θ) exp{ Σ_{l=1}^{m} θ_l T_l(y) } h(y)

where a, T_1, ..., T_m, and h are all real-valued functions. Then T(y) = [T_1(y), ..., T_m(y)]ᵀ is a complete sufficient statistic for the family {p_Y(y ; θ) ; θ ∈ Λ} if Λ contains an m-dimensional rectangle.

Remarks:
- The technical detail about the m-dimensional rectangle ensures that Λ is not missing any dimensions in R^m, e.g. Λ is not a two-dimensional plane in R³.
- The main idea of the proof is similar to how we showed completeness in the Gaussian example. See Poor pp. 165-166 and Lehmann 1986.
Rao-Blackwell-Lehmann-Sheffe Theorem

Theorem. If ĝ(y) is any unbiased estimator of g(θ) and T is a sufficient statistic for the family {p_Y(y ; θ) ; θ ∈ Λ}, then

    g̃[T(y)] := E_θ[ĝ(Y) | T(Y) = T(y)]

is
1. a valid estimator of g(θ) (not a function of θ),
2. an unbiased estimator of g(θ), and
3. of lesser or equal variance than that of ĝ(y) for all θ ∈ Λ.

Additionally, if T is complete, then g̃[T(y)] is an MVU estimator of g(θ).
Example: Estimating a Constant in White Gaussian Noise

Suppose θ ∈ R and

    p_Y(y ; θ) = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) Σ_{k=0}^{n−1} (y_k − θ)² }.

Let T(y) = (1/n) Σ_{k=0}^{n−1} y_k. We know this is a complete sufficient statistic. Let's apply the RBLS theorem to find the MVU estimator...

Suppose g(θ) = θ is just the identity mapping. We could choose the unbiased estimator ĝ(y) = y_0. Now we need to compute

    g̃[T(y)] := E_θ[ĝ(Y) | T(Y) = T(y)] = E_θ[ Y_0 | (1/n) Σ_k Y_k = (1/n) Σ_k y_k ]

To solve this, we can use a standard formula for the conditional expectation of jointly Gaussian random variables...
Example (continued)

Suppose Z = [X, Y]ᵀ is jointly Gaussian distributed. It can be shown that

    E[X | Y = y] = E[X] + (cov(X, Y) / var(Y)) (y − E[Y]).

In our problem, letting Ȳ = (1/n) Σ_{k=0}^{n−1} Y_k, we can use this result to write

    E_θ[Y_0 | Ȳ = (1/n) Σ_k y_k] = E_θ[Y_0] + (cov_θ(Y_0, Ȳ) / var_θ(Ȳ)) ((1/n) Σ_k y_k − E_θ[Ȳ])
                                 = θ + ((σ²/n) / (σ²/n)) ((1/n) Σ_k y_k − θ)
                                 = (1/n) Σ_k y_k

Hence ˆθ_mvu(y) = (1/n) Σ_k y_k is an MVU estimator of θ.
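This RBLS computation can be sanity-checked by simulation. Since (Y_0, Ȳ) are jointly Gaussian, the conditional mean E[Y_0 | Ȳ] is the linear regression of Y_0 on Ȳ, which should have slope cov/var = 1 and intercept 0, i.e. conditioning the crude estimator Y_0 on Ȳ returns the sample mean. A sketch (parameter values are ours):

```python
import numpy as np

# Monte Carlo check: the regression of Y_0 on Ybar has slope ~1 and
# intercept ~0, so E[Y_0 | Ybar] = Ybar, matching the RBLS derivation.
rng = np.random.default_rng(2)
theta, sigma, n, trials = 1.0, 2.0, 8, 500_000

y = theta + sigma * rng.standard_normal((trials, n))
y0, ybar = y[:, 0], y.mean(axis=1)

slope = np.cov(y0, ybar)[0, 1] / np.var(ybar, ddof=1)
intercept = y0.mean() - slope * ybar.mean()
print(slope, intercept)       # ~ 1.0 and ~ 0.0
print(ybar.var() < y0.var())  # True: conditioning reduced the variance
```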
Conclusions

- To approach non-random parameter estimation problems, we had to restrict our attention to the class of unbiased estimators.
- Under the squared-error cost assignment, the performance of these unbiased estimators is measured by the variance of the estimates.
- The Rao-Blackwell-Lehmann-Sheffe theorem establishes a procedure for finding minimum variance unbiased (MVU) estimators.
- Subtle concepts will require some practice:
  - Sufficiency (Neyman-Fisher factorization theorem)
  - Completeness
- Following RBLS doesn't guarantee you will find an MVU estimator:
  - It can be difficult or impossible to find a complete sufficient statistic.
  - It is often difficult to check completeness.
  - Computing the conditional expectation can be intractable.
- Other techniques may be useful for checking whether an estimator is MVU.
- Further restrictions on the class of estimators also facilitate analysis.