ECE531 Lecture 8: Non-Random Parameter Estimation


D. Richard Brown III, Worcester Polytechnic Institute, 19-March-2009

Introduction

Recall the two basic classes of estimation problems:
- Known prior π(θ): Bayesian estimation.
- Unknown prior: non-random parameter estimation.

In non-random parameter estimation problems, we can still compute the risk of an estimator $\hat{\theta}(y)$ when the true parameter is $\theta$:
$$R_\theta(\hat{\theta}) := E_\theta\left[C_\theta(\hat{\theta}(Y))\right] = \int_{\mathcal{Y}} C_\theta(\hat{\theta}(y))\,p_Y(y;\theta)\,dy,$$
where $E_\theta$ denotes expectation parameterized by $\theta$ and $C : \Lambda \times \Lambda \to \mathbb{R}$, written $C_\theta(\hat{\theta})$, is the cost of the parameter estimate $\hat{\theta} \in \Lambda$ given the true parameter $\theta \in \Lambda$.

We cannot, however, compute any sort of average risk $r(\hat{\theta}) = E[R_\Theta(\hat{\theta})]$, since we have no distribution on the random parameter $\Theta$.
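Purely as an illustration (not part of the original slides), the parameterized risk can be approximated by Monte Carlo once a cost function, an observation model, and an estimator are fixed. The sketch below assumes squared-error cost and the Gaussian observation model used later in this lecture; all numerical values are arbitrary.

```python
import numpy as np

def conditional_risk(estimator, theta, n=10, sigma=1.0, num_trials=100_000, rng=None):
    """Monte Carlo approximation of R_theta(theta_hat) = E_theta[C_theta(theta_hat(Y))]
    for squared-error cost and Y_k = theta + W_k with W_k ~ N(0, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    y = theta + sigma * rng.standard_normal((num_trials, n))
    estimates = np.apply_along_axis(estimator, 1, y)
    return np.mean((estimates - theta) ** 2)

# The conditional risk is a function of the (unknown) true theta: the sample mean
# achieves sigma^2/n for every theta, while a constant estimator is only good near
# the value it always returns.
for theta in [0.0, 1.0, 5.0]:
    print(theta,
          conditional_risk(np.mean, theta),        # ~ sigma^2 / n = 0.1
          conditional_risk(lambda y: 0.0, theta))  # = theta^2
```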

Uniformly Most Powerful Estimators?

We would like to find a uniformly most powerful estimator $\hat{\theta}(y)$ that minimizes the conditional risk $R_\theta(\hat{\theta})$ for all $\theta \in \Lambda$.

Example: Consider two distinct points $\theta_0 \neq \theta_1$ in $\Lambda$. Suppose we have two estimators, $\hat{\theta}_a(y) \equiv \theta_0$ for all $y \in \mathcal{Y}$ and $\hat{\theta}_b(y) \equiv \theta_1$ for all $y \in \mathcal{Y}$. Note that both of these estimators ignore the observation and always give the same estimate. For any of the cost functions we have considered, we know that
$$C_{\theta_0}(\hat{\theta}_a(y)) = 0 \quad\text{and}\quad C_{\theta_1}(\hat{\theta}_b(y)) = 0.$$
For the cost assignments associated with MMSE or MMAE estimation, we also know that
$$C_{\theta_1}(\hat{\theta}_a(y)) > 0 \quad\text{and}\quad C_{\theta_0}(\hat{\theta}_b(y)) > 0.$$
No single estimator can simultaneously match the zero conditional risk of $\hat{\theta}_a$ at $\theta_0$ and of $\hat{\theta}_b$ at $\theta_1$, so it should be clear that a uniformly most powerful estimator is not going to exist in most cases of interest.

Some Options

1. We could restrict our attention to the sorts of problems that do admit a uniformly most powerful estimator.
2. We could try to find locally most powerful estimators.
3. We could assume a prior π(θ), e.g. some sort of least-favorable prior, and solve the problem in the Bayesian framework.
4. We could keep the problem non-random but place restrictions on the class of estimators that we are willing to consider.

Option 4: Consider Only Unbiased Estimators

A reasonable restriction on the class of estimators that we are willing to consider is the class of unbiased estimators.

Definition. An estimator $\hat{\theta}(y)$ is unbiased if
$$E_\theta\left[\hat{\theta}(Y)\right] = \theta$$
for all $\theta \in \Lambda$.

Remarks:
- This class excludes trivial estimators like $\hat{\theta}(y) \equiv \theta_0$.
- Under the squared-error cost assignment, the parameterized risk of an estimator in this class is
$$R_\theta(\hat{\theta}) = E_\theta\left[\|\theta - \hat{\theta}(Y)\|_2^2\right] = \sum_i E_\theta\left[(\hat{\theta}_i(Y) - \theta_i)^2\right] = \sum_i \mathrm{var}_\theta\left[\hat{\theta}_i(Y)\right],$$
where the last equality uses the fact that $E_\theta[\hat{\theta}_i(Y)] = \theta_i$ for unbiased estimators.
- The goal: find an unbiased estimator with minimum variance.

Minimum Variance Unbiased Estimators

Definition. A minimum-variance unbiased (MVU) estimator $\hat{\theta}_{\mathrm{mvu}}(y)$ is an unbiased estimator satisfying
$$\hat{\theta}_{\mathrm{mvu}} = \arg\min_{\hat{\theta} \in \Omega} R_\theta(\hat{\theta})$$
for all $\theta \in \Lambda$, where $\Omega$ is the set of all unbiased estimators.

Remarks:
- Finding an MVU estimator is still a multi-objective optimization problem.
- MVU estimators do not always exist.
- The class of problems in which MVU estimators do exist, however, is much larger than the class of problems admitting uniformly most powerful estimators.

Example: Estimating a Constant in White Gaussian Noise

Suppose we have random observations given by
$$Y_k = \theta + W_k, \quad k = 0, \ldots, n-1,$$
where $W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. The unknown parameter $\theta$ can take on any value on the real line and we have no prior pdf. Suppose we generate estimates with the sample mean:
$$\hat{\theta}(y) = \frac{1}{n}\sum_{k=0}^{n-1} y_k.$$

Is this estimator unbiased? Yes (easy to check). Is this estimator MVU? The variance of the estimator can be calculated as $\mathrm{var}_\theta[\hat{\theta}(Y)] = \sigma^2/n$, but answering whether this estimator is MVU will require more work.
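A short simulation (an illustrative addition, not part of the slides) confirms the two claims above: the sample mean is unbiased and its variance is $\sigma^2/n$. The values of θ, σ, and n below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 2.5, 1.0, 10      # arbitrary illustrative values
num_trials = 200_000

# Each row is one experiment: y_k = theta + w_k, k = 0, ..., n-1
y = theta + sigma * rng.standard_normal((num_trials, n))
theta_hat = y.mean(axis=1)          # sample-mean estimator for each trial

print("empirical mean of estimates  :", theta_hat.mean())   # ~ theta (unbiased)
print("empirical variance           :", theta_hat.var())    # ~ sigma^2 / n
print("predicted variance sigma^2/n :", sigma**2 / n)
```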

A Generalized Model for Estimation Problems

[Block diagram: a parameter θ in the parameter space Λ passes through the probabilistic model p_Y(y;θ) to produce an observation y in the observation space Y; an estimator maps y to ĝ(y), a sufficient statistic T(y) summarizes y, and the RBLS estimator g[T(y)] estimates g(θ).]

A Procedure for Finding MVU Estimators

To find an MVU estimator of $g(\theta)$, we can follow a three-step procedure:
1. Find a complete sufficient statistic $T$ for the family of pdfs $\{p_Y(y;\theta);\ \theta \in \Lambda\}$ parameterized by $\theta$.
2. Find any unbiased estimator $\hat{g}(y)$ of $g(\theta)$.
3. Compute $g[T(y)] = E_\theta[\hat{g}(Y) \mid T(Y) = T(y)]$.

The Rao-Blackwell-Lehmann-Scheffé theorem says that $g[T(y)]$ will be an MVU estimator of $g(\theta)$.

Sufficiency and Minimal Sufficiency

Definition. A statistic $T(Y)$ is a sufficient statistic for the family of pdfs $\{p_Y(y;\theta);\ \theta \in \Lambda\}$ if the distribution of the random observation conditioned on $T(Y)$, i.e. $p_Y(y \mid T(Y) = t;\theta)$, does not depend on $\theta$ for all $\theta \in \Lambda$ and all $t$.

Intuitively, a sufficient statistic summarizes the information contained in the observation about the unknown parameter. Knowing $T(y)$ is as good as knowing the full observation $y$ when we wish to estimate $\theta$ or $g(\theta)$.

Definition. A statistic $T(Y)$ is said to be minimal sufficient for the family of pdfs $\{p_Y(y;\theta);\ \theta \in \Lambda\}$ if it is a function of every other sufficient statistic for this family of pdfs.

Intuitively, a minimal sufficient statistic is the most concise summary of the observation. Minimal sufficient statistics can be hard to find and don't always exist.

Example: Sufficiency (part 1)

Suppose $\theta \in \mathbb{R}$ and we get a vector observation $y \in \mathbb{R}^n$ distributed as
$$p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1}(y_k - \theta)^2\right\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{\|y - \mathbf{1}\theta\|_2^2}{2\sigma^2}\right\}.$$

Is $T(y) = y = [y_0, \ldots, y_{n-1}]^\top$ sufficient? Intuitively, it should be. To gain some experience with the definition, however, let's check. What happens to $p_Y(y;\theta)$ when we condition on an observation $Y = [y_0, \ldots, y_{n-1}]^\top$?

Example: Sufficiency (part 2)

Is $T(y) = \bar{y} = \frac{1}{n}\sum_{k=0}^{n-1} y_k$ sufficient? Note that $\bar{Y} := \frac{1}{n}\sum_{k=0}^{n-1} Y_k$ is a random variable distributed as $\mathcal{N}(\theta, \sigma^2/n)$. We can apply the definition directly:
$$p_Y(y \mid \bar{y};\theta) \overset{\text{Bayes}}{=} \frac{p_{\bar{Y}}(\bar{y} \mid y;\theta)\,p_Y(y;\theta)}{p_{\bar{Y}}(\bar{y};\theta)} = \frac{\delta\!\left(\bar{y} - \tfrac{1}{n}\sum_k y_k\right) p_Y(y;\theta)}{p_{\bar{Y}}(\bar{y};\theta)} = c\,\delta\!\left(\bar{y} - \tfrac{1}{n}\sum_k y_k\right) \exp\left[-\frac{\sum_k y_k^2 - n\bar{y}^2}{2\sigma^2}\right],$$
where $\delta(\cdot)$ is the Dirac delta function and $c$ does not depend on $\theta$. Hence $T(y) = \frac{1}{n}\sum_{k=0}^{n-1} y_k$ is a sufficient statistic.

Neyman-Fisher Factorization Theorem

Guessing and checking sufficient statistics isn't very satisfying. We need a procedure for finding sufficient statistics.

Theorem (Fisher 1920, Neyman 1935). A statistic $T$ is sufficient for $\theta$ if and only if there exist functions $g_\theta$ and $h$ such that the pdf of the observation can be factored as
$$p_Y(y;\theta) = g_\theta(T(y))\,h(y)$$
for all $y \in \mathcal{Y}$ and all $\theta \in \Lambda$.

The Poor textbook gives the proof for the case when $\mathcal{Y}$ is discrete. A general proof can be found in Lehmann 1986.

Example: Neyman-Fisher Factorization Theorem

Suppose $\theta \in \mathbb{R}$ and
$$p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1}(y_k - \theta)^2\right\}.$$
Let $T(y) = \frac{1}{n}\sum_{k=0}^{n-1} y_k = \bar{y}$. We already know this is a sufficient statistic, but let's try the factorization:
$$p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1}\left(y_k^2 - 2\theta y_k + \theta^2\right)\right\} = \underbrace{\frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{n}{2\sigma^2}\left(\theta^2 - 2\theta\bar{y}\right)\right\}}_{g_\theta(T(y))} \underbrace{\exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1} y_k^2\right\}}_{h(y)}.$$
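The factorization can also be checked symbolically. The sketch below is an added illustration using sympy, with n = 3 chosen arbitrarily; it verifies that the exponent of the joint pdf equals the sum of the exponents of $g_\theta(\bar y)$ and $h(y)$ (the $(2\pi\sigma^2)^{-n/2}$ prefactors match by construction).

```python
import sympy as sp

n = 3
theta, sigma = sp.symbols('theta sigma', positive=True)
y = sp.symbols('y0:3', real=True)
ybar = sum(y) / n

# Exponent of the joint pdf of n i.i.d. N(theta, sigma^2) observations
expo_p = -sum((yk - theta)**2 for yk in y) / (2 * sigma**2)

# Exponents of the proposed factors: g_theta depends on y only through ybar,
# while h does not depend on theta.
expo_g = -n * (theta**2 - 2 * theta * ybar) / (2 * sigma**2)
expo_h = -sum(yk**2 for yk in y) / (2 * sigma**2)

# The factorization p = g_theta(ybar) * h(y) holds iff the exponents agree.
print(sp.expand(expo_p - (expo_g + expo_h)))   # prints 0
```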

Flashback: Sufficient Statistic For Simple Binary HT

Suppose we have a simple binary hypothesis testing problem:
$$Y \sim p_Y(y;\theta) = \begin{cases} p_0(y) & \theta = 0 \\ p_1(y) & \theta = 1 \end{cases}$$
Let
$$T(y) = \frac{p_1(y)}{p_0(y)}, \qquad g_\theta(T(y)) = \theta\,T(y) + (1 - \theta), \qquad h(y) = p_0(y).$$
Then it is easy to show that $p_Y(y;\theta) = g_\theta(T(y))\,h(y)$. Hence the likelihood ratio $L(y) = \frac{p_1(y)}{p_0(y)}$ is a sufficient statistic for simple binary hypothesis testing problems.
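Spelling out the "easy to show" step, evaluating $g_\theta(T(y))\,h(y)$ at the two possible values of $\theta$ gives back the correct pdf in each case (wherever $p_0(y) > 0$):
$$\theta = 0:\;\; g_0(T(y))\,h(y) = \bigl[0\cdot T(y) + 1\bigr]p_0(y) = p_0(y), \qquad \theta = 1:\;\; g_1(T(y))\,h(y) = \frac{p_1(y)}{p_0(y)}\,p_0(y) = p_1(y).$$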

Completeness

Definition. The family of pdfs $\{p_Y(y;\theta);\ \theta \in \Lambda\}$ is said to be complete if the condition $E_\theta[f(Y)] = 0$ for all $\theta \in \Lambda$ implies that $\mathrm{Prob}_\theta[f(Y) = 0] = 1$ for all $\theta \in \Lambda$. Note that $f : \mathcal{Y} \to \mathbb{R}$ can be any function.

To get some intuition, consider the case where $\mathcal{Y} = \{y_0, \ldots, y_{L-1}\}$ is a finite set. Then
$$E_\theta[f(Y)] = \sum_{l=0}^{L-1} f(y_l)\,\mathrm{Prob}_\theta(Y = y_l) = f^\top P_\theta,$$
where $f = [f(y_0), \ldots, f(y_{L-1})]^\top$ and $P_\theta$ is the vector of probabilities $\mathrm{Prob}_\theta(Y = y_l)$. For a fixed $\theta$ it is certainly possible to find a non-zero $f$ such that $E_\theta[f(Y)] = 0$. But we have to satisfy this condition for all $\theta \in \Lambda$, i.e. we would need a vector $f$ that is orthogonal to all members of the family of vectors $\{P_\theta;\ \theta \in \Lambda\}$. If the only vector satisfying $E_\theta[f(Y)] = 0$ for all $\theta \in \Lambda$ is $f(y_0) = \cdots = f(y_{L-1}) = 0$, then the family $\{P_\theta;\ \theta \in \Lambda\}$ is complete.

Complete Sufficient Statistics

Definition. Suppose that $T$ is a sufficient statistic for the family of pdfs $\{p_Y(y;\theta);\ \theta \in \Lambda\}$. Let $p_Z(z;\theta)$ denote the distribution of $Z = T(Y)$ when the parameter is $\theta$. If the family of pdfs $\{p_Z(z;\theta);\ \theta \in \Lambda\}$ is complete, then $T$ is said to be a complete sufficient statistic for the family $\{p_Y(y;\theta);\ \theta \in \Lambda\}$.

[Block diagram: a parameter θ in the parameter space Λ passes through the probabilistic model p_Y(y;θ) to produce an observation Y in the observation space, which is mapped to the sufficient statistic Z = T(Y).]

Example: Complete Sufficient Statistic

Suppose $\theta \in \mathbb{R}$ and
$$p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1}(y_k - \theta)^2\right\}.$$
Let $T(y) = \frac{1}{n}\sum_{k=0}^{n-1} y_k$. We know this is a sufficient statistic, but is it complete?

We know the distribution of $T(Y) = \bar{Y}$ is $\mathcal{N}(\theta, \sigma^2/n)$. We require $E_\theta[f(\bar{Y})] = 0$ for all $\theta \in \Lambda$, i.e.
$$s(\theta) = \int_{-\infty}^{\infty} f(x)\,\sqrt{\frac{n}{2\pi\sigma^2}}\,\exp\left\{-\frac{n(x-\theta)^2}{2\sigma^2}\right\}dx = 0 \quad \text{for all } \theta \in \Lambda.$$
Is there any non-zero $f : \mathbb{R} \to \mathbb{R}$ that can do this? Suppose $\theta = 0$. Lots of functions will work for that single value, e.g. $f(x) = x$, $f(x) = \sin(x)$, etc. But we need $s(\theta) = 0$ for all $\theta \in \Lambda$.

Example: Complete Sufficient Statistic (continued)

$$s(\theta) = \int_{-\infty}^{\infty} f(x)\,\sqrt{\frac{n}{2\pi\sigma^2}}\,\exp\left\{-\frac{n(x-\theta)^2}{2\sigma^2}\right\}dx = 0 \;\;\text{for all } \theta \in \Lambda \quad\Longleftrightarrow\quad \int_{-\infty}^{\infty} f(x)\exp\left\{-\frac{n(\theta - x)^2}{2\sigma^2}\right\}dx = 0 \;\;\text{for all } \theta \in \Lambda.$$

But this is just the convolution of $f(x)$ with a Gaussian pulse. Recall that convolution in the time domain is multiplication in the frequency domain. Hence, if $S(\omega)$ is the Fourier transform of $s(\theta)$, then
$$S(\omega) = F(\omega)G(\omega) = 0 \quad \text{for all } \omega,$$
where $F(\omega)$ is the Fourier transform of $f$ and $G(\omega)$ is the Fourier transform of the Gaussian pulse. Note that $G(\omega)$ is itself Gaussian and therefore positive for all $\omega$. Hence, the only way to force $S(\omega) \equiv 0$ is to have $F(\omega) \equiv 0$. So the only solution to $E_\theta[f(\bar{Y})] = 0$ for all $\theta \in \Lambda$ is $f(x) \equiv 0$ for all $x$ and, consequently, $T(y)$ is a complete sufficient statistic.

Example: Incomplete Sufficient Statistic

Suppose $\theta \in \mathbb{R}$ and you would like to estimate $\theta$ from a scalar observation $Y = \theta + W$ where $W \sim \mathcal{U}\left[-\tfrac{1}{2}, \tfrac{1}{2}\right]$. An obvious sufficient statistic is $T(y) = y$. But is it complete?

Since $T(Y) = Y$, we require $E_\theta[f(Y)] = 0$ for all $\theta \in \Lambda$, i.e.
$$s(\theta) = \int_{\theta - \frac{1}{2}}^{\theta + \frac{1}{2}} f(x)\,dx = 0 \quad \text{for all } \theta \in \Lambda.$$
Is there any non-zero $f : \mathbb{R} \to \mathbb{R}$ that can do this? How about $f(x) = \sin(2\pi x)$? This definitely forces $s(\theta) = 0$ for all $\theta \in \Lambda$; we just need to confirm that $\mathrm{Prob}_\theta[f(Y) = 0] < 1$ for at least one $\theta \in \Lambda$. Since we found a non-zero $f(x)$ that forces $E_\theta[f(Y)] = 0$ for all $\theta \in \Lambda$, we can say that $T(y) = y$ is not complete.
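A quick numerical confirmation (added here for illustration, not from the slides): integrating sin(2πx) over any interval of length one yields zero, for every shift θ.

```python
import numpy as np
from scipy.integrate import quad

# s(theta) = integral of sin(2*pi*x) over [theta - 1/2, theta + 1/2]
for theta in np.linspace(-3.0, 3.0, 7):
    s, _ = quad(lambda x: np.sin(2 * np.pi * x), theta - 0.5, theta + 0.5)
    print(f"theta = {theta:+.2f}   s(theta) = {s:+.2e}")   # ~ 0 for every theta
```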

Completeness Theorem for Exponential Families

Theorem. Suppose $\mathcal{Y} = \mathbb{R}^n$, $\Lambda \subset \mathbb{R}^m$, and
$$p_Y(y;\theta) = a(\theta)\exp\left\{\sum_{l=1}^{m} \theta_l T_l(y)\right\} h(y),$$
where $a, T_1, \ldots, T_m$, and $h$ are all real-valued functions. Then $T(y) = [T_1(y), \ldots, T_m(y)]^\top$ is a complete sufficient statistic for the family $\{p_Y(y;\theta);\ \theta \in \Lambda\}$ if $\Lambda$ contains an $m$-dimensional rectangle.

Remarks:
- The technical condition about the $m$-dimensional rectangle ensures that $\Lambda$ is not missing any dimensions in $\mathbb{R}^m$, e.g. $\Lambda$ is not a two-dimensional plane in $\mathbb{R}^3$.
- The main idea of the proof is similar to how we showed completeness in the Gaussian example. See Poor pp. 165-166 and Lehmann 1986.
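As an added connection to the running example (a standard manipulation, not from the slides): with known $\sigma^2$, the Gaussian family used earlier can be rearranged into the exponential-family form above with $m = 1$,
$$p_Y(y;\theta) = \underbrace{\frac{e^{-n\theta^2/(2\sigma^2)}}{(2\pi\sigma^2)^{n/2}}}_{a(\theta)}\,\exp\Bigl\{\theta \cdot \underbrace{\frac{1}{\sigma^2}\sum_{k=0}^{n-1} y_k}_{T_1(y)}\Bigr\}\,\underbrace{\exp\Bigl\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1} y_k^2\Bigr\}}_{h(y)}.$$
Since $\Lambda = \mathbb{R}$ contains a one-dimensional rectangle (any interval), $T_1(y) = \sigma^{-2}\sum_k y_k$ is a complete sufficient statistic, and so is the sample mean, which is an invertible function of it; this is consistent with the Fourier-transform argument on the previous slide.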

Rao-Blackwell-Lehmann-Scheffé Theorem

Theorem. If $\hat{g}(y)$ is any unbiased estimator of $g(\theta)$ and $T$ is a sufficient statistic for the family $\{p_Y(y;\theta);\ \theta \in \Lambda\}$, then
$$g[T(y)] := E_\theta[\hat{g}(Y) \mid T(Y) = T(y)]$$
is
1. a valid estimator of $g(\theta)$ (i.e. not a function of $\theta$),
2. an unbiased estimator of $g(\theta)$, and
3. of lesser or equal variance than $\hat{g}(y)$ for all $\theta \in \Lambda$.

Additionally, if $T$ is complete, then $g[T(y)]$ is an MVU estimator of $g(\theta)$.

Example: Estimating a Constant in White Gaussian Noise

Suppose $\theta \in \mathbb{R}$ and
$$p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{k=0}^{n-1}(y_k - \theta)^2\right\}.$$
Let $T(y) = \frac{1}{n}\sum_{k=0}^{n-1} y_k$. We know this is a complete sufficient statistic. Let's apply the RBLS theorem to find the MVU estimator.

Suppose $g(\theta) = \theta$ is just the identity mapping. We could choose the unbiased estimator $\hat{g}(y) = y_0$. Now we need to compute
$$g[T(y)] := E_\theta[\hat{g}(Y) \mid T(Y) = T(y)] = E_\theta\!\left[Y_0 \,\Big|\, \frac{1}{n}\sum_k Y_k = \frac{1}{n}\sum_k y_k\right].$$
To solve this, we can use a standard formula for the conditional expectation of jointly Gaussian random variables.

Example (continued)

Suppose $Z = [X, Y]^\top$ is jointly Gaussian distributed. It can be shown that
$$E[X \mid Y = y] = E[X] + \frac{\mathrm{cov}(X, Y)}{\mathrm{var}(Y)}\,(y - E[Y]).$$
In our problem, letting $\bar{Y} = \frac{1}{n}\sum_{k=0}^{n-1} Y_k$, we can use this result to write
$$E_\theta\!\left[Y_0 \mid \bar{Y} = t\right] = E_\theta[Y_0] + \frac{\mathrm{cov}_\theta(Y_0, \bar{Y})}{\mathrm{var}_\theta(\bar{Y})}\left(\frac{1}{n}\sum_k y_k - E_\theta[\bar{Y}]\right) = \theta + \frac{\sigma^2/n}{\sigma^2/n}\left(\frac{1}{n}\sum_k y_k - \theta\right) = \frac{1}{n}\sum_k y_k.$$
Hence $\hat{\theta}_{\mathrm{mvu}}(y) = \frac{1}{n}\sum_k y_k$ is an MVU estimator of $\theta$.
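An informal Monte Carlo check of this conditional expectation (an added illustration with arbitrary θ, σ, and n, not part of the slides): binning realizations by the value of the sample mean, the average of $Y_0$ within each bin should match the bin's value of $\bar{Y}$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n = 2.0, 1.0, 5        # arbitrary illustrative values
num_trials = 1_000_000

y = theta + sigma * rng.standard_normal((num_trials, n))
ybar = y.mean(axis=1)

# Condition on ybar falling in a narrow bin around a few target values t
for t in [1.5, 2.0, 2.5]:
    in_bin = np.abs(ybar - t) < 0.01
    print(f"t = {t:.2f}   E[Y0 | Ybar ~ t] is about {y[in_bin, 0].mean():.3f}")
```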

Conclusions

- To approach non-random parameter estimation problems, we had to restrict our attention to the class of unbiased estimators.
- Under the squared-error cost assignment, the performance of these unbiased estimators is measured by the variance of the estimates.
- The Rao-Blackwell-Lehmann-Scheffé theorem establishes a procedure for finding minimum variance unbiased (MVU) estimators.
- Subtle concepts that will require some practice:
  - Sufficiency (Neyman-Fisher factorization theorem)
  - Completeness
- Following RBLS doesn't guarantee you will find an MVU estimator:
  - It can be difficult or impossible to find a complete sufficient statistic.
  - It is often difficult to check completeness.
  - Computing the conditional expectation can be intractable.
- Other techniques may be useful for checking whether an estimator is MVU.
- Further restrictions on the class of estimators can also facilitate analysis.