Unbiased Estimation

The binomial problem illustrates a general phenomenon: an estimator can be good for some values of θ and bad for others. To compare $\hat\theta$ and $\tilde\theta$, two estimators of θ, say that $\hat\theta$ is better than $\tilde\theta$ if it has uniformly smaller MSE:
$$MSE_{\hat\theta}(\theta) \le MSE_{\tilde\theta}(\theta) \quad \text{for all } \theta.$$
Normally we also require that the inequality be strict for at least one θ.
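
As a quick numerical illustration of this comparison (a sketch, not part of the original notes; the two estimators X/n and (X+1)/(n+2) and the value n = 20 are assumptions chosen only for the example), the following Python snippet computes the exact MSE of both estimators of θ in the Binomial(n, θ) model and shows that neither has uniformly smaller MSE:

```python
# Exact MSE of two Binomial(n, theta) estimators: the MLE X/n and the
# shrinkage estimator (X+1)/(n+2). Neither dominates the other.
import numpy as np
from scipy.stats import binom

n = 20
x = np.arange(n + 1)

for theta in (0.05, 0.5):
    pmf = binom.pmf(x, n, theta)
    mse_mle = np.sum(pmf * (x / n - theta) ** 2)
    mse_shrunk = np.sum(pmf * ((x + 1) / (n + 2) - theta) ** 2)
    print(f"theta={theta}: MSE(X/n)={mse_mle:.5f}, MSE((X+1)/(n+2))={mse_shrunk:.5f}")
# Near theta = 0.5 the shrinkage estimator wins; near the ends the MLE wins.
```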

Question: is there a best estimator, one which is better than every other estimator? Answer: NO. Suppose $\hat\theta$ were such a best estimator. Fix a $\theta^*$ in Θ and let $\tilde\theta \equiv \theta^*$ (the constant estimator). Then the MSE of $\tilde\theta$ is 0 when $\theta = \theta^*$. Since $\hat\theta$ is better than $\tilde\theta$ we must have $MSE_{\hat\theta}(\theta^*) = 0$, so that $\hat\theta = \theta^*$ with probability equal to 1. So $\hat\theta \equiv \theta^*$. If there are actually two different possible values of θ this gives a contradiction; so no such $\hat\theta$ exists.

Principle of Unbiasedness: A good estimate is unbiased, that is, $E_\theta(\hat\theta) \equiv \theta$. WARNING: In my view the Principle of Unbiasedness is a load of hogwash. For an unbiased estimate the MSE is just the variance.

Definition: An estimator $\hat\phi$ of a parameter $\phi = \phi(\theta)$ is Uniformly Minimum Variance Unbiased (UMVU) if, whenever $\tilde\phi$ is an unbiased estimate of φ, we have
$$Var_\theta(\hat\phi) \le Var_\theta(\tilde\phi).$$
We call $\hat\phi$ the UMVUE. (The E is for Estimator.) The point of having φ(θ) is to study problems like estimating µ when you have two parameters, like µ and σ, for example.

Cramér–Rao Inequality

Suppose T(X) is some unbiased estimator of θ. We can derive some information from the identity $E_\theta(T(X)) \equiv \theta$. When we worked with the score function we derived some information from the identity $\int f(x,\theta)\,dx \equiv 1$ by differentiation, and we do the same here. Since T(X) is an unbiased estimate of θ,
$$E_\theta(T(X)) = \int T(x)\,f(x,\theta)\,dx = \theta.$$
Differentiate both sides to get
$$1 = \frac{d}{d\theta}\int T(x)\,f(x,\theta)\,dx = \int T(x)\,\frac{\partial}{\partial\theta} f(x,\theta)\,dx = \int T(x)\,\frac{\partial}{\partial\theta}\log f(x,\theta)\, f(x,\theta)\,dx = E_\theta\{T(X)U(\theta)\},$$
where U is the score function.

Remember: $Cov(W,Z) = E(WZ) - E(W)E(Z)$. Here
$$Cov_\theta(T(X), U(\theta)) = E(T(X)U(\theta)) - E(T(X))E(U(\theta)).$$
But recall that the score U(θ) has mean 0, so
$$Cov_\theta(T(X), U(\theta)) = 1.$$
The definition of correlation gives
$$\{Corr(W,Z)\}^2 = \frac{\{Cov(W,Z)\}^2}{Var(W)\,Var(Z)}.$$
Squared correlations are at most 1, therefore
$$Var_\theta(T)\,Var_\theta(U(\theta)) \ge 1.$$
Remember $Var(U(\theta)) = I(\theta)$. Therefore
$$Var_\theta(T) \ge \frac{1}{I(\theta)}.$$
The RHS is called the Cramér–Rao Lower Bound.
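
A small Monte Carlo check of these two facts (a sketch not in the notes, using an assumed Bernoulli model with n = 50 and p = 0.3): with T the sample mean, which is unbiased for p, the simulated Cov(T, U(p)) is close to 1 and Var(T) equals the bound 1/I(p) = p(1-p)/n.

```python
# Monte Carlo check of Cov(T, U) = 1 and Var(T) >= 1/I(theta)
# for iid Bernoulli(p) data with T = sample mean.
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 50, 0.3, 200_000
x = rng.binomial(1, p, size=(reps, n))

T = x.mean(axis=1)                                       # unbiased estimator of p
S = x.sum(axis=1)
U = S / p - (n - S) / (1 - p)                            # score evaluated at the true p

print("Cov(T, U)   :", np.cov(T, U)[0, 1])               # close to 1
print("Var(T)      :", T.var())                          # p(1-p)/n = 0.0042
print("CRLB 1/I(p) :", p * (1 - p) / n)                  # attained here
```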

Examples of the Cramér–Rao Lower Bound:

1) $X_1,\ldots,X_n$ iid Exponential with mean µ:
$$f(x) = \frac{1}{\mu}\,e^{-x/\mu}, \quad x > 0.$$
Log-likelihood:
$$\ell = -n\log\mu - \sum X_i/\mu.$$
Score ($U(\mu) = \partial\ell/\partial\mu$):
$$U(\mu) = -\frac{n}{\mu} + \frac{\sum X_i}{\mu^2}.$$
Negative second derivative:
$$V(\mu) = -U'(\mu) = -\frac{n}{\mu^2} + \frac{2\sum X_i}{\mu^3}.$$

Take the expected value to compute the Fisher information:
$$I(\mu) = -\frac{n}{\mu^2} + \frac{2\sum E(X_i)}{\mu^3} = -\frac{n}{\mu^2} + \frac{2n\mu}{\mu^3} = \frac{n}{\mu^2}.$$
So if $T(X_1,\ldots,X_n)$ is an unbiased estimator of µ then
$$Var(T) \ge \frac{1}{I(\mu)} = \frac{\mu^2}{n}.$$
Example: $\bar X$ is an unbiased estimate of µ. Also: $Var(\bar X) = \mu^2/n$. SO: $\bar X$ is the best unbiased estimate. It has the smallest possible variance. We say it is a Uniformly Minimum Variance Unbiased Estimator of µ.
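
A quick simulation of this exponential example (a sketch with assumed values n = 10 and µ = 2) confirms that the variance of $\bar X$ matches the bound $\mu^2/n$:

```python
# Check that Var(Xbar) attains the CRLB mu^2/n for iid Exponential(mean mu).
import numpy as np

rng = np.random.default_rng(1)
n, mu, reps = 10, 2.0, 500_000
xbar = rng.exponential(scale=mu, size=(reps, n)).mean(axis=1)

print("simulated Var(Xbar):", xbar.var())   # approx 0.4
print("CRLB mu^2 / n      :", mu**2 / n)    # 0.4
```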

Similar ideas apply with more than one parameter. Example: $X_1,\ldots,X_n$ a sample from $N(\mu,\sigma^2)$. Suppose $(\hat\mu, \hat\sigma^2)$ is an unbiased estimator of $(\mu,\sigma^2)$ (such as $(\bar X, s^2)$). We are estimating $\sigma^2$, not σ, so give $\sigma^2$ its own symbol: define $\tau = \sigma^2$. Find the information matrix because the CRLB is its inverse. Log-likelihood:
$$\ell = -\frac{n}{2}\log\tau - \frac{\sum(X_i-\mu)^2}{2\tau} - \frac{n}{2}\log(2\pi).$$
Score:
$$U = \begin{pmatrix} \dfrac{\sum(X_i-\mu)}{\tau} \\[2ex] -\dfrac{n}{2\tau} + \dfrac{\sum(X_i-\mu)^2}{2\tau^2} \end{pmatrix}.$$

Negative second derivative matrix:
$$V = \begin{pmatrix} \dfrac{n}{\tau} & \dfrac{\sum(X_i-\mu)}{\tau^2} \\[2ex] \dfrac{\sum(X_i-\mu)}{\tau^2} & -\dfrac{n}{2\tau^2} + \dfrac{\sum(X_i-\mu)^2}{\tau^3} \end{pmatrix}.$$
Fisher information matrix:
$$I(\mu,\tau) = \begin{bmatrix} \dfrac{n}{\tau} & 0 \\[1ex] 0 & \dfrac{n}{2\tau^2} \end{bmatrix}.$$
Cramér–Rao lower bound:
$$Var(\hat\mu, \hat\sigma^2) \ge \{I(\mu,\tau)\}^{-1} = \begin{bmatrix} \dfrac{\tau}{n} & 0 \\[1ex] 0 & \dfrac{2\tau^2}{n} \end{bmatrix}.$$
In particular,
$$Var(\hat\sigma^2) \ge \frac{2\tau^2}{n} = \frac{2\sigma^4}{n}.$$
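
One way to check this matrix numerically (a sketch with assumed values µ = 1, τ = 2 and n = 25) is to average the outer product of the score vector over many simulated samples; by the information identity this average converges to I(µ, τ):

```python
# Numerical check of the Fisher information matrix for (mu, tau = sigma^2)
# in the N(mu, sigma^2) model: average the outer product of the score vector.
import numpy as np

rng = np.random.default_rng(2)
n, mu, tau, reps = 25, 1.0, 2.0, 200_000
x = rng.normal(mu, np.sqrt(tau), size=(reps, n))

d = x - mu
u_mu = d.sum(axis=1) / tau                                    # dl/dmu
u_tau = -n / (2 * tau) + (d**2).sum(axis=1) / (2 * tau**2)    # dl/dtau

scores = np.stack([u_mu, u_tau], axis=1)
I_hat = scores.T @ scores / reps                              # estimate of E[U U^T]
I_theory = np.array([[n / tau, 0.0], [0.0, n / (2 * tau**2)]])
print("simulated I:\n", I_hat)
print("theoretical I:\n", I_theory)
```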

Notice:
$$Var(s^2) = Var\!\left(\frac{\sigma^2}{n-1}\cdot\frac{(n-1)s^2}{\sigma^2}\right) = \frac{\sigma^4}{(n-1)^2}\,Var(\chi^2_{n-1}) = \frac{2\sigma^4}{n-1} > \frac{2\sigma^4}{n}.$$
Conclusions: the variance of the sample variance is larger than the lower bound, but the ratio of the variance to the lower bound is $n/(n-1)$, which is nearly 1. Fact: $s^2$ is the UMVUE anyway; see later in the course.
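
A short simulation (a sketch with assumed values σ = 1 and n = 10) confirms that $Var(s^2) \approx 2\sigma^4/(n-1)$, slightly above the bound $2\sigma^4/n$:

```python
# Compare the variance of the sample variance s^2 to the CRLB 2*sigma^4/n.
import numpy as np

rng = np.random.default_rng(3)
n, sigma, reps = 10, 1.0, 500_000
s2 = rng.normal(0.0, sigma, size=(reps, n)).var(axis=1, ddof=1)

print("simulated Var(s^2) :", s2.var())                    # approx 2/(n-1) = 0.222
print("2 sigma^4 / (n-1)  :", 2 * sigma**4 / (n - 1))
print("CRLB 2 sigma^4 / n :", 2 * sigma**4 / n)            # 0.2
```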

Slightly more generally, if $E(T) = \phi(\theta)$ for some function φ then a similar argument gives
$$Var(T) \ge \frac{\left(\phi'(\theta)\right)^2}{I(\theta)}.$$
The inequality is strict unless the correlation is $\pm 1$, which requires
$$U(\theta) = A(\theta)T(X) + B(\theta)$$
for non-random constants A and B (which may depend on θ). This would prove that
$$\ell(\theta) = A^*(\theta)T(X) + B^*(\theta) + C(X)$$
for other constants $A^*$ and $B^*$, and finally, with $h = e^C$,
$$f(x,\theta) = h(x)\,e^{A^*(\theta)T(x) + B^*(\theta)}.$$

Summary of Implications

You can sometimes recognize a UMVUE. If $Var_\theta(T(X)) \equiv 1/I(\theta)$ then T(X) is the UMVUE. In the N(µ, 1) example the Fisher information is n and $Var(\bar X) = 1/n$, so that $\bar X$ is the UMVUE of µ. In an asymptotic sense the MLE is nearly optimal: it is nearly unbiased and its (approximate) variance is nearly 1/I(θ). Good estimates are highly correlated with the score. Densities of the exponential form given above (the exponential family) are special in this sense. Usually the inequality is strict, unless the score is an affine function of a statistic T and T (or T/c for a constant c) is unbiased for θ.

What can we do to find UMVUEs when the CRLB is a strict inequality? Use:

Sufficiency: choose good summary statistics.
Completeness: recognize unique good statistics.
Rao-Blackwell theorem: a mechanical way to improve unbiased estimates.
Lehmann-Scheffé theorem: a way to prove an estimate is UMVUE.

Sufficiency

Example: Suppose $X_1,\ldots,X_n$ are iid Bernoulli(θ) and $T(X_1,\ldots,X_n) = \sum X_i$. Consider the conditional distribution of $X_1,\ldots,X_n$ given T. Take n = 4:
$$P(X_1=1, X_2=1, X_3=0, X_4=0 \mid T=2) = \frac{P(X_1=1, X_2=1, X_3=0, X_4=0, T=2)}{P(T=2)} = \frac{P(X_1=1, X_2=1, X_3=0, X_4=0)}{P(T=2)} = \frac{\theta^2(1-\theta)^2}{\binom{4}{2}\theta^2(1-\theta)^2} = \frac{1}{6}.$$
Notice the disappearance of θ! This happens for all possibilities for n, T and the Xs.
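
The disappearance of θ can be checked by brute force (a sketch, not in the notes): enumerate all 0-1 sequences of length 4 and compute the conditional probability of the particular sequence (1, 1, 0, 0) given T = 2 for several values of θ.

```python
# Brute-force check that P(X = (1,1,0,0) | T = 2) = 1/6 for every theta.
from itertools import product

def cond_prob(seq, theta):
    n, t = len(seq), sum(seq)
    p_seq = theta**t * (1 - theta)**(n - t)                  # P(X = seq)
    p_t = sum(theta**sum(s) * (1 - theta)**(n - sum(s))      # P(T = t)
              for s in product([0, 1], repeat=n) if sum(s) == t)
    return p_seq / p_t

for theta in (0.1, 0.5, 0.9):
    print(theta, cond_prob((1, 1, 0, 0), theta))   # always 1/6
```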

In the binomial situation we say the conditional distribution of the data given the summary statistic T is free of θ.

Defn: A statistic T(X) is sufficient for the model $\{P_\theta;\ \theta\in\Theta\}$ if the conditional distribution of the data X given T = t is free of θ.

Intuition: Data tell us about θ if different values of θ give different distributions to X. If two different values of θ correspond to the same density or cdf for X, we cannot distinguish these two values of θ by examining X. Extension of this notion: if two values of θ give the same conditional distribution of X given T, then observing X in addition to T doesn't improve our ability to distinguish the two values.

Theorem [Rao-Blackwell]: Suppose S(X) is a sufficient statistic for the model $\{P_\theta;\ \theta\in\Theta\}$. If T is an estimate of φ(θ) then:

1. E(T | S) is a statistic.
2. E(T | S) has the same bias as T; if T is unbiased so is E(T | S).
3. $Var_\theta(E(T\mid S)) \le Var_\theta(T)$, and the inequality is strict unless T is a function of S.
4. The MSE of E(T | S) is no more than the MSE of T.

Usage: Review conditional distributions. Defn: if X, Y are rvs with a joint density then
$$E(g(Y)\mid X=x) = \int g(y)\,f_{Y\mid X}(y\mid x)\,dy$$
and $E(Y\mid X)$ is this function of x evaluated at X. Important property:
$$E\{R(X)E(Y\mid X)\} = \int R(x)E(Y\mid X=x)\,f_X(x)\,dx = \int\!\!\int R(x)\,y\,f_X(x)f(y\mid x)\,dy\,dx = \int\!\!\int R(x)\,y\,f_{X,Y}(x,y)\,dy\,dx = E(R(X)Y).$$
Think of $E(Y\mid X)$ as the average of Y holding X fixed. It behaves like an ordinary expected value, but functions of X only are like constants:
$$E\Big(\sum A_i(X)Y_i \,\Big|\, X\Big) = \sum A_i(X)E(Y_i\mid X).$$

Examples: In the binomial problem $Y_1(1-Y_2)$ is an unbiased estimate of $p(1-p)$. We improve this by computing $E(Y_1(1-Y_2)\mid X)$. We do this in two steps: first compute $E(Y_1(1-Y_2)\mid X=x)$.

Notice that the random variable $Y_1(1-Y_2)$ is either 1 or 0, so its expected value is just the probability that it is equal to 1:
$$\begin{aligned}
E(Y_1(1-Y_2)\mid X=x) &= P(Y_1(1-Y_2)=1\mid X=x) \\
&= P(Y_1=1, Y_2=0 \mid Y_1+Y_2+\cdots+Y_n = x) \\
&= \frac{P(Y_1=1, Y_2=0, Y_1+\cdots+Y_n = x)}{P(Y_1+Y_2+\cdots+Y_n = x)} \\
&= \frac{P(Y_1=1, Y_2=0, Y_3+\cdots+Y_n = x-1)}{P(Y_1+\cdots+Y_n = x)} \\
&= \frac{p(1-p)\binom{n-2}{x-1}p^{x-1}(1-p)^{(n-2)-(x-1)}}{\binom{n}{x}p^x(1-p)^{n-x}} \\
&= \frac{\binom{n-2}{x-1}}{\binom{n}{x}} = \frac{x(n-x)}{n(n-1)}.
\end{aligned}$$
This is simply $n\hat p(1-\hat p)/(n-1)$ with $\hat p = x/n$ (which can be bigger than 1/4, the maximum value of $p(1-p)$).
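
The conditional expectation just derived can be confirmed by simulation (a sketch with assumed values n = 8 and p = 0.3): averaging $Y_1(1-Y_2)$ within each observed value of X reproduces $x(n-x)/(n(n-1))$.

```python
# Check E(Y1 (1 - Y2) | X = x) = x (n - x) / (n (n - 1)) by simulation.
import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 8, 0.3, 400_000
y = rng.binomial(1, p, size=(reps, n))
X = y.sum(axis=1)
W = y[:, 0] * (1 - y[:, 1])            # the crude unbiased estimator of p(1-p)

for x in range(n + 1):
    mask = X == x
    if mask.sum() > 1000:              # only report well-estimated cells
        print(x, W[mask].mean(), x * (n - x) / (n * (n - 1)))
```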

Example: If $X_1,\ldots,X_n$ are iid N(µ, 1) then $\bar X$ is sufficient and $X_1$ is an unbiased estimate of µ. Now
$$E(X_1\mid \bar X) = E[X_1 - \bar X + \bar X \mid \bar X] = E[X_1 - \bar X \mid \bar X] + \bar X = \bar X,$$
since $X_1 - \bar X$ is independent of $\bar X$ with mean 0. This is (as we see later) the UMVUE.
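
For a numerical sanity check (a sketch with assumed values n = 5 and µ = 2): since $(X_1, \bar X)$ are jointly normal with the same mean and $Cov(X_1, \bar X) = Var(\bar X) = 1/n$, the conditional mean of $X_1$ given $\bar X$ is exactly $\bar X$; the covariance identity itself is easy to verify by simulation.

```python
# Verify Cov(X1, Xbar) = Var(Xbar) = 1/n for iid N(mu, 1); with equal means
# and joint normality this gives E(X1 | Xbar) = Xbar.
import numpy as np

rng = np.random.default_rng(5)
n, mu, reps = 5, 2.0, 500_000
x = rng.normal(mu, 1.0, size=(reps, n))
xbar = x.mean(axis=1)

print("Cov(X1, Xbar):", np.cov(x[:, 0], xbar)[0, 1])  # approx 1/n = 0.2
print("Var(Xbar)    :", xbar.var())                   # approx 1/n = 0.2
```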

Finding Sufficient Statistics

Binomial(n, θ): the log likelihood ℓ(θ) (the part depending on θ) is a function of X alone, not of $Y_1,\ldots,Y_n$ as well. Normal example: ignoring terms not containing µ, ℓ(µ) is
$$\ell(\mu) = \mu\sum X_i - n\mu^2/2 = n\mu\bar X - n\mu^2/2.$$
The Factorization Criterion. Theorem: If the model for data X has density f(x, θ) then the statistic S(X) is sufficient if and only if the density can be factored as
$$f(x,\theta) = g(S(x),\theta)\,h(x).$$
Examples of the Factorization Criterion:

Example: If $X_1,\ldots,X_n$ are iid $N(\mu,\sigma^2)$ then the joint density is
$$(2\pi)^{-n/2}\sigma^{-n}\exp\Big\{-\sum X_i^2/(2\sigma^2) + \mu\sum X_i/\sigma^2 - n\mu^2/(2\sigma^2)\Big\},$$
which is evidently a function of $\big(\sum X_i^2, \sum X_i\big)$. This pair is a sufficient statistic. You can write this pair as a bijective function of $\big(\bar X, \sum(X_i-\bar X)^2\big)$, so that pair is also sufficient.

Example: If $Y_1,\ldots,Y_n$ are iid Bernoulli(p) then
$$f(y_1,\ldots,y_n; p) = \prod p^{y_i}(1-p)^{1-y_i} = p^{\sum y_i}(1-p)^{n-\sum y_i}.$$
Define $g(x,p) = p^x(1-p)^{n-x}$ and $h \equiv 1$ to see that $X = \sum Y_i$ is sufficient by the factorization criterion.

Completeness; Lehmann-Scheffé

Example: Suppose X has a Binomial(n, p) distribution. The score function is
$$U(p) = \frac{X}{p(1-p)} - \frac{n}{1-p}.$$
The CRLB will be strict unless T = cX for some c. If we are trying to estimate p then choosing $c = n^{-1}$ does give an unbiased estimate $\hat p = X/n$, and T = X/n achieves the CRLB, so it is UMVU.

Different tactic: Suppose T(X) is some unbiased function of X. Then we have $E_p(T(X) - X/n) \equiv 0$ because $\hat p = X/n$ is also unbiased. If $h(k) = T(k) - k/n$ then
$$E_p(h(X)) = \sum_{k=0}^{n} h(k)\binom{n}{k}p^k(1-p)^{n-k} \equiv 0.$$

The LHS of the ≡ sign is a polynomial function of p, as is the RHS. Thus if the left hand side is expanded out, the coefficient of each power $p^k$ is 0. The constant term occurs only in the k = 0 term; its coefficient is $h(0)\binom{n}{0} = h(0)$. Thus h(0) = 0. Now $p^1 = p$ occurs only in the k = 1 term, with coefficient nh(1), so h(1) = 0. Since the terms with k = 0 or 1 are 0, the quantity $p^2$ occurs only in the k = 2 term; its coefficient is $n(n-1)h(2)/2$, so h(2) = 0. Continue to see that h(k) = 0 for each k. So the only unbiased function of X is X/n.
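
The coefficient-matching argument can also be phrased as a small linear-algebra fact (a sketch, with n = 5 assumed): the (n+1) by (n+1) matrix of binomial probabilities evaluated at n+1 distinct values of p is nonsingular, so $E_p(h(X)) = 0$ at those p values already forces h ≡ 0.

```python
# Completeness of X ~ Binomial(n, p) as a linear-algebra statement:
# the matrix M[j, k] = C(n, k) p_j^k (1 - p_j)^(n - k) over n+1 distinct
# values p_j is nonsingular, so E_p h(X) = 0 for all p forces h = 0.
import numpy as np
from scipy.stats import binom

n = 5
ps = np.linspace(0.1, 0.9, n + 1)                  # n+1 distinct values of p
M = np.array([binom.pmf(np.arange(n + 1), n, p) for p in ps])

print("rank of M:", np.linalg.matrix_rank(M))      # n + 1 = 6
# Since M is nonsingular, the only h with M @ h = 0 is h = 0: completeness.
```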

Completeness

In the Binomial(n, p) example only one function of X is unbiased for p. Rao-Blackwell shows that the UMVUE, if it exists, will be a function of any sufficient statistic. Q: Can there be more than one unbiased function of the sufficient statistic? A: Yes in general, but no for some models, like the binomial.

Definition: A statistic T is complete for a model $\{P_\theta;\ \theta\in\Theta\}$ if
$$E_\theta(h(T)) = 0 \text{ for all } \theta \quad\text{implies}\quad h(T) = 0.$$

We have already seen that X is complete in the Binomial(n, p) model. In the N(µ, 1) model suppose $E_\mu(h(\bar X)) \equiv 0$. Since $\bar X$ has a N(µ, 1/n) distribution we find that
$$E(h(\bar X)) = \frac{\sqrt n\,e^{-n\mu^2/2}}{\sqrt{2\pi}}\int h(x)\,e^{-nx^2/2}\,e^{n\mu x}\,dx,$$
so that
$$\int h(x)\,e^{-nx^2/2}\,e^{n\mu x}\,dx \equiv 0.$$
This is called the Laplace transform of $h(x)e^{-nx^2/2}$. Theorem: a Laplace transform is 0 if and only if the function is 0 (because you can invert the transform). Hence $h \equiv 0$.

How to Prove Completeness

There is only one general tactic: suppose X has density
$$f(x,\theta) = h(x)\exp\Big\{\sum_{i=1}^{p} a_i(\theta)S_i(x) + c(\theta)\Big\}.$$
If the range of the function $(a_1(\theta),\ldots,a_p(\theta))$, as θ varies over Θ, contains a (hyper-)rectangle in $R^p$, then the statistic $(S_1(X),\ldots,S_p(X))$ is complete and sufficient. You prove the sufficiency by the factorization criterion and the completeness using the properties of Laplace transforms and the fact that the joint density of $S_1,\ldots,S_p$ has the form
$$g(s_1,\ldots,s_p;\theta) = h^*(s)\exp\Big\{\sum a_k(\theta)s_k + c^*(\theta)\Big\}.$$

Example: The N(µ, σ²) model density has the form
$$\frac{1}{\sqrt{2\pi}}\exp\Big\{\Big(-\frac{1}{2\sigma^2}\Big)x^2 + \Big(\frac{\mu}{\sigma^2}\Big)x - \frac{\mu^2}{2\sigma^2} - \log\sigma\Big\},$$
which is an exponential family with
$$h(x) = \frac{1}{\sqrt{2\pi}},\quad a_1(\theta) = -\frac{1}{2\sigma^2},\quad S_1(x) = x^2,\quad a_2(\theta) = \frac{\mu}{\sigma^2},\quad S_2(x) = x,\quad c(\theta) = -\frac{\mu^2}{2\sigma^2} - \log\sigma.$$
It follows that $\big(\sum X_i^2, \sum X_i\big)$ is a complete sufficient statistic.

Remark: The statistic $(s^2, \bar X)$ is a one to one function of $\big(\sum X_i^2, \sum X_i\big)$, so it must be complete and sufficient too. Any function of the latter statistic can be rewritten as a function of the former and vice versa. FACT: A complete sufficient statistic is also minimal sufficient.

The Lehmann-Scheffé Theorem

Theorem: If S is a complete sufficient statistic for some model and h(S) is an unbiased estimate of some parameter φ(θ), then h(S) is the UMVUE of φ(θ).

Proof: Suppose T is another unbiased estimate of φ. According to Rao-Blackwell, T is improved by E(T | S), so if h(S) is not the UMVUE then there must exist another function $h^*(S)$ which is unbiased and whose variance is smaller than that of h(S) for some value of θ. But $E_\theta(h^*(S) - h(S)) \equiv 0$, so by completeness $h^*(S) = h(S)$, a contradiction.

Example: In the N(µ, σ²) example the random variable $(n-1)s^2/\sigma^2$ has a $\chi^2_{n-1}$ distribution. It follows that
$$E\left[\frac{\sqrt{n-1}\,s}{\sigma}\right] = \int_0^\infty x^{1/2}\,\frac{(x/2)^{(n-1)/2-1}e^{-x/2}}{2\Gamma((n-1)/2)}\,dx.$$
Make the substitution y = x/2 and get
$$E(s) = \frac{\sigma\sqrt 2}{\sqrt{n-1}\,\Gamma((n-1)/2)}\int_0^\infty y^{n/2-1}e^{-y}\,dy.$$
Hence
$$E(s) = \frac{\sigma\sqrt 2\,\Gamma(n/2)}{\sqrt{n-1}\,\Gamma((n-1)/2)}.$$
The UMVUE of σ is then
$$\frac{\sqrt{n-1}\,\Gamma((n-1)/2)}{\sqrt 2\,\Gamma(n/2)}\,s$$
by the Lehmann-Scheffé theorem.
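
A simulation check of this constant (a sketch with assumed values n = 6 and σ = 3): multiplying s by $\sqrt{n-1}\,\Gamma((n-1)/2)/(\sqrt 2\,\Gamma(n/2))$ makes its average match σ.

```python
# Check that c_n * s is unbiased for sigma, with
# c_n = sqrt(n-1) * Gamma((n-1)/2) / (sqrt(2) * Gamma(n/2)).
import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(6)
n, sigma, reps = 6, 3.0, 500_000
s = rng.normal(0.0, sigma, size=(reps, n)).std(axis=1, ddof=1)

c_n = np.sqrt(n - 1) * gamma((n - 1) / 2) / (np.sqrt(2) * gamma(n / 2))
print("E(s)      :", s.mean())          # below sigma: s itself is biased
print("E(c_n * s):", c_n * s.mean())    # approx sigma = 3
print("sigma     :", sigma)
```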

Criticism of Unbiasedness

A UMVUE can be inadmissible for squared error loss, meaning there is a (biased, of course) estimate whose MSE is smaller for every parameter value. An example is the UMVUE of $\phi = p(1-p)$, which is $\hat\phi = n\hat p(1-\hat p)/(n-1)$. The MSE of $\tilde\phi = \min(\hat\phi, 1/4)$ is smaller than that of $\hat\phi$.

Unbiased estimation may be impossible. For Binomial(n, p) the log odds is $\phi = \log(p/(1-p))$. Since the expectation of any function of the data is a polynomial function of p, and since φ is not a polynomial function of p, there is no unbiased estimate of φ.
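
An exact computation over the Binomial(n, p) pmf (a sketch with n = 10 assumed) shows the truncated estimator $\min(\hat\phi, 1/4)$ has MSE no larger than $\hat\phi$ at every p, and strictly smaller wherever $P(\hat\phi > 1/4) > 0$:

```python
# Exact MSE comparison: UMVUE phi_hat = n p_hat (1 - p_hat)/(n-1) versus
# the truncated estimator min(phi_hat, 1/4), for phi = p (1 - p).
import numpy as np
from scipy.stats import binom

n = 10
x = np.arange(n + 1)
p_hat = x / n
phi_hat = n * p_hat * (1 - p_hat) / (n - 1)
phi_trunc = np.minimum(phi_hat, 0.25)

for p in (0.2, 0.5, 0.8):
    pmf = binom.pmf(x, n, p)
    phi = p * (1 - p)
    mse_umvue = np.sum(pmf * (phi_hat - phi) ** 2)
    mse_trunc = np.sum(pmf * (phi_trunc - phi) ** 2)
    print(f"p={p}: MSE(UMVUE)={mse_umvue:.5f}  MSE(truncated)={mse_trunc:.5f}")
```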

The UMVUE of σ is not the square root of the UMVUE of σ². This method of estimation does not have the parameterization equivariance that maximum likelihood does. Unbiasedness is largely irrelevant (unless you average together many estimators): the property is an average over possible values of the estimate, in which positive errors are allowed to cancel negative errors. Exception to the criticism: if you average a number of estimators to get a single estimator, then it is a problem if all the estimators have the same bias. See assignment 5, the one-way layout example: the MLE of the residual variance averages together many biased estimates and so is very badly biased. That assignment shows that the solution is not really to insist on unbiasedness but to consider an alternative to averaging for putting the individual estimates together.