
3. Maximum likelihood estimators and efficiency

3.1. Maximum likelihood estimators. Let $X_1, \dots, X_n$ be a random sample, drawn from a distribution $P_\theta$ that depends on an unknown parameter $\theta$. We are looking for a general method to produce a statistic $T = T(X_1, \dots, X_n)$ that (we hope) will be a reasonable estimator for $\theta$. One possible answer is the maximum likelihood method. Suppose I observed the values $x_1, \dots, x_n$. Before the experiment, the probability that exactly these values would occur was $P_\theta(X_1 = x_1, \dots, X_n = x_n)$, and this will depend on $\theta$. Since I did observe these values, maybe it's a good idea to look for a $\theta$ that maximizes this probability (which, to impress the uninitiated, we now call likelihood). Please do not confuse this maximization with the futile attempt to find the $\theta$ that is now most likely, given what I just observed. I really maximize over the condition: given that $\theta$ has some concrete value, we can work out the probability that what I observed occurred, and this is what I maximize.

Exercise 3.1. Please elaborate. Can you also make it plausible that there are (artificial) examples where the MLE is in fact quite likely to produce an estimate that is hopelessly off target?

Definition 3.1. We call a statistic $\hat\theta = \hat\theta(X_1, \dots, X_n)$ a maximum likelihood estimator (MLE) for $\theta$ if $P_\theta(X_1 = x_1, \dots, X_n = x_n)$ is maximal at $\theta = \hat\theta(x_1, \dots, x_n)$.

There is, in general, no guarantee that this maximum exists or (if it does) is unique, but we'll ignore this potential problem and just hope for the best. Also, observe that if we take the definition apart very carefully, we discover a certain amount of juggling around with arguments of functions: the MLE $\hat\theta$ is a statistic, that is, a random variable that is a function of the random sample, but the maximizing value of the parameter is obtained by replacing the $X_j$ by their observed values $x_j$. Alternatively, we could say that we consider the likelihood function $L(x_1, \dots, x_n) = P(X_1 = x_1, \dots, X_n = x_n)$, then plug the random variables $X_j$ into their own likelihood function and finally maximize, which then produces a maximizer that is a random variable itself (and in fact a statistic). None of this matters a whole lot right now; we'll encounter this curious procedure (plug random variables into functions obtained from their own distribution) again in the next section.

Example 3.1. Let's return to the coin flip example: $P(X_1 = 1) = \theta$, $P(X_1 = 0) = 1 - \theta$, and here it's convenient to combine this into one formula by writing $P(X_1 = x) = \theta^x (1-\theta)^{1-x}$, for $x = 0, 1$.

Thus

$P(X_1 = x_1, \dots, X_n = x_n) = \theta^{\sum x_j} (1-\theta)^{n - \sum x_j}.$

We are looking for the $\theta$ that maximizes this expression. Take the $\theta$ derivative and set this equal to zero. Also, let's abbreviate $S = \sum x_j$.

$S\theta^{S-1}(1-\theta)^{n-S} - (n-S)\theta^S (1-\theta)^{n-S-1} = 0, \quad \text{or} \quad S(1-\theta) - (n-S)\theta = 0,$

and this has the solution $\theta = S/n$. (We'd now have to check that this is indeed a maximum, but we skip this part.) So the MLE for this distribution is given by $\hat\theta = T = \overline{X}$. It is reassuring that this obvious choice now receives some theoretical justification. We know that this estimator is unbiased. In general, however, MLEs can be biased. To see this, let's return to another example that was discussed earlier.

Example 3.2. Consider again the urn with an unknown number $N = \theta$ of balls in it, labeled $1, \dots, N$. We form a random sample $X_1, \dots, X_n$ by drawing $n$ times, with replacement, according to the distribution $P(X_1 = x) = (1/N)\chi_{\{1,\dots,N\}}(x)$. For fixed $x_1, \dots, x_n$, the probability of observing this outcome is then given by

(3.1) $P(X_1 = x_1, \dots, X_n = x_n) = \begin{cases} N^{-n} & \max x_j \le N \\ 0 & \max x_j > N \end{cases}.$

We want to find the MLE, so we are trying to maximize this over $N$, for fixed $x_1, \dots, x_n$. Clearly, entering the second line of (3.1) is no good, so we must take $N \ge \max x_j$. For any such $N$, the quantity we're trying to maximize equals $N^{-n}$, so we get the largest possible value by taking the smallest $N$ that is still allowed. In other words, the MLE is given by $\hat N = \max X_j$. We know that this estimator is not unbiased. Again, it is nice to see some theoretical justification emerging for an estimator that looked reasonable.
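The argument in Example 3.2 is easy to check numerically. Here is a minimal Python sketch (the values $N = 50$, $n = 10$, the candidate range, and the use of numpy are my own arbitrary choices, not part of the notes): it evaluates the likelihood (3.1) on a grid of candidate values of $N$, confirms that the maximizer is $\max x_j$, and illustrates the bias of this MLE by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_true, n = 50, 10
x = rng.integers(1, N_true + 1, size=n)        # draws from {1, ..., N_true}, with replacement

# likelihood (3.1) as a function of the candidate value N
def likelihood(N, x):
    return float(N) ** (-len(x)) if x.max() <= N else 0.0

candidates = np.arange(1, 200)
N_hat = candidates[np.argmax([likelihood(N, x) for N in candidates])]
print(N_hat, x.max())                          # the maximizer is max(x_1, ..., x_n)

# the MLE is biased: on average it underestimates N_true
mles = rng.integers(1, N_true + 1, size=(200_000, n)).max(axis=1)
print(mles.mean(), N_true)                     # noticeably smaller than N_true
```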

Example 3.3. Recall that the Poisson distribution with parameter $\theta > 0$ is given by

$P(X = x) = \dfrac{\theta^x}{x!}\, e^{-\theta} \qquad (x = 0, 1, 2, \dots).$

Let's try to find the MLE for $\theta$. A random sample drawn from this distribution has the likelihood function

$P(X_1 = x_1, \dots, X_n = x_n) = \dfrac{\theta^{x_1 + \dots + x_n}}{x_1! \cdots x_n!}\, e^{-n\theta}.$

We want to maximize this with respect to $\theta$, so we can ignore the denominator, which does not depend on $\theta$. Let's again write $S = \sum x_j$; we then want to maximize $\theta^S e^{-n\theta}$. This leads to

$S\theta^{S-1} e^{-n\theta} - n\theta^S e^{-n\theta} = 0$

or $\theta = S/n$, that is, $\hat\theta = \overline{X}$.

Exercise 3.2. Show that $EX = \theta$ if $X$ is Poisson distributed with parameter $\theta$. Conclude that the MLE is unbiased.

For random samples drawn from continuous distributions, the above recipe cannot literally be applied because $P(X_1 = x_1, \dots, X_n = x_n) = 0$ always in this situation. However, we can modify it as follows: call a statistic $\hat\theta$ an MLE for $\theta$ if $\hat\theta(x_1, \dots, x_n)$ maximizes the (joint) density

$f_{X_1, \dots, X_n}(x_1, \dots, x_n; \theta) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta),$

for all possible values $x_j$ of the random sample. In analogy to our terminology in the discrete case, we will again refer to this product of the densities as the likelihood function.

Example 3.4. Consider the exponential distribution with parameter $\theta$; this is the distribution with density

(3.2) $f(x) = \dfrac{1}{\theta}\, e^{-x/\theta} \qquad (x \ge 0),$

and $f(x) = 0$ for $x < 0$. Let's first find $EX$ for an exponentially distributed random variable $X$:

$EX = \dfrac{1}{\theta} \displaystyle\int_0^\infty x e^{-x/\theta}\, dx = -x e^{-x/\theta}\Big|_0^\infty + \int_0^\infty e^{-x/\theta}\, dx = \theta,$

by an integration by parts in the first step. (So it is natural to use $\theta$ as the parameter, rather than $1/\theta$.) To find the MLE for $\theta$, we have to maximize $\theta^{-n} e^{-S/\theta}$ (writing, as usual, $S = \sum x_j$). This gives

$-n\theta^{-n-1} e^{-S/\theta} + \dfrac{S}{\theta^2}\, \theta^{-n} e^{-S/\theta} = 0$

or $\theta = S/n$, that is, as a statistic, $\hat\theta = \overline{X}$ (again...). This MLE is unbiased.

What would have happened if we had used $\eta = 1/\theta$ in (3.2) instead, to avoid the reciprocals? So $f(x) = \eta e^{-\eta x}$ for $x \ge 0$, and I now want to find the MLE $\hat\eta$ for $\eta$. In other words, I want to maximize $\eta^n e^{-\eta S}$, and proceeding as above, we find that this happens at $\eta = n/S$, or $\hat\eta = 1/\overline{X}$.
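Numerically, the two maximization problems indeed return reciprocal answers. The following Python sketch is illustrative only (the sample size, the true value $\theta = 2$, and the use of scipy's bounded scalar minimizer are my own choices): it minimizes the negative log-likelihood once in the $\theta$ parametrization and once in the $\eta$ parametrization.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)       # true theta = 2, so true eta = 0.5
n, S = len(x), x.sum()

# negative log-likelihoods: f = (1/theta) e^{-x/theta}  and  f = eta e^{-eta x}
nll_theta = lambda th: n * np.log(th) + S / th
nll_eta = lambda et: -n * np.log(et) + et * S

theta_hat = minimize_scalar(nll_theta, bounds=(1e-6, 100.0), method="bounded").x
eta_hat = minimize_scalar(nll_eta, bounds=(1e-6, 100.0), method="bounded").x

print(theta_hat, x.mean())        # theta_hat is (numerically) the sample mean
print(eta_hat, 1 / theta_hat)     # eta_hat is (numerically) 1 / theta_hat
```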

Now recall that $\eta = 1/\theta$, and the MLE for $\theta$ was $\hat\theta = \overline{X}$. This is no coincidence; essentially, we solved the same maximization problem twice, with slightly changed notation the second time. In general, we have the following (almost tautological) statement:

Theorem 3.2. Consider parameters $\eta, \theta$ that parametrize the same distribution. Suppose that they are related by $\eta = g(\theta)$, for a bijective $g$. Then, if $\hat\theta$ is an MLE for $\theta$, then $\hat\eta = g(\hat\theta)$ is an MLE for $\eta$.

Exercise 3.3. Give a somewhat more explicit version of the argument suggested above.

Notice, however, that the MLE is no longer unbiased after the transformation. This could be checked rather quickly by an indirect argument, but it is also possible to work things out explicitly. To get this started, let's first look at the distribution of the sum $S_2 = X_1 + X_2$ of two independent exponentially distributed random variables $X_1, X_2$. We know that the density of $S_2$ is the convolution of the density from (3.2) with itself:

$f_2(x) = \dfrac{1}{\theta^2} \displaystyle\int_0^x e^{-t/\theta}\, e^{-(x-t)/\theta}\, dt = \dfrac{1}{\theta^2}\, x e^{-x/\theta}.$

Next, if we add one more independent random variable with this distribution, that is, if we consider $S_3 = S_2 + X_3$, then the density of $S_3$ can be obtained as the convolution of $f_2$ with the density $f$ from (3.2), so

$f_3(x) = \dfrac{1}{\theta^3} \displaystyle\int_0^x t e^{-t/\theta}\, e^{-(x-t)/\theta}\, dt = \dfrac{1}{2\theta^3}\, x^2 e^{-x/\theta}.$

Continuing in this style, we find that

$f_n(x) = \dfrac{1}{(n-1)!\, \theta^n}\, x^{n-1} e^{-x/\theta}.$

Exercise 3.4. Denote the density of $S = S_n$ by $f_n$. Show that then $S/n$ has density $f(x) = n f_n(nx)$.

Since $\overline{X} = S/n$, the Exercise in particular says that $\overline{X}$ has density

(3.3) $f(x) = \dfrac{n}{(n-1)!\, \theta^n}\, (nx)^{n-1} e^{-nx/\theta} \qquad (x \ge 0).$
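As a quick sanity check on the convolution argument, one can compare (3.3) with a histogram of simulated sample means. A small Python sketch (the choices $\theta = 2$, $n = 5$, and the number of repetitions are arbitrary):

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(2)
theta, n = 2.0, 5

# many realizations of the sample mean of n exponential(theta) observations
xbar = rng.exponential(theta, size=(200_000, n)).mean(axis=1)

def density_3_3(x):
    # the density (3.3) of the sample mean
    return n / (factorial(n - 1) * theta**n) * (n * x) ** (n - 1) * np.exp(-n * x / theta)

hist, edges = np.histogram(xbar, bins=80, range=(0.0, 8.0), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
for i in (10, 20, 40, 60):
    print(f"x = {mids[i]:.2f}: simulated {hist[i]:.4f}, formula {density_3_3(mids[i]):.4f}")
```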

This is already quite interesting, but let's keep going. We were originally interested in $Y = 1/\overline{X}$, the MLE for $\eta = 1/\theta$. We apply the usual technique to transform the densities:

$P(Y \le y) = P(\overline{X} \ge 1/y) = \displaystyle\int_{1/y}^\infty f(x)\, dx,$

and since the density $g = f_Y$ can be obtained as the $y$ derivative of this, we see that

(3.4) $g(y) = \dfrac{1}{y^2}\, f(1/y) = \dfrac{n}{(n-1)!\, \theta^n}\, y^{-2} (n/y)^{n-1} e^{-n/(\theta y)} \qquad (y > 0).$

This gives

$EY = \displaystyle\int_0^\infty y g(y)\, dy = \dfrac{n}{(n-1)!\, \theta^n} \int_0^\infty y^{-1} (n/y)^{n-1} e^{-n/(\theta y)}\, dy = \dfrac{n}{(n-1)!\, \theta} \int_0^\infty t^{n-2} e^{-t}\, dt.$

We have used the substitution $t = n/(\theta y)$ to pass to the last expression. The integral can be evaluated by repeated integration by parts, or, somewhat more elegantly, you recognize it as $\Gamma(n-1) = (n-2)!$. So, putting things together, it follows that

$E(1/\overline{X}) = \dfrac{n}{(n-1)\theta} = \dfrac{n}{n-1}\, \eta.$

In particular, $Y = 1/\overline{X}$ is not an unbiased estimator for $\eta$; we are off by the factor $n/(n-1) > 1$ (which, however, is very close to $1$ for large $n$).

Exercise 3.5. Check one more time that $\overline{X}$ is an unbiased estimator for $\theta$, this time by making use of the density $f$ from (3.3) to compute $E\overline{X}$ (in an admittedly rather clumsy way). You can again use the fact that $\Gamma(k) = (k-1)!$ for $k = 1, 2, \dots$.
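The factor $n/(n-1)$ is easy to see in a simulation. A minimal Python sketch (the sample size $n = 10$ and $\theta = 2$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 2.0, 10, 200_000
eta = 1 / theta

# Y = 1/Xbar for many independent samples of size n
y = 1.0 / rng.exponential(theta, size=(reps, n)).mean(axis=1)

print(y.mean())                  # roughly n/(n-1) * eta, i.e. about 0.556 here
print(n / (n - 1) * eta)         # 0.5555...
```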

Example 3.5. Consider the uniform distribution on $[0, \theta]$:

$f(x) = \begin{cases} 1/\theta & 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}$

We would like to find the MLE for $\theta$. We then need to maximize with respect to $\theta$ (for given $x_1, \dots, x_n$) the likelihood function

$f(x_1) \cdots f(x_n) = \begin{cases} \theta^{-n} & \max x_j \le \theta \\ 0 & \max x_j > \theta \end{cases}.$

This first of all forces us to take $\theta \ge \max x_j$, to enter the first line, and then $\theta$ as small as (still) possible, to maximize $\theta^{-n}$. Thus $\hat\theta = \max(X_1, \dots, X_n)$. This estimator is not unbiased.

Exercise 3.6. Why?

This whole example is an exact (continuous) analog of its discrete version, Example 3.2.

Example 3.6. Finally, let's take a look at the normal distribution. Let's first find the MLE for $\theta = \sigma^2$, for a normal distribution with known $\mu$. We then need to maximize

$\theta^{-n/2} e^{-A/\theta}, \qquad A = \dfrac{1}{2} \sum (x_j - \mu)^2.$

This gives $-(n/2)/\theta + A/\theta^2 = 0$ or $\theta = 2A/n$, that is,

(3.5) $\hat\theta = \dfrac{1}{n} \sum (X_j - \mu)^2.$

Exercise 3.7. (a) Show that $n\hat\theta/\sigma^2 \sim \chi^2(n)$. (b) Conclude that $\hat\theta$ is unbiased.

By Theorem 3.2, the MLE for $\sigma$ is then given by

$\hat\sigma = \sqrt{\dfrac{1}{n} \sum (X_j - \mu)^2}.$

This estimator is not unbiased.

What if $\mu$ and $\sigma$ are both unknown? There is an obvious way to adapt our procedure: we can maximize over both parameters simultaneously to obtain two statistics that can serve as MLE style estimators. So we now want to maximize

$\theta^{-n/2} \exp\left( -\dfrac{1}{2\theta} \sum (x_j - \mu)^2 \right)$

over both $\mu$ and $\theta$. This leads to the two conditions

$-\dfrac{n}{2\theta} + \dfrac{1}{2\theta^2} \sum (x_j - \mu)^2 = 0, \qquad \sum (x_j - \mu) = 0.$

The second equation says that $\mu = (1/n) \sum x_j =: \overline{x}$, and then, by repeating the calculation from above, we see from this and the first equation that $\theta = (1/n) \sum (x_j - \overline{x})^2$. In other words,

$\hat\mu = \overline{X}, \qquad \hat\theta = \dfrac{1}{n} \sum (X_j - \overline{X})^2 = \dfrac{n-1}{n}\, S^2.$

So $\hat\mu$ is unbiased, but $\hat\theta$ is not, since $ES^2 = \sigma^2 = \theta$, so $E\hat\theta = \dfrac{n-1}{n}\, \theta$.
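This last bias, too, is easy to see numerically. A minimal Python sketch (with $\mu = 1$, $\sigma^2 = 4$, $n = 8$ picked arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, n, reps = 1.0, 4.0, 8, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
mu_hat = x.mean(axis=1)
theta_hat = ((x - mu_hat[:, None]) ** 2).mean(axis=1)   # (1/n) sum (X_j - Xbar)^2

print(mu_hat.mean(), mu)                           # mu_hat is unbiased
print(theta_hat.mean(), (n - 1) / n * sigma2)      # E theta_hat = ((n-1)/n) sigma^2 = 3.5, not 4
```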

Exercise 3.8. Find the MLE $\hat\theta$ for $\theta$ for the following densities: (a) $f(x) = \theta x^{\theta - 1}$ for $0 < x < 1$, and $f(x) = 0$ otherwise, and $\theta > 0$; (b) $f(x) = e^{\theta - x}$ for $x \ge \theta$ and $f(x) = 0$ otherwise.

Exercise 3.9. Here's an example where the maximization does not produce a unique value. Consider the density $f(x) = (1/2) e^{-|x - \theta|}$. Assume for convenience that $n = 2k$ is even and consider data $x_1 < x_2 < \dots < x_n$. Then show that any $\theta$ in the interval $x_k < \theta < x_{k+1}$ maximizes the likelihood function.

Exercise 3.10. (a) Show that $f(x, \theta) = \dfrac{1}{\theta^2}\, x e^{-x/\theta}$ ($x \ge 0$) (and $f(x) = 0$ for $x < 0$) is a density for $\theta > 0$. (b) Find the MLE $\hat\theta$ for $\theta$. (c) Show that $\hat\theta$ is unbiased.

3.2. Cramér-Rao bounds. If an estimator is unbiased, it delivers the correct value at least on average. It would then be nice if this estimator showed only little variation about this correct value (of course, if $T$ is biased, it is less clear if little variation about the incorrect value is a good thing). Let's take another look at our favorite example from this point of view. So $P(X_1 = 1) = \theta$, $P(X_1 = 0) = 1 - \theta$, and we are going to use the MLE $T = \hat\theta = \overline{X}$. Since the $X_j$ are independent, the variances add up and thus

$\mathrm{Var}(T) = \dfrac{1}{n}\, \mathrm{Var}(X_1) = \dfrac{\theta(1-\theta)}{n}$

and $\sigma_T = \sqrt{\theta(1-\theta)/n} \le 1/(2\sqrt{n})$. This doesn't look too bad. In particular, for large random samples, it gets small; it decays at the rate $\sigma_T \sim 1/\sqrt{n}$. Could we perhaps do better than this with a different unbiased estimator? It turns out that this is not the case. The statistic $T = \overline{X}$ is optimal in this example in the sense that it has the smallest possible variance among all unbiased estimators. We now derive such a result in a general setting.

Let $f(x, \theta)$ be a density that depends on the parameter $\theta$. We will assume throughout this section that $f$ is sufficiently well behaved so that the following manipulations are justified, without actually making explicit a precise version of such assumptions. We will certainly need $f$ to be twice differentiable with respect to $\theta$ since we will take this second derivative, but this on its own is not sufficient to justify some of the other steps (such as differentiating under the integral sign). We have that $\int f\, dx = 1$, so by taking the $\theta$ derivative (and interchanging differentiation and integral), we obtain that $\int \partial f/\partial\theta\, dx = 0$.

This we may rewrite as

(3.6) $\displaystyle\int f(x, \theta)\, \frac{\partial}{\partial\theta} \ln f(x, \theta)\, dx = 0.$

There are potential problems here with regions where $f = 0$; to avoid these, I will simply interpret (3.6) as an integral over only those parts of the real line where $f > 0$. (To make sure that the argument leading to (3.6) is still justified in this setting, we should really make the additional assumption that $\{x : f(x, \theta) > 0\}$ does not depend on $\theta$, but we'll ignore purely technical points of this kind.) An alternative reading of (3.6) is

$E\, \dfrac{\partial}{\partial\theta} \ln f(X, \theta) = 0.$

Here (and below) I use the general fact that $Eg(X) = \int g(x) f(x)\, dx$ for any function $g$. Also note the somewhat curious construction here: we plug the random variable $X$ into its own density (and then take the logarithm) to produce the new random variable $\ln f(X)$ (which also depends on $\theta$). If we take one more derivative, then (3.6) becomes

(3.7) $\displaystyle\int f(x, \theta)\, \frac{\partial^2}{\partial\theta^2} \ln f(x, \theta)\, dx + \int f(x, \theta) \left( \frac{\partial}{\partial\theta} \ln f(x, \theta) \right)^2 dx = 0.$

Definition 3.3. The Fisher information is defined as

$I(\theta) = E \left( \dfrac{\partial}{\partial\theta} \ln f(X, \theta) \right)^2.$

This assumes that $X$ is a continuous random variable; in the discrete case, we replace $f$ by $P(X = x, \theta)$ (and again plug $X$ into its own distribution). From (3.7), we obtain the alternative formula

(3.8) $I(\theta) = -E\, \dfrac{\partial^2}{\partial\theta^2} \ln f(X, \theta);$

moreover, it is also true that

(3.9) $I(\theta) = \mathrm{Var}\left( \dfrac{\partial}{\partial\theta} \ln f(X, \theta) \right).$

Example 3.7. Let's return one more time to the coin flip example: $P(X = x) = \theta^x (1-\theta)^{1-x}$ ($x = 0, 1$), so $\ln P = x \ln\theta + (1-x) \ln(1-\theta)$ and

(3.10) $\dfrac{\partial}{\partial\theta} \ln P = \dfrac{x}{\theta} - \dfrac{1-x}{1-\theta}.$

To find the Fisher information, we plug $X$ into this function and take the square. This produces

$\dfrac{X^2}{\theta^2} + \dfrac{(1-X)^2}{(1-\theta)^2} - \dfrac{2X(1-X)}{\theta(1-\theta)} = X^2 \left( \dfrac{1}{\theta^2} + \dfrac{1}{(1-\theta)^2} + \dfrac{2}{\theta(1-\theta)} \right) - 2X \left( \dfrac{1}{(1-\theta)^2} + \dfrac{1}{\theta(1-\theta)} \right) + \dfrac{1}{(1-\theta)^2}.$

Now recall that $EX = EX^2 = \theta$, and take the expectation. We find that

$I(\theta) = \dfrac{\theta(1-\theta)^2 + \theta^3 + 2\theta^2(1-\theta) - 2\theta^3 - 2\theta^2(1-\theta) + \theta^2}{\theta^2 (1-\theta)^2} = \dfrac{1}{\theta(1-\theta)}.$

Alternatively, we could have obtained the same result more quickly from (3.8). Take one more derivative in (3.10), plug $X$ into the resulting function and take the expectation:

$I(\theta) = E\left( \dfrac{X}{\theta^2} + \dfrac{1-X}{(1-\theta)^2} \right) = \dfrac{1}{\theta} + \dfrac{1}{1-\theta} = \dfrac{1}{\theta(1-\theta)}.$

Example 3.8. Consider the $N(\theta, 1)$ distribution. Its density is given by $f = (2\pi)^{-1/2} e^{-(x-\theta)^2/2}$, so $\ln f = -(x-\theta)^2/2 + C$. Two differentiations produce $(\partial^2/\partial\theta^2) \ln f = -1$, so $I = 1$.

When dealing with a random sample $X_1, \dots, X_n$, Definition 3.3 can be adapted by replacing $f$ by what we called the likelihood function in the previous section. More precisely, we could replace (3.9) with

$\mathrm{Var}\left( \dfrac{\partial}{\partial\theta} \ln L(X_1, \dots, X_n; \theta) \right),$

where $L(x_1, \dots, x_n) = f(x_1) \cdots f(x_n)$ (continuous case) or $L(x_1, \dots, x_n) = P(X_1 = x_1, \dots, X_n = x_n)$ (discrete case). Then, however, we can use the product structure of $L$ and independence to evaluate (in the continuous case, say)

$\mathrm{Var}\left( \sum \dfrac{\partial}{\partial\theta} \ln f(X_j, \theta) \right) = \sum \mathrm{Var}\left( \dfrac{\partial}{\partial\theta} \ln f(X_j, \theta) \right) = n I(\theta),$

where now $I$ is the Fisher information of an individual random variable $X$. An analogous calculation works in the discrete case.
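The identities above lend themselves to a quick numerical check. Here is a Python sketch for the coin flip (the values $\theta = 0.3$, $n = 20$ and the simulation size are arbitrary choices): it estimates $E((\partial/\partial\theta)\ln P)^2$ for a single observation and the variance of the score of a whole sample, and compares them with $1/(\theta(1-\theta))$ and $n/(\theta(1-\theta))$.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, reps = 0.3, 20, 200_000

x = rng.binomial(1, theta, size=(reps, n))
score = x / theta - (1 - x) / (1 - theta)      # (3.10), evaluated at each observation

print((score[:, 0] ** 2).mean(), 1 / (theta * (1 - theta)))    # I(theta) for one observation
print(score.sum(axis=1).var(), n / (theta * (1 - theta)))      # sample score variance = n I(theta)
```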

Theorem 3.4 (Cramér-Rao). Let $T = T(X_1, \dots, X_n)$ be a statistic and write $k(\theta) = ET$. Then, under suitable (smoothness) assumptions,

$\mathrm{Var}(T) \ge \dfrac{(k'(\theta))^2}{n I(\theta)}.$

Corollary 3.5. If the statistic $T$ in Theorem 3.4 is unbiased, then

$\mathrm{Var}(T) \ge \dfrac{1}{n I(\theta)}.$

As an illustration, let's again look at the coin flip example with its MLE $T = \hat\theta = \overline{X}$. We saw earlier that $\mathrm{Var}(T) = \theta(1-\theta)/n$, and this equals $1/(nI)$ by our calculation from Example 3.7. Since $T$ is also unbiased, this means that this estimator achieves the Cramér-Rao bound from Corollary 3.5. We give a special name to estimators that are optimal in this sense:

Definition 3.6. Let $T$ be an unbiased estimator for $\theta$. We call $T$ efficient if $T$ achieves the CR bound: $\mathrm{Var}(T) = \dfrac{1}{n I(\theta)}$.

So we can summarize by saying that $\overline{X}$ is an efficient estimator for $\theta$.

Let's now try to derive the CR bound. I'll do this for continuous random variables, with density $f(x, \theta)$. Then

$k(\theta) = \displaystyle\int dx_1 \int dx_2 \dots \int dx_n\; T(x_1, \dots, x_n)\, f(x_1, \theta) \cdots f(x_n, \theta)$

and thus (at least if we are allowed to freely interchange differentiations and integrals)

$k'(\theta) = \displaystyle\int dx_1 \int dx_2 \dots \int dx_n\; T(x_1, \dots, x_n) \sum_{j=1}^n f(x_1, \theta) \cdots \frac{\partial f(x_j, \theta)}{\partial\theta} \cdots f(x_n, \theta)$

$= \displaystyle\int dx_1 \int dx_2 \dots \int dx_n\; T(x_1, \dots, x_n) \left( \sum_{j=1}^n \frac{\partial}{\partial\theta} \ln f(x_j, \theta) \right) f(x_1, \theta) \cdots f(x_n, \theta) = E\, TZ,$

where we have abbreviated $Z = \sum (\partial/\partial\theta) \ln f(X_j, \theta)$. We know that $EZ = 0$ (compare (3.6)) and $\mathrm{Var}(Z) = nI$, by independence of the $X_j$. We will now need the following very important and fundamental tool:

Exercise 3.11. Establish the Cauchy-Schwarz inequality: For any two random variables $X, Y$, we have that

$|EXY| \le \left( EX^2 \right)^{1/2} \left( EY^2 \right)^{1/2}.$

Suggestion: Consider the parabola $f(t) = E(X + tY)^2$ and find its minimum.

Exercise 3.12. Can you also show that we have equality in the CSI precisely if $X = cY$ or $Y = cX$ for some $c \in \mathbb{R}$?

Exercise 3.13. Define the correlation coefficient of two random variables $X, Y$ as

$\rho_{X,Y} = \dfrac{E(X - EX)(Y - EY)}{\sigma_X \sigma_Y}.$

Deduce from the CSI that $-1 \le \rho \le 1$. Also, show that $\rho = 0$ if $X, Y$ are independent. (The converse of this statement is not true, in general.)

Since $EZ = 0$, we can write

$k'(\theta) = E\, TZ = E(T - ET)Z = E(T - ET)(Z - EZ),$

and now the CSI shows that

$k'^2 \le \mathrm{Var}(T)\, \mathrm{Var}(Z) = n I(\theta)\, \mathrm{Var}(T),$

as claimed.

Exercise 3.14. Observe that the inequality was only introduced in the very last step. Thus, by Exercise 3.12, we have equality in the CR bound precisely if $T - ET$ and $Z$ are multiples of one another. In particular, this must hold for the efficient statistic $T = \overline{X}$ from the coin flip example. Confirm directly that indeed $\overline{X} - \theta = cZ$.

Example 3.9. We saw in Example 3.4 that the MLE for the exponential distribution $f(x) = e^{-x/\theta}/\theta$ ($x \ge 0$) is given by $T = \hat\theta = \overline{X}$ and that $T$ is unbiased. Is $T$ also efficient? To answer this, we compute the Fisher information: $\ln f = -\ln\theta - x/\theta$, so $\partial^2 \ln f/\partial\theta^2 = 1/\theta^2 - 2x/\theta^3$, and, taking expectations, we see from (3.8) that $I = 1/\theta^2$. On the other hand, $\mathrm{Var}(T) = (1/n)\, \mathrm{Var}(X_1)$ and

$EX_1^2 = \dfrac{1}{\theta} \displaystyle\int_0^\infty x^2 e^{-x/\theta}\, dx = \theta^2 \int_0^\infty t^2 e^{-t}\, dt = 2\theta^2,$

by two integrations by parts. This implies that $\mathrm{Var}(X_1) = EX_1^2 - (EX_1)^2 = \theta^2$, and thus $\mathrm{Var}(T) = \theta^2/n = 1/(nI)$, and $T$ is indeed efficient.

Let's now take another look at the uniform distribution from Example 3.5. Its density equals

$f(x, \theta) = \begin{cases} 1/\theta & 0 < x < \theta \\ 0 & \text{otherwise} \end{cases};$

recall that the MLE is given by $\hat\theta = \max(X_1, \dots, X_n)$. We know that $T = \hat\theta$ is not unbiased. Let's try to be more precise here. Since $P(T \le t) = (t/\theta)^n$, the statistic $T$ has density $f(t) = n t^{n-1}/\theta^n$ ($0 < t < \theta$). It follows that

$ET = \dfrac{n}{\theta^n} \displaystyle\int_0^\theta t^n\, dt = \dfrac{n}{n+1}\, \theta.$

Exercise 3.15. Show by a similar calculation that $ET^2 = \dfrac{n}{n+2}\, \theta^2$.

In particular, if we introduce

$U = \dfrac{n+1}{n}\, T = \dfrac{n+1}{n} \max(X_1, \dots, X_n),$

then this new statistic is unbiased (though it is no longer the MLE for $\theta$). By the exercise,

$EU^2 = \left( \dfrac{n+1}{n} \right)^2 ET^2 = \dfrac{(n+1)^2}{n(n+2)}\, \theta^2,$

so

(3.11) $\mathrm{Var}(U) = \left( \dfrac{(n+1)^2}{n(n+2)} - 1 \right) \theta^2 = \dfrac{\theta^2}{n(n+2)}.$

This looks great! In our previous examples, the variance decayed only at the rate $1/n$, and here we now have that $\mathrm{Var}(U) \sim 1/n^2$. Come to think of it, is this consistent with the CR bound? Doesn't Corollary 3.5 say that $\mathrm{Var}(T) \gtrsim 1/n$ for any unbiased statistic $T$? The answer to this is that the whole theory doesn't apply here. The density $f(x, \theta)$ is not continuous (let alone differentiable) as a function of $\theta$; it jumps at $\theta = x$. In fact, the problems can be pinpointed more precisely: (3.6) fails, the integrand equals $-1/\theta^2$, and (3.6) was used to deduce that $EZ = 0$, so the whole argument breaks down. Recall that by our discussion following (3.6), the integration in (3.6) is really only extended over $0 < x < \theta$, so problems with the jump of $f$ are temporarily avoided. (However, I also remarked parenthetically that I would like the set $\{x : f(x, \theta) > 0\}$ to be independent of $\theta$, and this clearly fails here.)
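The $1/n^2$ decay in (3.11) shows up clearly in a simulation. A minimal Python sketch (with $\theta = 5$ and a few sample sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, reps = 5.0, 200_000

for n in (5, 10, 20, 40):
    u = (n + 1) / n * rng.uniform(0, theta, size=(reps, n)).max(axis=1)
    # the mean stays near theta; the variance tracks (3.11) and roughly quarters when n doubles
    print(n, round(u.mean(), 3), round(u.var(), 4), round(theta**2 / (n * (n + 2)), 4))
```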

Let's compare $U$ with another unbiased estimator. Let $V = 2\overline{X}$. Since $E\overline{X} = EX_1 = \theta/2$, this is indeed unbiased. It is a continuous analog of the unbiased estimator that we suggested (not very seriously, though) in the urn example from Chapter 2. We have that $\mathrm{Var}(\overline{X}) = \mathrm{Var}(X_1)/n$ and

$EX_1^2 = \dfrac{1}{\theta} \displaystyle\int_0^\theta t^2\, dt = \dfrac{\theta^2}{3},$

so $\mathrm{Var}(X_1) = \theta^2 (1/3 - 1/4) = \theta^2/12$, thus

$\mathrm{Var}(V) = \dfrac{\theta^2}{3n}.$

This is markedly inferior to (3.11). We right away had a bad feeling about $V$ (in Chapter 2); this now receives precise theoretical confirmation.

Exercise 3.16. However, if $n = 1$, then $\mathrm{Var}(V) = \mathrm{Var}(U)$. Can you explain this?

Exercise 3.17. Consider the density

$f(x, \theta) = \begin{cases} 2x/\theta^2 & 0 \le x \le \theta \\ 0 & \text{otherwise} \end{cases}.$

(a) Find the MLE $\hat\theta$. (b) Show that $T = \dfrac{2n+1}{2n}\, \hat\theta$ is unbiased. (c) Find $\mathrm{Var}(T)$. Suggestion: Proceed as in the discussion above.

Example 3.10. Let's return to the MLE $T = \hat\theta = \overline{X}$ for the Poisson distribution; compare Example 3.3. We saw earlier that this is unbiased. Is $T$ also efficient? To answer this, we first work out the Fisher information: $\ln P(X = x, \theta) = x \ln\theta - \theta - \ln x!$, so by taking two derivatives and then the expectation, we find that $I(\theta) = EX/\theta^2 = 1/\theta$. On the other hand,

$EX_1^2 = \displaystyle\sum_{k=0}^\infty k^2\, \frac{\theta^k}{k!}\, e^{-\theta} = \theta^2 \sum_{k=0}^\infty \frac{\theta^k}{k!}\, e^{-\theta} + EX_1 = \theta^2 + \theta;$

the first step follows by writing $k^2 = k(k-1) + k$. Thus $\mathrm{Var}(X_1) = \theta$, hence $\mathrm{Var}(T) = \theta/n$, and $T$ is efficient.
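As with the coin flip, the efficiency of $\overline{X}$ for the Poisson parameter can be checked by simulation. A short Python sketch (the values $\theta = 3$ and $n = 25$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, reps = 3.0, 25, 200_000

t = rng.poisson(theta, size=(reps, n)).mean(axis=1)   # the MLE Xbar, many times over

print(t.mean(), theta)          # unbiased
print(t.var(), theta / n)       # variance matches the CR bound 1/(n I(theta)) = theta/n
```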

Exercise 3.18. In this problem, you should frequently refer to results and calculations from Example 3.4. Consider the density $f(x, \theta) = \theta e^{-\theta x}$ ($x \ge 0$) and $f(x) = 0$ for $x < 0$. Recall that $T = \dfrac{n-1}{n}\, Y$, $Y = 1/\overline{X}$, is an unbiased estimator for $\theta$. (a) Find the Fisher information $I(\theta)$ for this density. (b) Compute $\mathrm{Var}(T)$; conclude that $T$ is not efficient. (Later we will see that $T$ nevertheless has the smallest possible variance among all unbiased estimators.) Suggestion: Use the density of $Y$ from (3.4) to work out $EY^2$, and then $ET^2$ and $\mathrm{Var}(T)$. Avoid the trap of forgetting that the $\theta$ of the present exercise corresponds to $1/\theta$ in (3.4).

Example 3.11. Let's now try to estimate the variance of an $N(0, \sigma)$ distribution. We take $\theta = \sigma^2$ as the parameter labeling this family of densities. Two unbiased estimators come to mind:

$T_1 = \dfrac{1}{n} \sum X_j^2, \qquad T_2 = S^2 = \dfrac{1}{n-1} \sum \left( X_j - \overline{X} \right)^2.$

We know from Example 3.6 that $T_1$ is the MLE for $\theta$; see (3.5). We start out by computing the Fisher information. We have that $\ln f = -(1/2) \ln\theta - X^2/(2\theta) + C$, so

$I(\theta) = -\dfrac{1}{2\theta^2} + \dfrac{1}{\theta^3}\, EX^2 = \dfrac{1}{2\theta^2}.$

Next, independence gives that $\mathrm{Var}(T_1) = (1/n)\, \mathrm{Var}(X_1^2)$, and this latter variance we compute as $EX_1^4 - (EX_1^2)^2$.

Exercise 3.19. Show that $EX_1^4 = 3\theta^2$. Suggestion: Use integration by parts in the resulting integral.

Since $EX_1^2 = \mathrm{Var}(X_1) = \theta$, this shows that $\mathrm{Var}(X_1^2) = 2\theta^2$ and thus $\mathrm{Var}(T_1) = 2\theta^2/n$. So $T_1$ is efficient.

As for $T_2$, we recall that $(n-1)S^2/\theta \sim \chi^2(n-1)$ and also that this is the distribution of the sum of the squares of $n-1$ iid $N(0, 1)$-distributed random variables. In other words, $(n-1)S^2/\theta$ has the same distribution as $Z = \sum_{j=1}^{n-1} Y_j^2$, with $Y_j$ iid and $Y_j \sim N(0, 1)$. In particular, the variances agree, and $\mathrm{Var}(Z) = (n-1)\, \mathrm{Var}(Y_1^2) = 2(n-1)$, by the calculation we just did. Thus

$\mathrm{Var}(S^2) = \dfrac{\theta^2}{(n-1)^2} \cdot 2(n-1) = \dfrac{2\theta^2}{n-1},$

and this estimator is not efficient (it comes very close though).

If we had used $n$ instead of the slightly unexpected $n-1$ in the denominator of the formula defining $S^2$, then the resulting estimator $Y_3 = \dfrac{n-1}{n}\, S^2$ has variance

(3.12) $\mathrm{Var}(Y_3) = \dfrac{2(n-1)\theta^2}{n^2} = \dfrac{n-1}{n^2} \cdot \dfrac{1}{I(\theta)}.$

This, of course, does not contradict the CR bound from Corollary 3.5: this estimator is not unbiased. On the contrary, everything is in perfect order, we only need to refer to Theorem 3.4, which handles this situation. Since $k(\theta) = EY_3 = (n-1)\theta/n$, we have that $k'^2 = ((n-1)/n)^2$, and the variance from (3.12) is in fact slightly larger (by a factor of $n/(n-1)$) than the lower bound provided by the theorem.

Exercise 3.20. Consider a random sample drawn from an $N(\theta, 1)$ distribution. Show that (the MLE) $\overline{X}$ is an efficient estimator for $\theta$.
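To close, the three variances compared in this example can all be read off from a single simulation. A minimal Python sketch (with $\theta = \sigma^2 = 2$ and $n = 10$ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.0, 10, 200_000            # theta = sigma^2

x = rng.normal(0.0, np.sqrt(theta), size=(reps, n))
t1 = (x**2).mean(axis=1)                     # T_1, uses the known mean 0
s2 = x.var(axis=1, ddof=1)                   # T_2 = S^2, with the n-1 denominator
y3 = x.var(axis=1, ddof=0)                   # Y_3 = ((n-1)/n) S^2

print(t1.var(), 2 * theta**2 / n)            # efficient: 2 theta^2 / n
print(s2.var(), 2 * theta**2 / (n - 1))      # slightly larger than the CR bound
print(y3.var(), 2 * (n - 1) * theta**2 / n**2)   # (3.12); note Y_3 is biased
```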