Maximum Likelihood Estimation


Chapter 7

Maximum Likelihood Estimation

7.1 Consistency

If $X$ is a random variable (or vector) with density or mass function $f_\theta(x)$ that depends on a parameter $\theta$, then the function $f_\theta(X)$, viewed as a function of $\theta$, is called the likelihood function of $\theta$. We often denote this function by $L(\theta)$. Note that $L(\theta) = f_\theta(X)$ is implicitly a function of $X$, but we suppress this fact in the notation. Since repeated references to "the density or mass function" would be awkward, we will use the term "density" to refer to $f_\theta(x)$ throughout this chapter, even if the distribution of $X$ is not continuous. (Allowing noncontinuous distributions to have density functions may be made technically rigorous; however, this is a measure-theoretic topic beyond the scope of this book.)

Let the set of possible values of $\theta$ be the set $\Omega$. If $L(\theta)$ has a maximizer in $\Omega$, say $\hat\theta$, then $\hat\theta$ is called a maximum likelihood estimator, or MLE. Since the logarithm is a strictly increasing function, any maximizer of $L(\theta)$ also maximizes $\ell(\theta) \stackrel{\text{def}}{=} \log L(\theta)$. It is often much easier to maximize $\ell(\theta)$, called the loglikelihood function, than $L(\theta)$.

Example 7.1 Suppose $\Omega = (0, \infty)$ and $X \sim \text{binomial}(n, e^{-\theta})$. Then
$$\ell(\theta) = \log \binom{n}{X} - X\theta + (n - X)\log(1 - e^{-\theta}),$$
so
$$\ell'(\theta) = -X + (n - X)\frac{e^{-\theta}}{1 - e^{-\theta}}.$$
Thus, setting $\ell'(\theta) = 0$ yields $\theta = -\log(X/n)$. It isn't hard to verify that $\ell''(\theta) < 0$, so that $-\log(X/n)$ is in fact a maximizer of $\ell(\theta)$.
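A quick numerical sanity check of Example 7.1 can be sketched in a few lines (the sample values $X = 7$, $n = 10$ and the use of Python are our own illustrative choices, not part of the text): the closed-form root $-\log(X/n)$ should maximize the loglikelihood.

```python
# Sketch: verify the closed-form MLE of Example 7.1 numerically.
# X = 7, n = 10 are arbitrary illustrative sample values.
import math

def loglik(theta, X, n):
    # l(theta) = log C(n, X) - X*theta + (n - X)*log(1 - exp(-theta))
    return (math.log(math.comb(n, X)) - X * theta
            + (n - X) * math.log(1.0 - math.exp(-theta)))

def binomial_mle(X, n):
    # the root of l'(theta) = 0 from Example 7.1
    return -math.log(X / n)

X, n = 7, 10
theta_hat = binomial_mle(X, n)
# theta_hat should beat nearby values of theta (l is strictly concave)
assert all(loglik(theta_hat, X, n) >= loglik(theta_hat + eps, X, n)
           for eps in (-0.01, 0.01))
```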

As the preceding example demonstrates, it is not always the case that an MLE exists, for if $X = 0$ or $X = n$, then $-\log(X/n)$ is not contained in $\Omega$. This is just one of the technical details that we will consider. Ultimately, we will show that the maximum likelihood estimator is, in many cases, asymptotically normal. However, this is not always the case; in fact, it is not even necessarily true that the MLE is consistent, as shown in Problem 7.1.

We begin the discussion of the consistency of the MLE by defining the so-called Kullback-Leibler divergence.

Definition 7.2 If $f_{\theta_0}(x)$ and $f_\theta(x)$ are two densities, the Kullback-Leibler divergence from $f_{\theta_0}$ to $f_\theta$ equals
$$K(f_{\theta_0}, f_\theta) = E_{\theta_0} \log \frac{f_{\theta_0}(X)}{f_\theta(X)}.$$
If $P_{\theta_0}(f_{\theta_0}(X) > 0 \text{ and } f_\theta(X) = 0) > 0$, then $K(f_{\theta_0}, f_\theta)$ is defined to be $\infty$.

The Kullback-Leibler divergence is sometimes called the Kullback-Leibler information number or the relative entropy of $f_\theta$ with respect to $f_{\theta_0}$. Although it is nonnegative, and takes the value zero if and only if $f_\theta(x) = f_{\theta_0}(x)$ except possibly on a set of $x$ values having measure zero, the K-L divergence is not a true distance because $K(f_{\theta_0}, f_\theta)$ is not necessarily the same as $K(f_\theta, f_{\theta_0})$.

We may show that the Kullback-Leibler divergence must be nonnegative by noting that
$$E_{\theta_0} \frac{f_\theta(X)}{f_{\theta_0}(X)} = E_\theta I\{f_{\theta_0}(X) > 0\} \le 1.$$
Therefore, by Jensen's inequality (1.37) and the strict convexity of the function $-\log x$,
$$K(f_{\theta_0}, f_\theta) = E_{\theta_0} \left[-\log \frac{f_\theta(X)}{f_{\theta_0}(X)}\right] \ge -\log E_{\theta_0} \frac{f_\theta(X)}{f_{\theta_0}(X)} \ge 0, \tag{7.1}$$
with equality if and only if $P_{\theta_0}\{f_{\theta_0}(X) = f_\theta(X)\} = 1$. Inequality (7.1) is sometimes called the Shannon-Kolmogorov information inequality.

In the (admittedly somewhat bizarre) case in which the parameter space $\Omega$ contains only finitely many points, the Shannon-Kolmogorov information inequality may be used to prove the consistency of the maximum likelihood estimator.
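The nonnegativity and asymmetry of the Kullback-Leibler divergence can be checked directly for a pair of Bernoulli densities, for which the defining expectation is just a two-term sum (the parameter values below are our own illustration):

```python
# Sketch: the Shannon-Kolmogorov inequality for two Bernoulli densities.
# The divergence is >= 0, is 0 iff the densities agree, and is asymmetric.
import math

def kl_bernoulli(p0, p):
    # K(f_{p0}, f_p) = E_{p0} log[f_{p0}(X) / f_p(X)], X in {0, 1}
    return (p0 * math.log(p0 / p)
            + (1 - p0) * math.log((1 - p0) / (1 - p)))

assert kl_bernoulli(0.2, 0.2) == 0.0          # zero iff densities coincide
assert kl_bernoulli(0.2, 0.5) > 0             # strictly positive otherwise
assert kl_bernoulli(0.2, 0.5) != kl_bernoulli(0.5, 0.2)  # not symmetric
```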
For the proof of the following theorem, note that if $X_1, \ldots, X_n$ are independent and identically distributed with density $f_{\theta_0}(x)$, then the loglikelihood is $\ell(\theta) = \sum_{i=1}^n \log f_\theta(x_i)$.

Theorem 7.3 Suppose $\Omega$ contains finitely many elements and that $X_1, \ldots, X_n$ are independent and identically distributed with density $f_{\theta_0}(x)$. Furthermore, suppose that the model parameter is identifiable, which is to say that different values of $\theta$ lead to different distributions. Then if $\hat\theta_n$ denotes the maximum likelihood estimator, $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Proof: The Weak Law of Large Numbers (Theorem 2.9) implies that
$$\frac{1}{n}\sum_{i=1}^n \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} \stackrel{P}{\to} E_{\theta_0} \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} = -K(f_{\theta_0}, f_\theta) \tag{7.2}$$
for all $\theta \in \Omega$. The limit $-K(f_{\theta_0}, f_\theta)$ is strictly negative for $\theta \ne \theta_0$ by the identifiability of $\theta$. Therefore, since $\theta = \hat\theta_n$ is the maximizer of the left hand side of Equation (7.2),
$$P(\hat\theta_n \ne \theta_0) = P\left(\max_{\theta \ne \theta_0} \frac{1}{n}\sum_{i=1}^n \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} > 0\right) \le \sum_{\theta \ne \theta_0} P\left(\frac{1}{n}\sum_{i=1}^n \log \frac{f_\theta(X_i)}{f_{\theta_0}(X_i)} > 0\right) \to 0.$$
This implies that $\hat\theta_n \stackrel{P}{\to} \theta_0$.

The result of Theorem 7.3 may be extended in several ways; however, it is unfortunately not true in general that a maximum likelihood estimator is consistent, as demonstrated by the example of Problem 7.1.

If we return to the simple Example 7.1, we found that the MLE was found by solving the equation
$$\ell'(\theta) = 0. \tag{7.3}$$
Equation (7.3) is called the likelihood equation, and naturally a root of the likelihood equation is a good candidate for a maximum likelihood estimator. However, there may be no root, and there may be more than one. It turns out that the probability that at least one root exists goes to 1 as $n \to \infty$. Consider Example 7.1, in which no MLE exists whenever $X = 0$ or $X = n$. In that case, both $P(X = 0) = (1 - e^{-\theta})^n$ and $P(X = n) = e^{-n\theta}$ go to zero as $n \to \infty$. In the case of multiple roots, one of these roots is typically consistent for $\theta_0$, as stated in the following theorem.

Theorem 7.4 Suppose that $X_1, \ldots, X_n$ are independent and identically distributed with density $f_{\theta_0}(x)$ for $\theta_0$ in an open interval $\Omega \subset \mathbb{R}$, where the parameter is identifiable (i.e., different values of $\theta \in \Omega$ give different distributions). Furthermore, suppose that the loglikelihood function $\ell(\theta)$ is differentiable and that the support $\{x : f_\theta(x) > 0\}$ does not depend on $\theta$. Then with probability approaching 1 as $n \to \infty$, there exists $\hat\theta_n = \hat\theta_n(X_1, \ldots, X_n)$ such that $\ell'(\hat\theta_n) = 0$ and $\hat\theta_n \stackrel{P}{\to} \theta_0$.

Stated succinctly, Theorem 7.4 says that under certain regularity conditions, there is a consistent root of the likelihood equation.
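A minimal simulation sketch of the finite-$\Omega$ setting of Theorem 7.3, under our own illustrative choices (a Poisson model, $\Omega = \{1, 2, 3\}$, true value $\theta_0 = 2$, and a fixed seed): for moderate $n$, the MLE over the finite parameter set picks out $\theta_0$.

```python
# Sketch of Theorem 7.3: MLE over a finite parameter set is consistent.
# Model, Omega, theta_0, seed, and n are all illustrative choices.
import math
import random

def poisson_loglik(theta, xs):
    # loglikelihood up to the additive constant -sum(log(x!))
    return sum(x * math.log(theta) - theta for x in xs)

def poisson_draw(lam):
    # simple inversion sampler for Poisson(lam)
    u, k, p, s = random.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > s:
        k += 1
        p *= lam / k
        s += p
    return k

random.seed(1)
Omega, theta0, n = [1, 2, 3], 2, 400
xs = [poisson_draw(theta0) for _ in range(n)]
mle = max(Omega, key=lambda t: poisson_loglik(t, xs))
assert mle == theta0   # with n = 400 the error probability is negligible
```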
It is important to note that there is no guarantee that this consistent root is the MLE. However, if the likelihood equation has only a single root, we can be more precise:

Corollary 7.5 Under the conditions of Theorem 7.4, if for every $n$ there is a unique root of the likelihood equation, and this root is a local maximum, then this root is the MLE and the MLE is consistent.

Proof: The only thing that needs to be proved is the assertion that the unique root is the MLE. Denote the unique root by $\hat\theta_n$ and suppose there is some other point $\theta$ such that $\ell(\theta) \ge \ell(\hat\theta_n)$. Then there must be a local minimum between $\hat\theta_n$ and $\theta$, which contradicts the assertion that $\hat\theta_n$ is the unique root of the likelihood equation.

Exercises for Section 7.1

Exercise 7.1 In this problem, we explore an example in which the MLE is not consistent. Suppose that for $\theta \in (0, 1)$, $X$ is a continuous random variable with density
$$f_\theta(x) = \theta g(x) + \frac{1 - \theta}{\delta(\theta)}\, h\!\left(\frac{x - \theta}{\delta(\theta)}\right), \tag{7.4}$$
where $\delta(\theta) > 0$ for all $\theta$, $g(x) = I\{-1 < x < 1\}/2$, and $h(x) = \frac{3(1 - x^2)}{4}\, I\{-1 < x < 1\}$.

(a) What condition on $\delta(\theta)$ ensures that $\{x : f_\theta(x) > 0\}$ does not depend on $\theta$?

(b) With $\delta(\theta) = \exp\{-(1 - \theta)^{-4}\}$, let the true parameter value be $\theta = 0.2$. Take samples of sizes $n \in \{50, 250, 500\}$ from $f_\theta(x)$. In each case, graph the loglikelihood function and find the MLE. Also, try to identify the consistent root of the likelihood equation in each case.

Hints: To generate a sample from $f_\theta(x)$, note that $f_\theta(x)$ is a mixture density, which means you can start by generating a standard uniform random variable. If it's less than $\theta$, generate a uniform variable on $(-1, 1)$. Otherwise, generate a variable with density $3(\delta^2 - x^2)/4\delta^3$ on $(-\delta, \delta)$ and then add $\theta$. You should be able to do this by inverting the distribution function or by using appropriately scaled and translated beta(2, 2) variables. If you use the inverse distribution function method, verify that
$$H^{-1}(u) = 2\cos\left\{\frac{4\pi}{3} + \frac{1}{3}\cos^{-1}(1 - 2u)\right\}.$$
Be very careful when graphing the loglikelihood and finding the MLE. In particular, make sure you evaluate the loglikelihood analytically at each of the sample

points in $(0, 1)$ and incorporate these analytical calculations into your code; if you fail to do this, you'll miss the point of the problem and you'll get the MLE incorrect. This is because the correct loglikelihood graph will have tall, extremely narrow spikes.

Exercise 7.2 In the situation of Exercise 7.1, prove that the MLE is inconsistent.

Exercise 7.3 Suppose that $X_1, \ldots, X_n$ are independent and identically distributed with density $f_\theta(x)$, where $\theta \in (0, \infty)$. For each of the following forms of $f_\theta(x)$, prove that the likelihood equation has a unique solution and that this solution maximizes the likelihood function.

(a) Weibull: For some constant $a > 0$,
$$f_\theta(x) = a\theta^a x^{a-1} \exp\{-(\theta x)^a\}\, I\{x > 0\}$$

(b) Cauchy:
$$f_\theta(x) = \frac{\theta}{\pi}\cdot\frac{1}{x^2 + \theta^2}$$

(c)
$$f_\theta(x) = \frac{3\theta^2\sqrt{3}}{2\pi(x^3 + \theta^3)}\, I\{x > 0\}$$

Exercise 7.4 Find the MLE and its asymptotic distribution given a random sample of size $n$ from $f_\theta(x) = (1 - \theta)\theta^x$, $x = 0, 1, 2, \ldots$, $\theta \in (0, 1)$.

Hint: For the asymptotic distribution, use the central limit theorem.

7.2 Asymptotic normality of the MLE

As seen in the preceding section, the MLE is not necessarily even consistent, let alone asymptotically normal, so the title of this section is slightly misleading; however, "Asymptotic normality of the consistent root of the likelihood equation" is a bit too long! It will be necessary to review a few facts regarding Fisher information before we proceed.

Definition 7.6 Fisher information: For a density (or mass) function $f_\theta(x)$, the Fisher information function is given by
$$I(\theta) = E_\theta \left\{\frac{d}{d\theta} \log f_\theta(X)\right\}^2. \tag{7.5}$$

If $\eta = g(\theta)$ for some invertible and differentiable function $g(\cdot)$, then since
$$\frac{d}{d\eta} = \frac{d\theta}{d\eta}\frac{d}{d\theta} = \frac{1}{g'(\theta)}\frac{d}{d\theta}$$
by the chain rule, we conclude that
$$I(\eta) = \frac{I(\theta)}{\{g'(\theta)\}^2}. \tag{7.6}$$

Loosely speaking, $I(\theta)$ is the amount of information about $\theta$ contained in a single observation from the density $f_\theta(x)$. However, this interpretation doesn't always make sense; for example, it is possible to have $I(\theta) = 0$ for a very informative observation. See Exercise 7.5.

Although we do not dwell on this fact in this course, expectation may be viewed as integration. Suppose that $f_\theta(x)$ is twice differentiable with respect to $\theta$ and that the operations of differentiation and integration may be interchanged in the following sense:
$$\frac{d}{d\theta}\, E_\theta \frac{f_\theta(X)}{f_\theta(X)} = E_\theta \frac{\frac{d}{d\theta} f_\theta(X)}{f_\theta(X)} \tag{7.7}$$
and
$$\frac{d^2}{d\theta^2}\, E_\theta \frac{f_\theta(X)}{f_\theta(X)} = E_\theta \frac{\frac{d^2}{d\theta^2} f_\theta(X)}{f_\theta(X)}. \tag{7.8}$$
Since $f_\theta(X)/f_\theta(X) \equiv 1$, the left hand sides of Equations (7.7) and (7.8) equal zero, and these equations give two additional expressions for $I(\theta)$. From Equation (7.7) follows
$$I(\theta) = \operatorname{Var}_\theta \left\{\frac{d}{d\theta} \log f_\theta(X)\right\}, \tag{7.9}$$
and Equation (7.8) implies
$$I(\theta) = -E_\theta \left\{\frac{d^2}{d\theta^2} \log f_\theta(X)\right\}. \tag{7.10}$$
In many cases, Equation (7.10) is the easiest form of the information to work with.

Equations (7.9) and (7.10) make clear a helpful property of the information, namely that for independent random variables, the information about $\theta$ contained in the joint sample is simply the sum of the individual information components. In particular, if we have a simple random sample of size $n$ from $f_\theta(x)$, then the information about $\theta$ equals $nI(\theta)$.
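The agreement of the three expressions (7.5), (7.9), and (7.10) can be checked numerically for the geometric density of Exercise 7.4 by summing over a truncated support (the parameter value and truncation point are our own choices; the closed form $1/\{\theta(1-\theta)^2\}$ used for comparison is standard for this family):

```python
# Sketch: three equivalent forms of Fisher information for the geometric
# density f_theta(x) = (1 - theta) * theta**x, x = 0, 1, 2, ...
theta = 0.4
xs = range(200)  # truncate the support; the neglected tail mass is ~1e-80
pmf = [(1 - theta) * theta**x for x in xs]
score = [x / theta - 1 / (1 - theta) for x in xs]          # d/dtheta log f
curv = [-x / theta**2 - 1 / (1 - theta)**2 for x in xs]    # d2/dtheta2 log f

i1 = sum(p * s * s for p, s in zip(pmf, score))            # (7.5): E[score^2]
mean_score = sum(p * s for p, s in zip(pmf, score))        # should be ~0
i2 = sum(p * (s - mean_score)**2 for p, s in zip(pmf, score))  # (7.9): Var
i3 = -sum(p * c for p, c in zip(pmf, curv))                # (7.10): -E[curv]

exact = 1 / (theta * (1 - theta)**2)   # known closed form for this family
for val in (i1, i2, i3):
    assert abs(val - exact) < 1e-6
```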

The reason that we need the Fisher information is that we will show that under certain assumptions (often called "regularity conditions"),
$$\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N\left\{0, \frac{1}{I(\theta_0)}\right\}, \tag{7.11}$$
where $\hat\theta_n$ is the consistent root of the likelihood equation guaranteed to exist by Theorem 7.4.

Example 7.7 Suppose that $X_1, \ldots, X_n$ are independent Poisson($\theta_0$) random variables. Then the likelihood equation has a unique root, namely $\hat\theta_n = \bar{X}_n$, and we know that by the central limit theorem $\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N(0, \theta_0)$. Furthermore, the Fisher information for a single observation in this case is
$$-E_\theta \left\{\frac{d^2}{d\theta^2} \log f_\theta(X)\right\} = E_\theta \frac{X}{\theta^2} = \frac{1}{\theta}.$$
Thus, in this example, Equation (7.11) holds.

Rather than stating all of the regularity conditions necessary to prove Equation (7.11), we work backwards, figuring out the conditions as we go through the steps of the proof. The first step is to expand $\ell'(\hat\theta_n)$ in a Taylor series around $\theta_0$. Let us introduce the notation $\ell_i(\theta)$ to denote the contribution to the loglikelihood from the $i$th observation; that is, $\ell_i(\theta) = \log f_\theta(X_i)$. Thus, we obtain
$$\ell(\theta) = \sum_{i=1}^n \ell_i(\theta).$$
For the Taylor expansion of $\ell'(\hat\theta_n)$, let $e_i(\hat\theta_n, \theta_0)$ denote the remainder for a first-order expansion of $\ell_i'(\hat\theta_n)$. That is, we define $e_i(\hat\theta_n, \theta_0)$ so that
$$\ell_i'(\hat\theta_n) = \ell_i'(\theta_0) + (\hat\theta_n - \theta_0)\ell_i''(\theta_0) + e_i(\hat\theta_n, \theta_0).$$
Summing over $i$, we obtain
$$\ell'(\hat\theta_n) = \ell'(\theta_0) + (\hat\theta_n - \theta_0)[\ell''(\theta_0) + E_n], \tag{7.12}$$
where $E_n = \sum_{i=1}^n e_i(\hat\theta_n, \theta_0)/(\hat\theta_n - \theta_0)$, or, in the event $\hat\theta_n = \theta_0$, $E_n = 0$. (Remember, $\hat\theta_n$ and $E_n$ are random variables.)

Rewriting Equation (7.12) gives
$$\sqrt{n}(\hat\theta_n - \theta_0) = \frac{\sqrt{n}\{\ell'(\hat\theta_n) - \ell'(\theta_0)\}}{\ell''(\theta_0) + E_n} = \frac{\frac{1}{\sqrt{n}}\{\ell'(\theta_0) - \ell'(\hat\theta_n)\}}{-\frac{1}{n}\ell''(\theta_0) - \frac{1}{n}E_n}. \tag{7.13}$$

Let's consider the pieces of Equation (7.13) individually. If Equation (7.7) holds and $I(\theta_0) < \infty$, then
$$\frac{1}{\sqrt{n}}\,\ell'(\theta_0) = \sqrt{n}\left(\frac{1}{n}\sum_{i=1}^n \frac{d}{d\theta} \log f_{\theta_0}(X_i)\right) \stackrel{d}{\to} N\{0, I(\theta_0)\}$$
by the central limit theorem and Equation (7.9). If Equation (7.8) holds, then
$$-\frac{1}{n}\,\ell''(\theta_0) = -\frac{1}{n}\sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_{\theta_0}(X_i) \stackrel{P}{\to} I(\theta_0)$$
by the weak law of large numbers and Equation (7.10). And we relied on the conditions of Theorem 7.4 to guarantee the existence of $\hat\theta_n$ such that $\ell'(\hat\theta_n) = 0$ with probability approaching one and $\hat\theta_n \stackrel{P}{\to} \theta_0$ (do you see where we used the latter fact?).

Finally, we need a condition that ensures that $\frac{1}{n}E_n \stackrel{P}{\to} 0$. One way this is often done is as follows: If we assume that the third derivative $\ell_i'''(\theta)$ exists and is uniformly bounded in a neighborhood of $\theta_0$, say by the constant $K_0$, we may write the Taylor theorem remainder $e_i(\hat\theta_n, \theta_0)$ in the form of equation (1.7) to obtain
$$\frac{1}{n}E_n = \frac{1}{\hat\theta_n - \theta_0}\cdot\frac{1}{n}\sum_{i=1}^n e_i(\hat\theta_n, \theta_0) = \frac{\hat\theta_n - \theta_0}{2n}\sum_{i=1}^n \ell_i'''(\theta_{in}^*),$$
where each $\theta_{in}^*$ is between $\hat\theta_n$ and $\theta_0$. Therefore, since $\hat\theta_n \stackrel{P}{\to} \theta_0$, we know that with probability approaching 1 as $n \to \infty$,
$$\left|\frac{1}{n}E_n\right| \le \frac{|\hat\theta_n - \theta_0|}{2n}\, nK_0 = \frac{K_0}{2}\,|\hat\theta_n - \theta_0|,$$
which means that $\frac{1}{n}E_n \stackrel{P}{\to} 0$.

In conclusion, if all of our assumptions hold, then the numerator of (7.13) converges in distribution to $N\{0, I(\theta_0)\}$ by Slutsky's theorem. Furthermore, the denominator in (7.13) converges in probability to $I(\theta_0)$, so a second use of Slutsky's theorem gives the following theorem.

Theorem 7.8 Suppose that the conditions of Theorem 7.4 are satisfied, and let $\hat\theta_n$ denote a consistent root of the likelihood equation. Assume also that Equations (7.7) and (7.8) hold and that $0 < I(\theta_0) < \infty$. Finally, assume that $\ell_i(\theta)$ has three derivatives in some neighborhood of $\theta_0$ and that $\ell_i'''(\theta)$ is uniformly bounded in this neighborhood. Then
$$\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N\left\{0, \frac{1}{I(\theta_0)}\right\}.$$
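Example 7.7 suggests a simple Monte Carlo check of the conclusion of Theorem 7.8 (the seed, $\theta_0$, $n$, and replication count below are our own illustrative choices): the empirical variance of $\sqrt{n}(\bar{X}_n - \theta_0)$ for Poisson data should be close to $\theta_0 = 1/I(\theta_0)$.

```python
# Monte Carlo sketch of Example 7.7 / Theorem 7.8 for the Poisson model:
# the variance of sqrt(n)*(Xbar - theta_0) should be near 1/I(theta_0).
import math
import random

def poisson_draw(lam):
    # simple inversion sampler for Poisson(lam)
    u, k, p, s = random.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > s:
        k += 1
        p *= lam / k
        s += p
    return k

random.seed(2)
theta0, n, reps = 2.0, 100, 2000
vals = []
for _ in range(reps):
    xbar = sum(poisson_draw(theta0) for _ in range(n)) / n
    vals.append(math.sqrt(n) * (xbar - theta0))

var = sum(v * v for v in vals) / reps
assert abs(var - theta0) < 0.25   # near theta_0 = 1/I(theta_0) = 2
```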

Sometimes, it is not possible to find an exact zero of $\ell'(\theta)$. One way to get a numerical approximation to a zero of $\ell'(\theta)$ is to use Newton's method, in which we start at a point $\theta_0$ and then set
$$\theta_1 = \theta_0 - \frac{\ell'(\theta_0)}{\ell''(\theta_0)}. \tag{7.14}$$
Ordinarily, after finding $\theta_1$ we would set $\theta_0$ equal to $\theta_1$ and apply Equation (7.14) iteratively. However, we may show that by using a single step of Newton's method, starting from a $\sqrt{n}$-consistent estimator of $\theta_0$, we may obtain an estimator with the same asymptotic distribution as $\hat\theta_n$. A $\sqrt{n}$-consistent estimator is an estimator of $\theta_0$, say $\tilde\theta_n$, with the property that $\sqrt{n}(\tilde\theta_n - \theta_0)$ is bounded in probability. For the full definition of "bounded in probability," refer to Exercise 2.2, but a sufficient condition is that $\sqrt{n}(\tilde\theta_n - \theta_0)$ converges in distribution to any random variable.

The proof of the following theorem is left as an exercise:

Theorem 7.9 Suppose that $\tilde\theta_n$ is any $\sqrt{n}$-consistent estimator of $\theta_0$. Then under the conditions of Theorem 7.8, if we set
$$\delta_n = \tilde\theta_n - \frac{\ell'(\tilde\theta_n)}{\ell''(\tilde\theta_n)}, \tag{7.15}$$
then
$$\sqrt{n}(\delta_n - \theta_0) \stackrel{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right).$$

Exercises for Section 7.2

Exercise 7.5 Suppose that $X$ is a normal random variable with mean $\theta^3$ and known variance $\sigma^2$. Calculate $I(\theta)$, then argue that the Fisher information can be zero in a case in which there is information about $\theta$ in the observed value of $X$.

Exercise 7.6 (a) Show that under the conditions of Theorem 7.8, if $\hat\theta_n$ is a consistent root of the likelihood equation, then $P_{\theta_0}(\hat\theta_n \text{ is a local maximum}) \to 1$.

(b) Using the result of part (a), show that for any two sequences $\hat\theta_{1n}$ and $\hat\theta_{2n}$ of consistent roots of the likelihood equation, $P_{\theta_0}(\hat\theta_{1n} = \hat\theta_{2n}) \to 1$.

Exercise 7.7 Prove Theorem 7.9.

Hint: Start with $\sqrt{n}(\delta_n - \theta_0) = \sqrt{n}(\delta_n - \tilde\theta_n) + \sqrt{n}(\tilde\theta_n - \theta_0)$, then expand $\ell'(\tilde\theta_n)$ in a Taylor series about $\theta_0$ and substitute the result into Equation (7.15). After simplifying, use the result of Exercise 2.2 along with arguments similar to those leading up to Theorem 7.8.

Exercise 7.8 Suppose that the following is a random sample from a logistic density with distribution function $F_\theta(x) = (1 + \exp\{\theta - x\})^{-1}$ (I'll cheat and tell you that I used $\theta = 2$):

.0944  6.4723  3.80  3.838  4.262  .2853  .0439  .7472  4.9483  .700  .0422  0.690  3.6  0.9970  2.9438

(a) Evaluate the unique root of the likelihood equation numerically. Then, taking the sample median as our known $\sqrt{n}$-consistent estimator $\tilde\theta_n$ of $\theta$, evaluate the estimator $\delta_n$ in Equation (7.15) numerically.

(b) Find the asymptotic distributions of $\sqrt{n}(\tilde\theta_n - 2)$ and $\sqrt{n}(\delta_n - 2)$. Then, simulate 200 samples of size $n = 15$ from the logistic distribution with $\theta = 2$. Find the sample variances of the resulting sample medians and $\delta_n$-estimators. How well does the asymptotic theory match reality?

7.3 Asymptotic Efficiency and Superefficiency

In Theorem 7.8, we showed that a consistent root $\hat\theta_n$ of the likelihood equation satisfies
$$\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right).$$
In Theorem 7.9, we stated that if $\tilde\theta_n$ is a $\sqrt{n}$-consistent estimator of $\theta_0$ and $\delta_n = \tilde\theta_n - \ell'(\tilde\theta_n)/\ell''(\tilde\theta_n)$, then
$$\sqrt{n}(\delta_n - \theta_0) \stackrel{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right). \tag{7.16}$$
Note the similarity in the last two asymptotic limit statements. There seems to be something special about the limiting variance $1/I(\theta_0)$, and in fact this is true. Much like the Cramér-Rao lower bound states that (under some regularity conditions) an unbiased estimator of $\theta_0$ cannot have a variance smaller than $1/I(\theta_0)$, the following result is true:

Theorem 7.10 Suppose that the conditions of Theorem 7.8 are satisfied and that $\delta_n$ is an estimator satisfying
$$\sqrt{n}(\delta_n - \theta_0) \stackrel{d}{\to} N\{0, v(\theta_0)\}$$
for all $\theta_0$, where $v(\theta)$ is continuous. Then $v(\theta) \ge 1/I(\theta)$ for all $\theta$.

In other words, $1/I(\theta)$ is, in the sense of Theorem 7.10, the smallest possible asymptotic variance for an estimator. For this reason, we refer to any estimator $\delta_n$ satisfying (7.16) for all $\theta_0$ as an efficient estimator.

One condition in Theorem 7.10 that may be a bit puzzling is the condition that $v(\theta)$ be continuous. If this condition is dropped, then a well-known counterexample, due to Hodges, exists:

Example 7.11 Suppose that $\delta_n$ is an efficient estimator of $\theta_0$. Then if we define
$$\delta_n^* = \begin{cases} 0 & \text{if } n^{1/4}|\delta_n| < 1; \\ \delta_n & \text{otherwise,} \end{cases}$$
it is possible to show (see Exercise 7.9) that $\delta_n^*$ is superefficient in the sense that
$$\sqrt{n}(\delta_n^* - \theta_0) \stackrel{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right)$$
for all $\theta_0 \ne 0$, but
$$\sqrt{n}(\delta_n^* - \theta_0) \stackrel{d}{\to} 0$$
if $\theta_0 = 0$. In other words, when the true value of the parameter $\theta_0$ is 0, then $\delta_n^*$ does much better than an efficient estimator; yet when $\theta_0 \ne 0$, $\delta_n^*$ does just as well.

Just as the invariance property of maximum likelihood estimation states that the MLE of a function of $\theta$ equals the same function applied to the MLE of $\theta$, a function of an efficient estimator is itself efficient:

Theorem 7.12 If $\delta_n$ is efficient for $\theta_0$, and if $g(\theta)$ is a differentiable and invertible function with $g'(\theta_0) \ne 0$, then $g(\delta_n)$ is efficient for $g(\theta_0)$.

The proof of the above theorem follows immediately from the delta method, since the Fisher information for $g(\theta)$ is $I(\theta)/\{g'(\theta)\}^2$ by Equation (7.6). In fact, if one simply remembers the content of Theorem 7.12, then it is not necessary to memorize Equation (7.6), for it is always possible to rederive this equation quickly using the delta method!

We have already noted that (under suitable regularity conditions) if $\tilde\theta_n$ is a $\sqrt{n}$-consistent estimator of $\theta_0$ and
$$\delta_n = \tilde\theta_n - \frac{\ell'(\tilde\theta_n)}{\ell''(\tilde\theta_n)}, \tag{7.17}$$

then $\delta_n$ is an efficient estimator of $\theta_0$. Alternatively, we may set
$$\delta_n = \tilde\theta_n + \frac{\ell'(\tilde\theta_n)}{nI(\tilde\theta_n)} \tag{7.18}$$
and $\delta_n$ is also an efficient estimator of $\theta_0$. Problem 7.7 asked you to prove the former fact, regarding (7.17); the latter fact, regarding (7.18), is proved in nearly the same way because
$$-\frac{1}{n}\,\ell''(\tilde\theta_n) \stackrel{P}{\to} I(\theta_0).$$
In Equation (7.17), as already remarked earlier, $\delta_n$ results from a single step of Newton's method; in Equation (7.18), $\delta_n$ results from a similar method called Fisher scoring. As is clear from comparing Equations (7.17) and (7.18), scoring differs from Newton's method in that the Fisher information is used in place of the negative second derivative of the loglikelihood function. In some examples, scoring and Newton's method are equivalent.

A note on terminology: The derivative of $\ell(\theta)$ is sometimes called the score function. Furthermore, $nI(\theta)$ and $-\ell''(\theta)$ are sometimes referred to as the expected information and the observed information, respectively.

Example 7.13 Suppose $X_1, \ldots, X_n$ are independent from a Cauchy location family with density
$$f_\theta(x) = \frac{1}{\pi\{1 + (x - \theta)^2\}}.$$
Then
$$\ell'(\theta) = \sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2},$$
so the likelihood equation is very difficult to solve. However, an efficient estimator may still be created by starting with some $\sqrt{n}$-consistent estimator $\tilde\theta_n$, say the sample median, and using either Equation (7.17) or Equation (7.18). In the latter case, since $I(\theta) = \frac{1}{2}$ for this family, we obtain the simple estimator
$$\delta_n = \tilde\theta_n + \frac{2}{n}\,\ell'(\tilde\theta_n), \tag{7.19}$$
verification of which is the subject of Problem 7.10.

For the remainder of this section, we turn our attention to Bayes estimators, which give yet another source of efficient estimators. A Bayes estimator is the expected value of the posterior distribution, which of course depends on the prior chosen. Although we do not prove this fact here (see Ferguson for details), any Bayes estimator is efficient under some very general conditions.
The conditions are essentially those of Theorem 7.8 along with the stipulation that the prior density is positive everywhere on $\Omega$. (Note that if the prior density is not positive on $\Omega$, the Bayes estimator may not even be consistent.)
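Returning to Example 7.13, the one-step estimator (7.19) is easy to sketch in code (the simulated data, seed, sample size, and tolerance below are our own illustrative choices; the sample median serves as the $\sqrt{n}$-consistent starting value):

```python
# Sketch of the Cauchy one-step (Fisher scoring) estimator of Equation
# (7.19): delta_n = median + (2/n) * l'(median). Data are simulated from a
# Cauchy location family with true theta_0 = 1 via the inverse-CDF method.
import math
import random
import statistics

random.seed(3)
theta0, n = 1.0, 500
# tan(pi*(U - 1/2)) is a standard Cauchy draw when U ~ Uniform(0, 1)
xs = [theta0 + math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

def lprime(t):
    # score of the Cauchy location loglikelihood at t
    return sum(2 * (x - t) / (1 + (x - t) ** 2) for x in xs)

med = statistics.median(xs)            # sqrt(n)-consistent starting value
delta = med + (2 / n) * lprime(med)    # one Fisher-scoring step, (7.19)
assert abs(delta - theta0) < 0.3       # should be near theta_0 at this n
```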

Example 7.14 Consider the binomial distribution with beta prior, say $X \sim \text{binomial}(n, p)$ and $p \sim \text{beta}(a, b)$. Then the posterior density of $p$ is proportional to the product of the likelihood and the prior, which (ignoring multiplicative constants not involving $p$) equals
$$p^{a-1}(1 - p)^{b-1} \cdot p^X (1 - p)^{n-X}.$$
Therefore, the posterior distribution of $p$ is beta($a + X$, $b + n - X$). The Bayes estimator is the expectation of this distribution, which equals $(a + X)/(a + b + n)$. If we let $\gamma_n$ denote the Bayes estimator here, then
$$\sqrt{n}(\gamma_n - p) = \sqrt{n}\left(\frac{X}{n} - p\right) + \sqrt{n}\left(\gamma_n - \frac{X}{n}\right).$$
We may see that the rightmost term converges to zero in probability by writing
$$\sqrt{n}\left(\gamma_n - \frac{X}{n}\right) = \frac{\sqrt{n}}{a + b + n}\left(a - (a + b)\frac{X}{n}\right),$$
since $a - (a + b)X/n \stackrel{P}{\to} a - (a + b)p$ by the weak law of large numbers. In other words, the Bayes estimator in this example has the same limiting distribution as the MLE, $X/n$. It is possible to verify that the MLE is efficient in this case.

The central question when constructing a Bayes estimator is how to choose the prior distribution. We consider one class of prior distributions, called Jeffreys priors. Note that these priors are named for Harold Jeffreys, so it is incorrect to write "Jeffrey's priors."

For a Bayes estimator $\hat\theta_n$ of $\theta_0$, we have
$$\sqrt{n}(\hat\theta_n - \theta_0) \stackrel{d}{\to} N\left(0, \frac{1}{I(\theta_0)}\right).$$
Since the limiting variance of $\hat\theta_n$ depends on $I(\theta_0)$, if $I(\theta)$ is not a constant, then some values of $\theta$ may be estimated more precisely than others. In analogy with the idea of variance-stabilizing transformations seen in Section 5.1.2, we might consider a reparameterization $\eta = g(\theta)$ such that $\{g'(\theta)\}^2/I(\theta)$ is a constant. More precisely, if $g'(\theta) = c\sqrt{I(\theta)}$, then
$$\sqrt{n}\{g(\hat\theta_n) - \eta_0\} \stackrel{d}{\to} N(0, c^2).$$
So as not to influence the estimation of $\eta$, we choose as the Jeffreys prior a uniform prior on $\eta$. Therefore, the Jeffreys prior density on $\theta$ is proportional to $g'(\theta)$, which is proportional to $\sqrt{I(\theta)}$. Note that this may lead to an improper prior.
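The conjugate update in Example 7.14 can be verified mechanically; the numeric values of $a$, $b$, $n$, $X$ below are our own illustration, and exact rational arithmetic keeps the check free of rounding:

```python
# Sketch of Example 7.14: with X ~ binomial(n, p) and p ~ beta(a, b), the
# posterior is beta(a + X, b + n - X), so the Bayes estimator (posterior
# mean) is (a + X)/(a + b + n). Values of a, b, n, X are illustrative.
from fractions import Fraction

a, b, n, X = 2, 3, 10, 7
post_a, post_b = a + X, b + n - X          # conjugate beta update
bayes = Fraction(post_a, post_a + post_b)  # mean of beta(post_a, post_b)
assert bayes == Fraction(a + X, a + b + n)

# The Bayes estimator sits between the prior mean and the MLE X/n,
# and is closer to the MLE than the prior mean is:
assert abs(float(bayes) - X / n) < abs(Fraction(a, a + b) - Fraction(X, n))
```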

Example 7.15 In the case of Example 7.14, we may verify that for a Bernoulli($p$) observation,
$$I(p) = -E\,\frac{d^2}{dp^2}\{X\log p + (1 - X)\log(1 - p)\} = E\left(\frac{X}{p^2} + \frac{1 - X}{(1 - p)^2}\right) = \frac{1}{p(1 - p)}.$$
Thus, the Jeffreys prior on $p$ in this case has a density proportional to $p^{-1/2}(1 - p)^{-1/2}$. In other words, the prior is beta($\frac{1}{2}$, $\frac{1}{2}$). Therefore, the Bayes estimator corresponding to the Jeffreys prior is
$$\gamma_n = \frac{X + \frac{1}{2}}{n + 1}.$$

Exercises for Section 7.3

Exercise 7.9 Verify the claim made in Example 7.11.

Exercise 7.10 If $f_\theta(x)$ forms a location family, so that $f_\theta(x) = f(x - \theta)$ for some density $f(x)$, then the Fisher information $I(\theta)$ is a constant (you may assume this fact without proof).

(a) Verify that for the Cauchy location family
$$f_\theta(x) = \frac{1}{\pi\{1 + (x - \theta)^2\}},$$
we have $I(\theta) = \frac{1}{2}$.

(b) For 500 samples of size $n = 5$ from a standard Cauchy distribution, calculate the sample median $\tilde\theta_n$ and the efficient estimator $\delta_n$ of Equation (7.19). Compare the variances of $\tilde\theta_n$ and $\delta_n$ with their theoretical asymptotic limits.

Exercise 7.11 (a) Derive the Jeffreys prior on $\theta$ for a random sample from Poisson($\theta$). Is this prior proper or improper?

(b) What is the Bayes estimator of $\theta$ for the Jeffreys prior? Verify directly that this estimator is efficient.

Exercise 7.12 (a) Derive the Jeffreys prior on $\sigma^2$ for a random sample from $N(0, \sigma^2)$. Is this prior proper or improper?

(b) What is the Bayes estimator of $\sigma^2$ for the Jeffreys prior? Verify directly that this estimator is efficient.

7.4 The multiparameter case

Suppose now that the parameter is the vector $\theta = (\theta_1, \ldots, \theta_k)$. If $X \sim f_\theta(x)$, then $I(\theta)$, the information matrix, is the $k \times k$ matrix
$$I(\theta) = E_\theta\{h_\theta(X)\,h_\theta(X)^\top\},$$
where $h_\theta(x) = \nabla_\theta[\log f_\theta(x)]$. This is a rare instance in which it's probably clearer to use componentwise notation than vector notation:

Definition 7.16 Given a $k$-dimensional parameter vector $\theta$ and a density function $f_\theta(x)$, the Fisher information matrix $I(\theta)$ is the $k \times k$ matrix whose $(i, j)$ element equals
$$I_{ij}(\theta) = E_\theta\left\{\frac{\partial}{\partial\theta_i}\log f_\theta(X)\cdot\frac{\partial}{\partial\theta_j}\log f_\theta(X)\right\},$$
as long as the above quantity is defined for all $(i, j)$.

Note that the one-dimensional Definition 7.6 is a special case of Definition 7.16.

Example 7.17 Let $\theta = (\mu, \sigma^2)$ and suppose $X \sim N(\mu, \sigma^2)$. Then
$$\log f_\theta(x) = -\frac{1}{2}\log\sigma^2 - \frac{(x - \mu)^2}{2\sigma^2} - \frac{1}{2}\log 2\pi,$$
so
$$\frac{\partial}{\partial\mu}\log f_\theta(x) = \frac{x - \mu}{\sigma^2} \quad\text{and}\quad \frac{\partial}{\partial\sigma^2}\log f_\theta(x) = \frac{1}{2\sigma^2}\left(\frac{(x - \mu)^2}{\sigma^2} - 1\right).$$
Thus, the entries in the information matrix are as follows:
$$I_{11}(\theta) = E_\theta\,\frac{(X - \mu)^2}{\sigma^4} = \frac{1}{\sigma^2},$$
$$I_{12}(\theta) = I_{21}(\theta) = E_\theta\left(\frac{(X - \mu)^3}{2\sigma^6} - \frac{X - \mu}{2\sigma^4}\right) = 0,$$
$$I_{22}(\theta) = E_\theta\left(\frac{(X - \mu)^4}{4\sigma^8} - \frac{(X - \mu)^2}{2\sigma^6} + \frac{1}{4\sigma^4}\right) = \frac{3\sigma^4}{4\sigma^8} - \frac{\sigma^2}{2\sigma^6} + \frac{1}{4\sigma^4} = \frac{1}{2\sigma^4}.$$
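The entries computed in Example 7.17 can be checked by Monte Carlo, averaging products of the score components over simulated normal draws (the values of $\mu$, $\sigma^2$, the seed, sample size, and tolerances are our own choices):

```python
# Monte Carlo sketch of Example 7.17: for N(mu, sigma^2), the information
# matrix should be approximately diag(1/sigma^2, 1/(2*sigma^4)).
import random

random.seed(5)
mu, sig2 = 1.0, 2.0
N = 200_000
draws = [random.gauss(mu, sig2 ** 0.5) for _ in range(N)]

# score components d/dmu log f and d/dsigma^2 log f at the true parameter
s1 = [(x - mu) / sig2 for x in draws]
s2 = [((x - mu) ** 2 / sig2 - 1) / (2 * sig2) for x in draws]

I11 = sum(a * a for a in s1) / N          # estimates 1/sigma^2 = 0.5
I12 = sum(a * b for a, b in zip(s1, s2)) / N   # estimates 0
I22 = sum(b * b for b in s2) / N          # estimates 1/(2*sigma^4) = 0.125

assert abs(I11 - 1 / sig2) < 0.02
assert abs(I12) < 0.02
assert abs(I22 - 1 / (2 * sig2 ** 2)) < 0.02
```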

As in the one-dimensional case, Definition 7.16 is often not the easiest form of $I(\theta)$ to work with. This fact is illustrated by Example 7.17, in which the calculation of the information matrix requires the evaluation of a fourth moment as well as a lot of algebra. However, in analogy with Equations (7.7) and (7.8), if
\[
0 = \frac{\partial}{\partial\theta_i}\, E_\theta \frac{f_\theta(X)}{f_\theta(X)} = E_\theta \frac{\frac{\partial}{\partial\theta_i} f_\theta(X)}{f_\theta(X)}
\quad\text{and}\quad
0 = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\, E_\theta \frac{f_\theta(X)}{f_\theta(X)} = E_\theta \frac{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} f_\theta(X)}{f_\theta(X)} \tag{7.20}
\]
for all $i$ and $j$, then the following alternative forms of $I(\theta)$ are valid:
\[
I_{ij}(\theta) = \mathrm{Cov}_\theta\left(\frac{\partial}{\partial\theta_i}\log f_\theta(X),\; \frac{\partial}{\partial\theta_j}\log f_\theta(X)\right) \tag{7.21}
\]
\[
\phantom{I_{ij}(\theta)} = -E_\theta\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_\theta(X)\right). \tag{7.22}
\]

Example 7.18 In the normal case of Example 7.17, the information matrix is perhaps a bit easier to compute using Equation (7.22), since we obtain
\[
\frac{\partial^2}{\partial\mu^2}\log f_\theta(x) = -\frac{1}{\sigma^2}, \qquad
\frac{\partial^2}{\partial\mu\,\partial\sigma^2}\log f_\theta(x) = -\frac{x-\mu}{\sigma^4},
\]
and
\[
\frac{\partial^2}{\partial(\sigma^2)^2}\log f_\theta(x) = \frac{1}{2\sigma^4} - \frac{(x-\mu)^2}{\sigma^6}.
\]
Taking expectations gives
\[
I(\theta) = \begin{pmatrix} \frac{1}{\sigma^2} & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}
\]
as before, but without requiring any fourth moments.

By Equation (7.21), the Fisher information matrix is nonnegative definite, just as the Fisher information is nonnegative in the one-parameter case. A further fact that generalizes to the multiparameter case is the additivity of information: If $X$ and $Y$ are independent, then $I_X(\theta) + I_Y(\theta) = I_{(X,Y)}(\theta)$. Finally, suppose that $\eta = g(\theta)$ is a reparameterization, where $g(\theta)$ is invertible and differentiable. If $J$ is the Jacobian matrix of the inverse transformation (i.e., $J_{ij} = \partial\theta_i/\partial\eta_j$), then the information about $\eta$ is $I(\eta) = J^T I(\theta) J$.

As we'll see later, almost all of the same efficiency results that applied to the one-parameter case apply to the multiparameter case as well. In particular, we will see that an efficient estimator $\hat\theta$ is one that satisfies
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}.
\]
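As a small illustration of the reparameterization formula $I(\eta) = J^T I(\theta) J$, take $\eta = (\mu, \sigma)$ in the normal model above, so that $\theta = (\eta_1, \eta_2^2)$ and the inverse map has Jacobian $J = \mathrm{diag}(1, 2\sigma)$; the sketch below (names are mine) recovers the familiar information $2/\sigma^2$ for $\sigma$.

```python
def info_theta(sigma2):
    """Fisher information for theta = (mu, sigma^2) in N(mu, sigma^2)."""
    return [[1.0 / sigma2, 0.0], [0.0, 1.0 / (2.0 * sigma2 ** 2)]]

def reparam_info(sigma):
    """I(eta) = J^T I(theta) J for eta = (mu, sigma), theta = (mu, sigma^2).

    The inverse map theta = (eta_1, eta_2^2) has Jacobian J = diag(1, 2*sigma);
    since everything here is diagonal, J^T I J reduces to entrywise products."""
    J = [1.0, 2.0 * sigma]
    I = info_theta(sigma ** 2)
    return [[J[0] * I[0][0] * J[0], 0.0], [0.0, J[1] * I[1][1] * J[1]]]
```

For example, $\sigma = 2$ gives $I(\eta) = \mathrm{diag}(1/4,\, 1/2)$, matching $1/\sigma^2$ and $2/\sigma^2$.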

Note that the formula for the information under a reparameterization implies that if $\hat\eta$ is an efficient estimator for $\eta_0$ and the matrix $J$ is invertible, then $g^{-1}(\hat\eta)$ is an efficient estimator for $\theta_0 = g^{-1}(\eta_0)$, since
\[
\sqrt{n}\{g^{-1}(\hat\eta) - g^{-1}(\eta_0)\} \xrightarrow{d} N_k\{0,\, J(J^T I(\theta_0) J)^{-1} J^T\},
\]
and the covariance matrix above equals $I^{-1}(\theta_0)$.

As in the one-parameter case, the likelihood equation is obtained by setting the derivative of $l(\theta)$ equal to zero. In the multiparameter case, though, the gradient $\nabla l(\theta)$ is a $k \times 1$ vector, so the likelihood equation becomes $\nabla l(\theta) = 0$. Since there are really $k$ univariate equations implied by this likelihood equation, it is common to refer to the likelihood equations (plural), which are
\[
\frac{\partial}{\partial\theta_i}\, l(\theta) = 0 \quad\text{for } i = 1, \ldots, k.
\]
In the multiparameter case, we have essentially the same theorems as in the one-parameter case.

Theorem 7.19 Suppose that $X_1, \ldots, X_n$ are independent and identically distributed random variables (or vectors) with density $f_{\theta_0}(x)$ for $\theta_0$ in an open subset $\Omega$ of $R^k$, where distinct values of $\theta_0$ yield distinct distributions for $X_1$ (i.e., the parameter is identifiable). Furthermore, suppose that the support set $\{x : f_\theta(x) > 0\}$ does not depend on $\theta$. Then with probability approaching 1 as $n \to \infty$, there exists $\hat\theta$ such that $\nabla l(\hat\theta) = 0$ and $\hat\theta \stackrel{P}{\to} \theta_0$.

As in the one-parameter case, we shall refer to the $\hat\theta$ guaranteed by Theorem 7.19 as a consistent root of the likelihood equations. Unlike Theorem 7.4, however, Corollary 7.5 does not generalize to the multiparameter case, because it is possible that $\hat\theta$ is the unique solution of the likelihood equations and a local maximum but not the MLE. The best we can say is the following:

Corollary 7.20 Under the conditions of Theorem 7.19, if there is a unique root of the likelihood equations, then this root is consistent for $\theta_0$.
The asymptotic normality of a consistent root of the likelihood equations holds in the multiparameter case just as in the single-parameter case:

Theorem 7.21 Suppose that the conditions of Theorem 7.19 are satisfied and that $\hat\theta$ denotes a consistent root of the likelihood equations. Assume also that Equation (7.20) is satisfied for all $i$ and $j$ and that $I(\theta_0)$ is positive definite with finite entries. Finally, assume that $\partial^3 \log f_\theta(x)/\partial\theta_i\,\partial\theta_j\,\partial\theta_k$ exists and is bounded in a neighborhood of $\theta_0$ for all $i, j, k$. Then
\[
\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}.
\]
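In the normal model, where the consistent root has the closed form $(\bar X, \widehat{\sigma^2})$, the theorem can be illustrated by simulating the scaled estimation error for $\mu$ and comparing its variance with the $(1,1)$ entry of $I^{-1}(\theta_0)$, namely $\sigma^2$. A rough simulation sketch (function names, sample sizes, and seed are my choices):

```python
import random
import statistics

def mle_normal(xs):
    """Closed-form root of the likelihood equations for N(mu, sigma^2):
    mu_hat = sample mean, sigma2_hat = average squared deviation."""
    n = len(xs)
    mu = sum(xs) / n
    s2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, s2

def scaled_error_variance(mu, sigma, n=100, reps=4000, seed=3):
    """Sample variance of sqrt(n)*(mu_hat - mu) over many replications;
    the theorem predicts a value near sigma^2, the (1,1) entry of I^{-1}."""
    rng = random.Random(seed)
    errs = []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat, _ = mle_normal(xs)
        errs.append((n ** 0.5) * (mu_hat - mu))
    return statistics.variance(errs)
```

With $\mu = 0$ and $\sigma = 1$, the returned variance should be close to $1$.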

As in the one-parameter case, there is some terminology associated with the derivatives of the loglikelihood function. The gradient vector $\nabla l(\theta)$ is called the score vector. The negative second derivative $-\nabla^2 l(\theta)$ is called the observed information, and $nI(\theta)$ is sometimes called the expected information. And the second derivative of a real-valued function of a $k$-vector, such as the loglikelihood function $l(\theta)$, is called the Hessian matrix.

Newton's method (often called the Newton-Raphson method in the multivariate case) and scoring work just as they do in the one-parameter case. Starting from the point $\tilde\theta_n$, one step of Newton-Raphson gives
\[
\delta_n = \tilde\theta_n - \{\nabla^2 l(\tilde\theta_n)\}^{-1}\, \nabla l(\tilde\theta_n) \tag{7.23}
\]
and one step of scoring gives
\[
\delta_n = \tilde\theta_n + \frac{1}{n}\, I^{-1}(\tilde\theta_n)\, \nabla l(\tilde\theta_n). \tag{7.24}
\]

Theorem 7.22 Under the assumptions of Theorem 7.21, if $\tilde\theta_n$ is a $\sqrt{n}$-consistent estimator of $\theta_0$, then the one-step Newton-Raphson estimator $\delta_n$ defined in Equation (7.23) satisfies
\[
\sqrt{n}(\delta_n - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}
\]
and the one-step scoring estimator $\delta_n$ defined in Equation (7.24) satisfies
\[
\sqrt{n}(\delta_n - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}.
\]

As in the one-parameter case, we define an efficient estimator $\hat\theta_n$ as one that satisfies
\[
\sqrt{n}(\hat\theta_n - \theta_0) \xrightarrow{d} N_k\{0, I^{-1}(\theta_0)\}.
\]
This definition is justified by the fact that if $\delta_n$ is any estimator satisfying $\sqrt{n}(\delta_n - \theta) \xrightarrow{d} N_k\{0, \Sigma(\theta)\}$, where $I^{-1}(\theta)$ and $\Sigma(\theta)$ are continuous, then $\Sigma(\theta) - I^{-1}(\theta)$ is nonnegative definite for all $\theta$. This is analogous to saying that $\sigma^2(\theta) - I^{-1}(\theta)$ is nonnegative in the univariate case, which is to say that $I^{-1}(\theta)$ is the smallest possible variance.

Finally, we note that Bayes estimators are efficient, just as in the one-parameter case. This means that the same three types of estimators that are efficient in the one-parameter case (the consistent root of the likelihood equations, the one-step scoring and Newton-Raphson estimators, and Bayes estimators) are also efficient in the multiparameter case.
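In the scalar special case, one scoring step as in Equation (7.24) is simple to implement. The sketch below (names are mine) uses the Cauchy location family, for which $I(\theta) = 1/2$, so $nI(\theta) = n/2$; the starting point is the sample median, a $\sqrt{n}$-consistent estimator.

```python
import statistics

def cauchy_score(theta, xs):
    """Score l'(theta) for an iid sample from the Cauchy location family
    f_theta(x) = 1 / (pi * (1 + (x - theta)^2))."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)

def one_step_scoring(xs):
    """One scoring step from the sample median: since I(theta) = 1/2 here,
    delta_n = median + (2/n) * l'(median)."""
    n = len(xs)
    theta0 = statistics.median(xs)
    return theta0 + (2.0 / n) * cauchy_score(theta0, xs)
```

When the sample is symmetric about its median, the score vanishes there and the one-step estimator simply returns the median.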

Exercises for Section 7.4

Exercise 7.13 Let $X \sim$ multinomial$(1, p)$, where $p$ is a $k$-vector for $k > 2$. Let $p' = (p_1, \ldots, p_{k-1})$. Find $I(p')$.

Exercise 7.14 Suppose that $\theta \in R \times R^+$ (that is, $\theta_1 \in R$ and $\theta_2 \in (0, \infty)$) and
\[
f_\theta(x) = \frac{1}{\theta_2}\, f\!\left(\frac{x - \theta_1}{\theta_2}\right)
\]
for some continuous, differentiable density $f(x)$ that is symmetric about the origin. Find $I(\theta)$.

Exercise 7.15 Prove Theorem 7.21. Hint: Use Theorem 1.40.

Exercise 7.16 Prove Theorem 7.22. Hint: Use Theorem 1.40.

Exercise 7.17 The multivariate generalization of a beta distribution is a Dirichlet distribution, which is the natural prior distribution for a multinomial likelihood. If $p$ is a random $(k+1)$-vector constrained so that $p_i > 0$ for all $i$ and $\sum_{i=1}^{k+1} p_i = 1$, then $(p_1, \ldots, p_k)$ has a Dirichlet distribution with parameters $a_1 > 0, \ldots, a_{k+1} > 0$ if its density is proportional to
\[
p_1^{a_1 - 1} \cdots p_k^{a_k - 1} (1 - p_1 - \cdots - p_k)^{a_{k+1} - 1}\,
I\left\{\min_i p_i > 0\right\} I\left\{\sum_{i=1}^k p_i < 1\right\}.
\]
Prove that if $G_1, \ldots, G_{k+1}$ are independent random variables with $G_i \sim$ gamma$(a_i, 1)$, then
\[
\frac{(G_1, \ldots, G_k)}{G_1 + \cdots + G_{k+1}}
\]
has a Dirichlet distribution with parameters $a_1, \ldots, a_{k+1}$.

7.5 Nuisance parameters

This section is the one intrinsically multivariate section in this chapter; it does not have an analogue in the one-parameter setting. Here we consider how the efficiency of an estimator is affected by the presence of nuisance parameters.

Suppose $\theta$ is the parameter vector but $\theta_1$ is the only parameter of interest, so that $\theta_2, \ldots, \theta_k$ are nuisance parameters. We are interested in how the asymptotic precision with which we may estimate $\theta_1$ is influenced by the presence of the nuisance parameters. In other words, if $\hat\theta$ is efficient for $\theta$, then how does $\hat\theta_1$ as an estimator of $\theta_1$ compare to an efficient estimator of $\theta_1$, say $\theta_1^*$, under the assumption that all of the nuisance parameters are known?

Assume $I(\theta)$ is positive definite. Let $\sigma_{ij}$ denote the $(i, j)$ entry of $I(\theta)$ and let $\gamma_{ij}$ denote the $(i, j)$ entry of $I^{-1}(\theta)$. If all of the nuisance parameters are known, then $I(\theta_1) = \sigma_{11}$, which means that the asymptotic variance of $\sqrt{n}(\theta_1^* - \theta_1)$ is $1/\sigma_{11}$. On the other hand, if the nuisance parameters are not known, then the asymptotic covariance matrix of $\sqrt{n}(\hat\theta - \theta)$ is $I^{-1}(\theta)$, which means that the marginal asymptotic variance of $\sqrt{n}(\hat\theta_1 - \theta_1)$ is $\gamma_{11}$. Of interest here is the comparison between $\gamma_{11}$ and $1/\sigma_{11}$. The following theorem may be interpreted to mean that the presence of nuisance parameters always increases the variance of an efficient estimator.

Theorem 7.23 $\gamma_{11} \ge 1/\sigma_{11}$, with equality if and only if $\gamma_{12} = \cdots = \gamma_{1k} = 0$.

Proof: Partition $I(\theta)$ as follows:
\[
I(\theta) = \begin{pmatrix} \sigma_{11} & \rho^T \\ \rho & \Sigma \end{pmatrix},
\]
where $\rho$ and $\Sigma$ are $(k-1) \times 1$ and $(k-1) \times (k-1)$, respectively. Let $\tau = \sigma_{11} - \rho^T \Sigma^{-1} \rho$. We may verify that if $\tau > 0$, then
\[
I^{-1}(\theta) = \frac{1}{\tau}\begin{pmatrix} 1 & -\rho^T \Sigma^{-1} \\ -\Sigma^{-1}\rho & \Sigma^{-1}\rho\rho^T\Sigma^{-1} + \tau\Sigma^{-1} \end{pmatrix}.
\]
This proves the result, because the positive definiteness of $I(\theta)$ implies that $\Sigma^{-1}$ is positive definite, which means that
\[
\gamma_{11} = \frac{1}{\sigma_{11} - \rho^T \Sigma^{-1} \rho} \ge \frac{1}{\sigma_{11}},
\]
with equality if and only if $\rho = 0$. Thus, it remains only to show that $\tau > 0$. But this is immediate from the positive definiteness of $I(\theta)$, since if we set
\[
v = \begin{pmatrix} 1 \\ -\Sigma^{-1}\rho \end{pmatrix},
\]
then $\tau = v^T I(\theta) v$.

The above result shows that it is important to take nuisance parameters into account in estimation. However, it is not necessary to estimate the entire parameter vector all at once,

since $(\hat\theta_1, \ldots, \hat\theta_k)$ is efficient for $\theta$ if and only if each of the $\hat\theta_i$ is efficient for $\theta_i$ in the presence of the other nuisance parameters (see Exercise 7.18).

Exercises for Section 7.5

Exercise 7.18 Letting $\gamma_{ij}$ denote the $(i, j)$ entry of $I^{-1}(\theta)$, we say that $\hat\theta_i$ is efficient for $\theta_i$ in the presence of the nuisance parameters $\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k$ if the asymptotic variance of $\sqrt{n}(\hat\theta_i - \theta_i)$ is $\gamma_{ii}$. Prove that $(\hat\theta_1, \ldots, \hat\theta_k)$ is efficient for $\theta$ if and only if for all $i$, the estimator $\hat\theta_i$ is efficient for $\theta_i$ in the presence of the nuisance parameters $\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_k$.
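For $k = 2$, the partitioned-inverse argument in the proof of Theorem 7.23 reduces to a one-line computation: $\gamma_{11} = 1/\tau$ with $\tau = \sigma_{11} - \rho^2/\sigma_{22}$, which exceeds $1/\sigma_{11}$ whenever $\rho \ne 0$. A minimal numerical sketch (the function name is mine):

```python
def gamma11_2x2(s11, rho, s22):
    """(1,1) entry of the inverse of the 2x2 information [[s11, rho], [rho, s22]],
    computed via the Schur complement tau = s11 - rho^2/s22 as in the proof
    of Theorem 7.23. Assumes the matrix is positive definite."""
    tau = s11 - rho * rho / s22
    return 1.0 / tau
```

For example, with $\sigma_{11} = 2$, $\rho = 1$, $\sigma_{22} = 4$ we get $\gamma_{11} = 4/7 > 1/2 = 1/\sigma_{11}$, and setting $\rho = 0$ recovers the equality case $\gamma_{11} = 1/\sigma_{11}$.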