Chapter 3 Point Estimation

Let (Ω, A, P_θ), P_θ ∈ P = {P_θ : θ ∈ Θ}, be a probability space, let X_1, X_2, ..., X_n : (Ω, A) → (ℝ^k, B^k) be random variables with sample space (X, B_X), and let γ : Θ → ℝ^k be a measurable function, i.e. γ : (Θ, B_Θ) → (γ(Θ), B_γ).

3.1 Introduction

Def 3.1.1 An estimator T is a measurable function T : (X, B_X) → (γ(Θ), B_γ).

Of course, it is hoped that T(X) will tend to be close to the unknown estimand γ(θ), but this requirement is not part of the formal definition of an estimator. Desirable properties of an estimator are:

- Unbiasedness
- Consistency (strong, weak, in r-th mean)
- Sufficiency
- Asymptotic normality
- Minimal sufficiency, completeness, invariance, ...
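For instance, if X_1, ..., X_n are i.i.d. with E_θ(X_1) = θ and Var_θ(X_1) = σ² < ∞, the sample mean T(X) = X̄_n = (1/n) Σ_{i=1}^n X_i is an estimator of γ(θ) = θ that already illustrates several of these properties: it is unbiased, strongly consistent (X̄_n → θ a.s. by the SLLN) and asymptotically normal (√n (X̄_n − θ) → N(0, σ²) in distribution).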

In the sequel we are interested in unbiased estimators, and we shall learn about a further statistical criterion: efficiency.

Def 3.1.2 Let γ : Θ → ℝ^m be measurable.

(a) A statistic T : (X, B_X) → (ℝ^m, B^m) is called unbiased if E_θ(T) = γ(θ) for all θ ∈ Θ.

(b) Each function γ on Θ for which there exists an unbiased estimator is called an estimable function.

(c) For a biased estimator, b(γ(θ), T) := E_θ(T) − γ(θ) is called the bias.

(d) A sequence of estimators (T_n) is called asymptotically unbiased for γ(θ) if lim_{n→∞} b(γ(θ), T_n) = 0.

Def 3.1.3 An estimator T is called median unbiased for γ(θ) if med_θ(T) = γ(θ) for all θ ∈ Θ.

Note that if T is unbiased for γ(θ), then in general g(T) is a biased estimator of g(γ(θ)), unless g is linear. Unbiased estimators do not always exist, and unbiased estimators are not always reasonable.
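As a quick illustration of these notions (a minimal simulation sketch, assuming a normal model; the parameter values and sample size are arbitrary choices): the variance estimator with divisor n is biased but asymptotically unbiased, the one with divisor n − 1 is unbiased, and the nonlinear transform g(T) = √T of the unbiased variance estimator is again (slightly) biased for σ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1, keepdims=True)

var_n  = ((x - xbar) ** 2).sum(axis=1) / n        # divisor n: biased for sigma^2
var_n1 = ((x - xbar) ** 2).sum(axis=1) / (n - 1)  # divisor n-1: unbiased for sigma^2
sd_n1  = np.sqrt(var_n1)                          # nonlinear g(T): biased for sigma

print("E[var_n]  ~", var_n.mean(),  "   true sigma^2 =", sigma**2)
print("E[var_n1] ~", var_n1.mean(), "   (unbiased)")
print("E[sd_n1]  ~", sd_n1.mean(),  "   true sigma   =", sigma)
```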

3.2 Minimum Variance Unbiased Estimators

In the sequel the case Θ ⊂ ℝ is considered.

Def 3.2.1 Let 𝒯 be the set of all unbiased estimators T of θ with E_θ(T²) < ∞ for all θ ∈ Θ, and let 𝒯_{θ_0} be the set of all unbiased estimators T of θ_0 with E_{θ_0}(T²) < ∞.

(a) T_0 ∈ 𝒯_{θ_0} is called a locally minimum variance unbiased estimator (LMVUE) at θ_0 if

  E_{θ_0}[(T_0 − θ_0)²] ≤ E_{θ_0}[(T − θ_0)²]  for all T ∈ 𝒯_{θ_0}.

(b) T* ∈ 𝒯 is called a uniformly minimum variance unbiased estimator (UMVUE) if

  E_θ[(T* − θ)²] ≤ E_θ[(T − θ)²]  for all T ∈ 𝒯 and all θ ∈ Θ.

Other names are: (locally) best unbiased estimator and, in the case of a linear estimator, (locally) best linear unbiased estimator (BLUE).
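For example, in the model X_1, ..., X_n i.i.d. N(θ, 1), θ ∈ Θ = ℝ, both X_1 and X̄_n belong to 𝒯, but Var_θ(X_1) = 1 > 1/n = Var_θ(X̄_n), so X_1 cannot be UMVUE for n ≥ 2; that X̄_n actually is the UMVUE will follow from Theorem 3.2.4 below.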

Theorem 3.2.1 Let 𝒯 be as in Def. 3.2.1 with 𝒯 ≠ ∅, and let 𝒯(0) be the set of all unbiased estimators of zero, i.e.

  𝒯(0) = {T_0 : E_θ(T_0) = 0, E_θ(T_0²) < ∞ for all θ ∈ Θ}.

Then T* ∈ 𝒯 is UMVUE if and only if E_θ(T_0 T*) = 0 for all θ ∈ Θ and all T_0 ∈ 𝒯(0).

Proof: According to the above assumptions, E_θ[T_0 T*] exists for all θ ∈ Θ and T_0 ∈ 𝒯(0).

Necessity: Suppose T* ∈ 𝒯 is UMVUE and there exist θ_0 ∈ Θ and T_0 ∈ 𝒯(0) such that E_{θ_0}[T_0 T*] ≠ 0. Then T* + λT_0 ∈ 𝒯 for all λ ∈ ℝ. In case E_{θ_0}[T_0²] = 0 we would have E_{θ_0}[T_0 T*] = 0 (Schwarz inequality); let hence E_{θ_0}[T_0²] > 0 and choose λ_0 = −E_{θ_0}[T_0 T*]/E_{θ_0}[T_0²]. Then for T* + λ_0 T_0 = T* − T_0 E_{θ_0}[T_0 T*]/E_{θ_0}[T_0²] it holds that

  E_{θ_0}[(T* + λ_0 T_0)²] = E_{θ_0}[(T*)²] − E²_{θ_0}[T_0 T*]/E_{θ_0}[T_0²] < E_{θ_0}[(T*)²],

i.e. Var_{θ_0}[T* + λ_0 T_0] < Var_{θ_0}[T*] (contradiction!).

Sufficiency: Suppose E_θ[T_0 T*] = 0 holds for some T* ∈ 𝒯, all θ ∈ Θ and all T_0 ∈ 𝒯(0), and let T ∈ 𝒯. Then T − T* ∈ 𝒯(0), and from the above condition it follows that E_θ[T*(T − T*)] = 0 for all θ ∈ Θ, which entails

  E_θ[(T*)²] = E_θ[T* T] ≤ E_θ[(T*)²]^{1/2} E_θ[T²]^{1/2}.

For E_θ[(T*)²] = 0 there is nothing to prove. For E_θ[(T*)²] > 0 it follows that E_θ[(T*)²] ≤ E_θ(T²) for all θ ∈ Θ, hence Var_θ[T*] ≤ Var_θ[T] for all θ ∈ Θ and all T ∈ 𝒯. □

Theorem 3.2.2 Let 𝒯 ≠ ∅. Then there exists at most one UMVUE.

Proof: Let T* and T̃ both be UMVUEs. Then T* − T̃ ∈ 𝒯(0), hence E_θ[T*(T* − T̃)] = 0, i.e. E_θ[T* T̃] = E_θ[(T*)²], i.e. Cov_θ(T*, T̃) = Var_θ(T*) = Var_θ(T̃), from which Corr_θ(T*, T̃) = 1 follows for all θ ∈ Θ. Therefore there exist a, b ∈ ℝ with P_θ(a T* + b T̃ = 0) = 1 for all θ ∈ Θ. Since E_θ(a T* + b T̃) = (a + b)θ for all θ, it follows that P_θ(T* = T̃) = 1 for all θ ∈ Θ. □

Theorem 3.2.3 (Rao-Blackwell) Let P = {P_θ : θ ∈ Θ}, T ∈ 𝒯, and let S be sufficient for P. Then

(a) E[T | S] does not depend on θ and is an unbiased estimator of θ, and

(b) E_θ[(E(T | S) − θ)²] ≤ E_θ[(T − θ)²] for all θ ∈ Θ. Equality holds if and only if P_θ(T = E(T | S)) = 1 for all θ ∈ Θ.

Proof: The independence from θ follows from the fact that, by sufficiency, the conditional distributions P^{X | S = s} do not depend on θ; the unbiasedness follows from E_θ[E(T | S)] = E_θ[T] = θ. Therefore it suffices to show that E_θ[E(T | S)²] ≤ E_θ[T²] for all θ ∈ Θ. Now E_θ[T²] = E_θ[E(T² | S)]. Hence we have to show that E(T | S)² ≤ E(T² | S) holds P_θ-a.e. for all θ ∈ Θ. But this follows from the Schwarz inequality applied to the conditional expectation E[· | S].

Equality holds in (b) if and only if E_θ[E(T | S)²] = E_θ(T²), i.e.

  E_θ[E(T² | S) − E(T | S)²] = 0, which is equivalent to E_θ[Var(T | S)] = 0,

i.e. E(T² | S) = E(T | S)² P_θ-a.e., i.e. T = E(T | S) P_θ-a.e. for all θ ∈ Θ. □

Theorem 3.2.4 (Lehmann-Scheffé) If S is a complete sufficient statistic and if T ∈ 𝒯, then there exists a UMVUE, and it is given by E(T | S).

Proof: For T_1, T_2 ∈ 𝒯, E_θ[E(T_1 | S) − E(T_2 | S)] = 0 holds for all θ ∈ Θ. Since S is complete, E[T_1 | S] = E[T_2 | S] holds P_θ-a.e.; thus all T ∈ 𝒯 lead to the same estimator E(T | S), which by Theorem 3.2.3 is unbiased with variance not exceeding that of any T ∈ 𝒯, hence it is the UMVUE. □

Remark: (a) According to the Rao-Blackwell theorem one should look for unbiased functions of a sufficient statistic. If this sufficient statistic is complete, then this function is the UMVUE. (b) UMVUEs may exist even if there does not exist a complete sufficient statistic.
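A classical illustration of Theorems 3.2.3 and 3.2.4 (a simulation sketch; the Poisson model and the values λ = 1.5, n = 8 are arbitrary choices for illustration): for X_1, ..., X_n i.i.d. Poisson(λ), S = Σ X_i is complete and sufficient, the indicator 1{X_1 = 0} is unbiased for γ(λ) = e^{−λ} = P_λ(X_1 = 0), and conditioning gives E[1{X_1 = 0} | S] = ((n − 1)/n)^S, which is therefore the UMVUE. Both estimators are unbiased, but the conditioned one has much smaller variance.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 1.5, 8, 200_000
target = np.exp(-lam)                      # estimand gamma(lambda) = P(X = 0)

x = rng.poisson(lam, size=(reps, n))
s = x.sum(axis=1)                          # complete sufficient statistic

naive = (x[:, 0] == 0).astype(float)       # unbiased but crude: 1{X_1 = 0}
rb    = ((n - 1) / n) ** s                 # E[naive | S] = ((n-1)/n)^S  (UMVUE)

for name, est in [("naive", naive), ("Rao-Blackwellized", rb)]:
    print(f"{name:>18}: mean {est.mean():.4f} (target {target:.4f}), var {est.var():.5f}")
```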

Theorem 3.2.5 (Cramér-Rao-Fréchet) Let P = {P_θ : θ ∈ Θ} have µ-densities f(x; θ) (µ the counting measure or µ the Lebesgue measure λ), let Θ be an open interval in ℝ, and let the set {x : f(x; θ) = 0} be independent of θ ∈ Θ. For every θ let ∂f(x; θ)/∂θ be defined. Suppose that

(i) ∫ (∂f(x; θ)/∂θ) dµ = (d/dθ) ∫ f(x; θ) dµ = 0 for all θ ∈ Θ;

(ii) γ : Θ → ℝ is differentiable on Θ, T is an unbiased estimator for γ(θ) with E_θ(T²) < ∞ for all θ ∈ Θ, and

  (d/dθ) ∫ T(x) f(x; θ) µ(dx) = ∫ T(x) (∂f(x; θ)/∂θ) µ(dx) for all θ ∈ Θ.

Then

(a) [γ′(θ)]² ≤ E_θ[(T − γ(θ))²] · E_θ[(∂ log f(X; θ)/∂θ)²] for all θ ∈ Θ.

For any θ_0 ∈ Θ, either γ′(θ_0) = 0 and equality holds in (a) for θ = θ_0, or

(b) Var_{θ_0}(T) = E_{θ_0}[(T − γ(θ_0))²] ≥ [γ′(θ_0)]² / E_{θ_0}[(∂ log f(X; θ_0)/∂θ)²].

If, in the latter case, equality holds in (b) and if T is not constant, then there exists a real number K_{θ_0} ≠ 0 such that

(c) T(x) − γ(θ_0) = K_{θ_0} · ∂ log f(x; θ_0)/∂θ µ-a.e.

Remarks: The function ∂ log f(x; θ_0)/∂θ is also called the score function, and

  I(θ) := E_θ[(∂ log f(X; θ)/∂θ)²] = Var_θ(∂ log f(X; θ)/∂θ)

is called the Fisher information. For γ(θ) = θ we have γ′(θ) = 1, of course.

Proof: Differentiating both sides of ∫ f(x; θ) µ(dx) = 1 leads (with (i)) to ∫ (∂f(x; θ)/∂θ) µ(dx) = 0, or, on {f > 0},

  ∫_{f>0} [(∂f(x; θ)/∂θ) / f(x; θ)] f(x; θ) µ(dx) = ∫_{f>0} (∂ log f(x; θ)/∂θ) f(x; θ) µ(dx) = 0,

leading to E_θ[∂ log f(X; θ)/∂θ] = 0.

According to assumption (ii) we have γ(θ) = ∫ T(x) f(x; θ) µ(dx), hence

  γ′(θ) = E_θ[T(X) · ∂ log f(X; θ)/∂θ],

which entails

  E_θ[(T(X) − γ(θ)) · ∂ log f(X; θ)/∂θ] = γ′(θ),

and (a) follows from the Schwarz inequality.

For (b) it is sufficient to consider either the case γ′(θ_0) ≠ 0 or the case where the strict inequality "<" holds in (a) at θ_0. In both cases the Fisher information satisfies I(θ_0) > 0, which entails (b).

If the equality sign holds in (b), then γ′(θ_0) ≠ 0 must hold. Then, by the equality case of the Schwarz inequality, there exists a real number K_{θ_0} such that

  T(x) − γ(θ_0) = K_{θ_0} · ∂ log f(x; θ_0)/∂θ

holds µ-a.e. □
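For instance, in the Bernoulli model (a standard check of the bound, assuming the stated regularity conditions, which hold here): for X_1, ..., X_n i.i.d. Bernoulli(θ), 0 < θ < 1, the score of a single observation is ∂ log f(x; θ)/∂θ = x/θ − (1 − x)/(1 − θ), hence I_1(θ) = 1/(θ(1 − θ)) and, for the whole sample, I_n(θ) = n/(θ(1 − θ)). With γ(θ) = θ the Cramér-Rao bound equals θ(1 − θ)/n = Var_θ(X̄_n), so the sample mean attains the bound.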

For the vector case let Θ ⊂ ℝ^p and let γ(Θ) be a convex subset of ℝ^k. Then ∂f(x; θ)/∂θ is a p-vector,

  I(θ) = E_θ[(∂ log f(X; θ)/∂θ)(∂ log f(X; θ)/∂θ)ᵀ]

is a p × p matrix, ∂γ(θ)/∂θ is a k × p matrix, and

  Var_θ(T) = E_θ[(T(X) − γ(θ))(T(X) − γ(θ))ᵀ]

is a k × k matrix. With the corresponding regularity conditions of Theorem 3.2.5 one can show the corresponding inequality

(d) Var_θ(T) ≥ (∂γ(θ)/∂θ) I(θ)^{−1} (∂γ(θ)/∂θ)ᵀ,

where "≥" is to be understood in the sense that the difference between the left- and the right-hand side is a positive semidefinite matrix. For a proof of Theorem 3.2.5 in the multiparameter case we refer to Lehmann/Casella (2001), pp. 124-125.

Theorem 3.2.6 In the above case let p = k, assume that the k × k matrix ∂γ(θ)/∂θ is regular for all θ ∈ Θ, and let ∂f/∂θ be continuous in θ and x. Then in (d) the equality sign holds if and only if there are functions C(θ), Q_1(θ), ..., Q_k(θ) and H(x) such that

  dP_θ/dµ = f(x; θ) = C(θ) exp{ Σ_{j=1}^k Q_j(θ) T_j(x) } H(x)

and, with Q(θ) = (Q_1(θ), ..., Q_k(θ))ᵀ,

  γ(θ) = −[Q′(θ)]^{−1} ∂ ln C(θ)/∂θ,

where Q′(θ) denotes the k × k matrix with entries [Q′(θ)]_{ij} = ∂Q_j(θ)/∂θ_i.

Proof: 1. Let f and γ have the above form; we show that then the equality sign holds in the CR inequality. Writing C(θ) = exp{D(θ)}, we have

  ∂ log f(x; θ)/∂θ = Q′(θ) T(x) + D′(θ),

with D′(θ) = ∂D(θ)/∂θ, and since E_θ[∂ log f(X; θ)/∂θ] = 0,

  0 = E_θ[Q′(θ) T(X) + D′(θ)] = D′(θ) + Q′(θ) E_θ[T(X)],

we obtain, for det(Q′(θ)) ≠ 0 (which may be assumed),

  E_θ[T(X)] = −Q′(θ)^{−1} D′(θ) = γ(θ).

Hence the estimator γ̂ = T(X) is unbiased for γ(θ). Since

  T(X) − γ(θ) = T(X) + Q′(θ)^{−1} D′(θ),

we get, by putting K_θ := Q′(θ)^{−1}, that

  K_θ · ∂ log f(X; θ)/∂θ = T(X) + Q′(θ)^{−1} D′(θ) = T(X) − γ(θ),

i.e. the equality sign holds in the CR inequality.

2. From the CR equality the above representation of f and γ follows. If equality holds, then there exists a regular (k × k)-matrix K_θ such that

  T(X) − γ(θ) = K_θ · ∂ log f(X; θ)/∂θ µ-a.e.,

or, equivalently,

  K_θ^{−1} [T(X) − γ(θ)] = ∂ log f(X; θ)/∂θ.

We integrate both sides with respect to θ, where we put D(θ) := −∫ K_θ^{−1} γ(θ) dθ and Q(θ) := ∫ K_θ^{−1} dθ. Introducing an integration "constant" S(x), which in general depends on x, leads to

  ln f(x; θ) = Q(θ) T(x) + D(θ) + S(x),

and with C(θ) := exp{D(θ)} and H(x) := exp{S(x)}, f and γ have the claimed form, with γ̂(θ) = T(X). □

Corollary 3.2.7 If, under the regularity conditions of Theorem 3.2.6, T is an unbiased estimator for γ(θ) which attains the Cramér-Rao lower bound, then T is minimal sufficient and complete.

An unbiased estimator which attains the CR bound is called an efficient estimator. In the scalar case the ratio e(T, θ) between the CR bound and Var_θ(T) is called the efficiency of the estimator T; obviously 0 ≤ e(T, θ) ≤ 1. When comparing two unbiased estimators T_1 and T_2,

  e_θ(T_1 | T_2) := Var_θ(T_2) / Var_θ(T_1)

is called the relative efficiency of T_1 with respect to T_2. For sequences of estimators, lim_{n→∞} e_θ(T_n) is called the asymptotic efficiency and lim_{n→∞} e_θ(T_{1,n} | T_{2,n}) the asymptotic relative efficiency.
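To make these efficiency notions concrete (a simulation sketch; the normal model and the chosen sample size are arbitrary): under N(θ, σ²) the sample mean attains the CR bound σ²/n, while the sample median has asymptotic variance πσ²/(2n), so its asymptotic efficiency relative to the mean is 2/π ≈ 0.64.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, reps = 0.0, 1.0, 101, 100_000

x = rng.normal(theta, sigma, size=(reps, n))
mean_est   = x.mean(axis=1)
median_est = np.median(x, axis=1)

cr_bound = sigma**2 / n
print("CR bound            :", cr_bound)
print("Var(sample mean)    :", mean_est.var())
print("Var(sample median)  :", median_est.var())
print("e(median | mean)    :", mean_est.var() / median_est.var())   # approx 2/pi
```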

3.3 Method of Moments

Let P = {P_θ : θ ∈ Θ} and γ : Θ → ℝ^k. In many cases the estimand γ(θ) can be written as a function of the moments of P_θ,

  γ(θ) = g(µ_1, ..., µ_k).

In order to estimate γ(θ), one may then try to replace the unknown moments µ_j, j = 1, ..., k, by the corresponding sample moments.

Let T be any statistic with existing expectation µ_T(θ) := E_θ(T(X)) for all θ ∈ Θ. Then the SLLN (Khinchine) entails

  T̄_n := (T(X_1) + T(X_2) + ... + T(X_n))/n → µ_T(θ) a.s.

If Θ ⊂ ℝ^k and T = (T_1, ..., T_k) is a statistic with existing expectation µ_T(θ) = (µ_{T_1}(θ), ..., µ_{T_k}(θ)), then one can try to find an estimator θ̂_n = (θ̂_{1,n}, ..., θ̂_{k,n}) as a solution of the system of equations

  µ_{T_1}(θ̂_{1,n}, ..., θ̂_{k,n}) = (T_1(X_1) + ... + T_1(X_n))/n =: T̄_{1,n}
  ...
  µ_{T_k}(θ̂_{1,n}, ..., θ̂_{k,n}) = (T_k(X_1) + ... + T_k(X_n))/n =: T̄_{k,n}.

Under regularity conditions we then have (SLLN)

  γ̂(θ̂_n) = g(µ̂_{T_1}, ..., µ̂_{T_k}) → g(µ_{T_1}, ..., µ_{T_k}) = γ(θ) a.s.

If the moments up to order 2k exist, then the asymptotic normality of γ̂(θ̂_n) can be proved by means of the Lindeberg-Lévy central limit theorem.

Remark: In general, method of moments estimators are not unique; moreover, they are in general not functions of sufficient statistics, so they cannot be efficient either.
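As a sketch of the procedure (assuming a Gamma model with shape a and scale b, an arbitrary choice for illustration): from E_θ(X) = ab and Var_θ(X) = ab² the moment equations give â = X̄²/σ̂_n² and b̂ = σ̂_n²/X̄, where σ̂_n² is the second central sample moment.

```python
import numpy as np

rng = np.random.default_rng(3)
shape_true, scale_true, n = 2.0, 3.0, 5_000

x = rng.gamma(shape_true, scale_true, size=n)

m1 = x.mean()                   # first sample moment
v  = ((x - m1) ** 2).mean()     # second central sample moment

# Solve E(X) = a*b and Var(X) = a*b^2 for (a, b):
shape_mm = m1**2 / v
scale_mm = v / m1
print("method-of-moments estimates (shape, scale):", shape_mm, scale_mm)
```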

3.4 Maximum Likelihood Estimation

Def 3.4.1 A solution θ̂ of

  sup_{θ ∈ Θ} L(θ; x)   (3.2)

is called a maximum likelihood estimator (MLE) for θ.

With the ML principle one tries to find the mode of the underlying distribution. Since the mode is often a worse estimator of location than the mean or the median, ML estimators often have poor small-sample properties. In practice it is often simpler to work with the log-likelihood function l than with L. If the µ-density f(x; θ) is positive µ-a.e., if Θ ⊂ ℝ^k is an open set and if (∂/∂θ) f(x; θ) exists on Θ, then a solution of (3.2) fulfills the likelihood equations

  (∂/∂θ) l(θ; x) = (∂/∂θ) log f(x; θ) = 0.   (3.3)

A solution of (3.3) is called an MLE in the weak sense; a solution of (3.2) is called a strict MLE.

Theorem 3.4.1 Let Θ ⊂ ℝ^k and Λ ⊂ ℝ^p be intervals, p ≤ k, and let γ : Θ → Λ be surjective. If θ̂ is an MLE for θ, then γ(θ̂) is an MLE for γ(θ).

Proof: For each λ ∈ Λ let Θ_λ := {θ ∈ Θ : γ(θ) = λ} and let M(λ; x) := sup_{θ ∈ Θ_λ} L(θ; x). Let θ̂ be an MLE for θ. Then θ̂ belongs to one of the sets Θ_λ, say to Θ_λ̂ with λ̂ = γ(θ̂), and it holds that

  M(λ̂; x) = sup_{θ ∈ Θ_λ̂} L(θ; x) ≥ L(θ̂; x),

and λ̂ maximizes M, since

  M(λ̂; x) ≤ sup_{λ ∈ Λ} M(λ; x) = sup_{θ ∈ Θ} L(θ; x) = L(θ̂; x). □
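A small numerical sketch (the normal model, the simulated data and the use of scipy.optimize are illustrative choices, not taken from the text): in the N(µ, σ²) model the likelihood equations have the closed-form solution µ̂ = X̄ and σ̂² = (1/n) Σ (X_i − X̄)², and by Theorem 3.4.1 the MLE of σ is √σ̂²; a direct numerical maximization of the log-likelihood should reproduce these values.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(loc=1.0, scale=2.0, size=500)

# Closed-form MLE in the N(mu, sigma^2) model:
mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()      # note divisor n, not n-1

# Numerical check: maximize the log-likelihood (minimize its negative).
def negloglik(par):
    mu, log_sigma = par                      # unconstrained parametrization
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
mu_num, sigma_num = res.x[0], np.exp(res.x[1])

print("closed form :", mu_hat, np.sqrt(sigma2_hat))   # sqrt(sigma2_hat) is the MLE of sigma (Thm 3.4.1)
print("numerical   :", mu_num, sigma_num)
```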

Theorem 3.4.2 Let S be a sufficient statistic for P = {P_θ : θ ∈ Θ} ≪ µ (µ σ-finite). If a unique MLE θ̂ exists, then it is a (measurable) function of S.

Proof: Since S is sufficient, there exists a factorization f(x; θ) = g(S(x); θ) h(x). Maximizing f with respect to θ is hence equivalent to maximizing g with respect to θ; g depends on x only through S, and hence so does θ̂. □

Remark: If the likelihood equations (3.3) exist and if there exists a sufficient statistic S, then the MLEs are given as solutions of

  (∂/∂θ) log g(S(x); θ) = 0.

Theorem 3.4.3 Suppose that the regularity conditions of the CR inequality are satisfied and that θ belongs to an open interval in ℝ^k. If T is an unbiased estimator for γ(θ) whose covariance matrix attains the CR lower bound, then the likelihood equations have the unique solution γ̂(θ) = T(X).

Proof: According to Theorem 3.2.5 (resp. its multivariate version, Theorem 3.2.6) there exists a regular matrix K_θ such that

  K_θ · (∂/∂θ) log f(x; θ) = T(x) − γ(θ) µ-a.e.

Since K_θ is regular, (∂/∂θ) log f(x; θ) = 0 holds if and only if γ(θ) = T(x); hence the likelihood equations have the unique solution γ̂(θ) = T(X). □

For large-sample considerations we introduce the following regularity conditions:

(A0) For θ ≠ θ′ we have f(·; θ) ≠ f(·; θ′) (identifiability).

(A1) The support of f(x; θ), i.e. the set A := {x : f(x; θ) > 0}, does not depend on θ ∈ Θ.

(A2) The sample observations X_1, ..., X_n are i.i.d. with density f(x; θ) with respect to some σ-finite measure µ.

(A3) The parameter space Θ contains an open set Θ_0, and the true parameter θ_0 is an interior point of Θ_0.

(A4) The density f(x; θ) is differentiable with respect to θ ∈ Θ_0 for µ-almost all x, with derivative ḟ(x; θ) := (∂/∂θ) f(x; θ).

Theorem 3.4.4 Let (A0)-(A2) hold. Then

  P_{θ_0}[L(θ_0; x) > L(θ; x)] → 1 for n → ∞ and for all θ ≠ θ_0.   (3.4)

Proof: The proof rests on Jensen's inequality, according to which for φ convex on an open interval I with P(X ∈ I) = 1 and E|X| < ∞ one has φ[E(X)] ≤ E[φ(X)]. The statement (3.4) is equivalent to

  P_{θ_0}[ (1/n) Σ_{i=1}^n log{f(X_i; θ)/f(X_i; θ_0)} < 0 ] → 1 for all θ ≠ θ_0.

According to the SLLN the left-hand side of the inner inequality converges a.s. to E_{θ_0}[log{f(X; θ)/f(X; θ_0)}]. Since log(·) is strictly concave (i.e. −log is strictly convex), Jensen's inequality together with (A0) yields

  E_{θ_0}[log{f(X; θ)/f(X; θ_0)}] < log{E_{θ_0}[f(X; θ)/f(X; θ_0)]},

where the right-hand side equals zero, since by (A1) the supports coincide and hence E_{θ_0}[f(X; θ)/f(X; θ_0)] = ∫_A f(x; θ) µ(dx) = 1. This entails (3.4). □

If, therefore, the density f is a smooth function of θ, then one may expect that the MLE for θ will lie close to θ_0.
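The statement of Theorem 3.4.4 can be made tangible by simulation (a sketch in the N(θ, 1) model; θ_0 = 0, θ = 0.5 and the sample sizes are arbitrary choices): the fraction of samples with l(θ_0; x) > l(θ; x) approaches 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, theta, reps = 0.0, 0.5, 20_000      # true value and a fixed alternative

def loglik(t, x):
    # log-likelihood of i.i.d. N(t, 1) observations, up to an additive constant
    return -0.5 * ((x - t) ** 2).sum(axis=-1)

for n in (5, 20, 100, 500):
    x = rng.normal(theta0, 1.0, size=(reps, n))
    frac = (loglik(theta0, x) > loglik(theta, x)).mean()
    print(f"n = {n:4d}:  P[l(theta0) > l(theta)] approx {frac:.3f}")
```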

Theorem 3.4.5 Let (A0)-(A4) hold. Then, with probability tending to 1, the likelihood equations

  (∂/∂θ) l(θ; x) = Σ_{j=1}^n ḟ(x_j; θ)/f(x_j; θ) = 0

have a solution θ̂_n with θ̂_n → θ_0 in probability for n → ∞.

Proof: Let δ > 0 be sufficiently small such that (according to (A3)) (θ_0 − δ, θ_0 + δ) ⊂ Θ_0, and let

  S_n := {x : l(θ_0; x) > l(θ_0 − δ; x) and l(θ_0; x) > l(θ_0 + δ; x)}.

According to Theorem 3.4.4, P_{θ_0}(S_n) → 1 for n → ∞. For each x ∈ S_n there is hence a θ̂_n with θ_0 − δ < θ̂_n < θ_0 + δ at which l(θ; x) takes a local maximum, and therefore l′(θ̂_n) = 0. This entails that for each small enough δ there exists a sequence θ̂_n = θ̂_n(δ) of solutions such that P_{θ_0}(|θ̂_n − θ_0| < δ) → 1 for n → ∞. It remains to show that such a sequence exists which does not depend on δ. Let θ_n* be the solution closest to θ_0. (It exists since, because of the continuity of l′(θ), the limit of a sequence of solutions is itself a solution.) Then it naturally holds that P_{θ_0}(|θ_n* − θ_0| < δ) → 1 for all δ > 0. □

Remark: If the solutions are not unique, then Theorem 3.4.5 above does not yet yield a consistent sequence of estimators: θ_0 is unknown, and the data do not tell you which root to choose.

In order to show asymptotic efficiency in the univariate case, further regularity conditions are needed:

(A5) Θ ⊂ ℝ is an open interval.

(A6) For x ∈ A the density f(x; θ) is three times continuously differentiable with respect to θ.

(A7) The integral ∫ f(x; θ) µ(dx) can be differentiated three times with respect to θ under the integral sign.

(A8) For the Fisher information, 0 < I(θ) < ∞ holds.

(A9) For every θ_0 ∈ Θ there exist a δ > 0 and a function M(x) (both may depend on θ_0) such that

  |∂³ log f(x; θ)/∂θ³| ≤ M(x) for all x ∈ A and θ_0 − δ < θ < θ_0 + δ, with E_{θ_0}[M(X)] < ∞.

Theorem 3.4.6 Let the conditions (A1), (A2), (A5)-(A9) hold. Then for each consistent sequence θ̂_n of solutions of the likelihood equations

  √n (θ̂_n − θ_0) → N(0, I(θ_0)^{−1}) in distribution.

Proof: A Taylor series expansion of l′ around θ_0, evaluated at θ̂_n, yields

  0 = l′(θ̂_n) = l′(θ_0) + (θ̂_n − θ_0) l″(θ_0) + ½ (θ̂_n − θ_0)² l‴(θ_n*),

where l(θ) = l(θ; x) = Σ_{i=1}^n log f(X_i; θ), primes denote derivatives with respect to θ, and θ_n* lies between θ_0 and θ̂_n.

Rearranging, we obtain

  (θ̂_n − θ_0) [l″(θ_0) + ½ (θ̂_n − θ_0) l‴(θ_n*)] = −l′(θ_0),

and, for the expression in brackets ≠ 0,

  √n (θ̂_n − θ_0) = n^{−1/2} l′(θ_0) / [ −n^{−1} l″(θ_0) − (1/(2n)) (θ̂_n − θ_0) l‴(θ_n*) ].   (3.5)

In Theorem 3.4.5 we have already shown that (θ̂_n − θ_0) converges to zero in probability for n → ∞. We will now show that

(1) n^{−1/2} l′(θ_0) converges weakly to N(0, I(θ_0)),

(2) −n^{−1} l″(θ_0) converges to I(θ_0) > 0 a.s. resp. in probability,

(3) n^{−1} l‴(θ_n*) is stochastically bounded.

(1): n^{−1/2} l′(θ_0) = √n · (1/n) Σ_{i=1}^n (∂/∂θ) log f(X_i; θ_0) =: √n B_n, where according to the SLLN B_n converges a.s. to

  B_0 = E_{θ_0}[(∂/∂θ) log f(X; θ_0)] = 0.

According to the CLT, √n [B_n − 0] converges in distribution to a normal distribution with expected value zero and variance

  E_{θ_0}[((∂/∂θ) log f(X; θ_0))²] = I(θ_0),

where I(θ_0) > 0 according to (A8).

(2): Since, with l = log f(x; θ), we have l′ = ḟ/f and l″ = f̈/f − (ḟ)²/f², it follows that

  n^{−1} l″(θ_0) = (1/n) Σ_{i=1}^n [ f̈(X_i; θ_0)/f(X_i; θ_0) − ḟ(X_i; θ_0)²/f²(X_i; θ_0) ].

According to the SLLN this term converges (a.s. and hence also in probability) to

  E_{θ_0}[ f̈(X; θ_0)/f(X; θ_0) − ḟ(X; θ_0)²/f²(X; θ_0) ] = ∫ f̈(x; θ_0) µ(dx) − E_{θ_0}[((∂/∂θ) log f(X; θ_0))²] = 0 − I(θ_0),

since ∫ f̈(x; θ_0) µ(dx) = (d²/dθ²) ∫ f(x; θ_0) µ(dx) = 0 by (A7). Hence −n^{−1} l″(θ_0) → I(θ_0) = E_{θ_0}[−(∂²/∂θ²) log f(X; θ_0)].

(3): Finally, n^{−1} l‴(θ_n*) = (1/n) Σ_{i=1}^n (∂³/∂θ³) log f(X_i; θ_n*), and with (A9) we get

  |n^{−1} l‴(θ_n*)| ≤ (1/n) [M(X_1) + ... + M(X_n)].

The right-hand side converges to E_{θ_0}[M(X)] < ∞ according to (A9) and the SLLN. Since (θ̂_n − θ_0) converges to zero in probability according to Theorem 3.4.5, the second term in the denominator of (3.5) converges to zero as well.

Putting (1) to (3) together, we have shown that √n (θ̂_n − θ_0) converges weakly to N(0, I(θ_0)^{−1}). □
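A simulation sketch of Theorem 3.4.6 (the exponential model with rate θ_0 = 2 and the sample size are arbitrary choices): for f(x; θ) = θ e^{−θx} the likelihood equation gives θ̂_n = 1/X̄_n and I(θ) = 1/θ², so √n (θ̂_n − θ_0) should be approximately N(0, θ_0²).

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 400, 50_000          # Exp(theta) with density theta*exp(-theta*x)

x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)            # MLE solves n/theta - sum(x) = 0

z = np.sqrt(n) * (theta_hat - theta0)
print("empirical variance of sqrt(n)*(MLE - theta0):", z.var())
print("I(theta0)^{-1} = theta0^2                   :", theta0**2)
```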

Remarks:

(1) A sequence of estimators which fulfils the conditions of Theorem 3.4.6 is called an efficient likelihood estimator.

(2) (A6) and (A7) entail, for all θ ∈ Θ_0,

(i) E[(∂/∂θ) log f(X; θ)] = 0 and

(ii) E[−(∂²/∂θ²) log f(X; θ)] = E[((∂/∂θ) log f(X; θ))²] = I(θ).

Corollary 3.4.7 Let the conditions of Theorem 3.4.6 hold. If the likelihood equations have a unique solution for all x and n, resp. if the probability of multiple roots goes to zero for n → ∞, then the MLE is asymptotically efficient.

Some final remarks:

(1) In general, the likelihood equations (3.3) cannot be solved explicitly. In this case the roots can only be found by numerical procedures (with the attendant problems of existence, uniqueness and convergence of the solutions for the algorithms used).

(2) MLEs need strong prerequisites (conditions). Under certain conditions consistency and asymptotic normality still hold even if the distributional assumptions do not exactly coincide with reality, but in this case asymptotic efficiency is lost: already small deviations between reality and model assumptions can lead to a considerable loss of efficiency.

(3) Consistency and asymptotic normality may hold even if some regularity conditions of the above theorems are violated.

For the multivariate case Θ ⊂ ℝ^k a result like Theorem 3.4.6 can be obtained in a similar way if the conditions (A5), ... are reformulated accordingly.
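To illustrate final remark (1) (a numerical sketch; the Cauchy location model is chosen here because its likelihood equation has no closed-form solution and may have multiple roots): starting Fisher scoring from the sample median, a consistent estimator, typically leads to the desired root, although convergence is not guaranteed in general.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, n = 1.0, 200
x = theta0 + rng.standard_cauchy(size=n)    # Cauchy location model: no closed-form MLE

def score(theta):
    u = x - theta
    return np.sum(2 * u / (1 + u**2))       # derivative of the log-likelihood

# Fisher scoring: theta <- theta + I_n(theta)^{-1} * score(theta), with I_n = n/2 for Cauchy.
theta = np.median(x)                        # consistent starting value (important with multiple roots)
for _ in range(25):
    theta = theta + score(theta) / (n / 2)

print("starting value (sample median)              :", np.median(x))
print("Fisher-scoring root of the likelihood eqn.  :", theta)
```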