Chapter 4: Asymptotic Properties of the MLE

Daniel O. Scharfstein, 09/19/13

Maximum Likelihood

Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider the asymptotic properties of the maximum likelihood estimator. In particular, we will study issues of consistency, asymptotic normality, and efficiency. Suppose that $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with common density $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta\}$. We assume that $\theta_0$ is identified in the sense that if $\theta \neq \theta_0$ and $\theta \in \Theta$, then $p(x; \theta) \neq p(x; \theta_0)$ with respect to the dominating measure $\mu$.

Maximum Likelihood

For fixed $\theta \in \Theta$, the joint density of $X_n$ is equal to the product of the individual densities, i.e.,
$$p(x_n; \theta) = \prod_{i=1}^n p(x_i; \theta)$$
If we think of $p(x_n; \theta)$ as a function of $\theta$ with $x_n$ held fixed, then we refer to the resulting function as the likelihood function, $L(\theta; x_n)$. The maximum likelihood estimate for observed $x_n$ is the value $\theta \in \Theta$ which maximizes $L(\theta; x_n)$, denoted $\hat{\theta}(x_n)$. Prior to observation, $x_n$ is unknown, so we consider the maximum likelihood estimator, MLE, to be the value $\theta \in \Theta$ which maximizes $L(\theta; X_n)$, denoted $\hat{\theta}(X_n)$. Equivalently, the MLE can be taken to be the maximizer of the standardized log-likelihood,
$$\frac{l(\theta; X_n)}{n} = \frac{\log L(\theta; X_n)}{n} = \frac{1}{n} \sum_{i=1}^n \log p(X_i; \theta) = \frac{1}{n} \sum_{i=1}^n l(\theta; X_i)$$
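To make these definitions concrete, here is a minimal numerical sketch (ours, not from the slides): we posit a hypothetical exponential model with rate $\theta$, simulate data, and maximize the standardized log-likelihood over a grid. The model choice, seed, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical running example: X_i ~ Exponential(rate theta_0), so that
# log p(x; theta) = log(theta) - theta * x.
theta0 = 2.0
n = 1_000
x = rng.exponential(scale=1.0 / theta0, size=n)

# Standardized log-likelihood Q(theta; X_n) = (1/n) sum_i log p(X_i; theta),
# which for this model reduces to log(theta) - theta * mean(x).
grid = np.linspace(0.01, 10.0, 100_000)   # a compact parameter set Theta
Q = np.log(grid) - grid * x.mean()
theta_hat = grid[np.argmax(Q)]

# Sanity check: for this model the MLE has the closed form 1 / mean(x).
print(theta_hat, 1.0 / x.mean())
```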

Maximum Likelihood

We will show that the MLE is

1. consistent: $\hat{\theta}(X_n) \xrightarrow{P(\theta_0)} \theta_0$
2. asymptotically normal: $\sqrt{n}(\hat{\theta}(X_n) - \theta_0) \xrightarrow{D(\theta_0)}$ a normal random variable
3. asymptotically efficient, i.e., if we want to estimate $\theta_0$ by any other estimator within a reasonable class, the MLE is the most precise.

In order to establish 1-3, we will have to provide some regularity conditions on the probability model and the class of estimators that will be considered.

Consistency

We first want to show that if we have a sample of i.i.d. data from a common distribution which belongs to a probability model, then under some regularity conditions on the form of the density, the sequence of estimators $\{\hat{\theta}(X_n)\}$ will converge in probability to $\theta_0$. So far, we have not discussed whether a maximum likelihood estimator exists or, if one does, whether it is unique. We will get to this, but first we start with a heuristic proof of consistency.

Heuristic Proof

The MLE is the value $\theta \in \Theta$ which maximizes $Q(\theta; X_n) = \frac{1}{n} \sum_{i=1}^n l(\theta; X_i)$. By the WLLN, we know that
$$Q(\theta; X_n) = \frac{1}{n} \sum_{i=1}^n l(\theta; X_i) \xrightarrow{P(\theta_0)} Q_0(\theta) = E_{\theta_0}[l(\theta; X)] = E_{\theta_0}[\log p(X; \theta)] = \int \{\log p(x; \theta)\}\, p(x; \theta_0)\, d\mu(x)$$
We expect that, on average, the log-likelihood will be close to the expected log-likelihood. Therefore, we expect that the maximum likelihood estimator will be close to the maximizer of the expected log-likelihood. We will show that the expected log-likelihood, $Q_0(\theta)$, is maximized at $\theta_0$ (i.e., the truth).
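A quick simulation illustrates this heuristic for the same hypothetical exponential running example (our illustration, not from the slides). For that model, $Q_0(\theta) = E_{\theta_0}[\log\theta - \theta X] = \log\theta - \theta/\theta_0$, which is maximized at $\theta_0$; both the sup-norm gap and the argmax behave as the heuristic predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 2.0

# Expected log-likelihood for the exponential(rate theta) model:
# Q_0(theta) = log(theta) - theta / theta0, maximized at theta0.
grid = np.linspace(0.2, 6.0, 400)
Q0 = np.log(grid) - grid / theta0

for n in (50, 500, 50_000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    Qn = np.log(grid) - grid * x.mean()          # Q(theta; X_n)
    print(n,
          np.max(np.abs(Qn - Q0)),               # sup-norm gap shrinks with n
          grid[np.argmax(Qn)])                   # argmax approaches theta0 = 2
```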

Lemma 4.1

Lemma 4.1: If $\theta_0$ is identified and $E_{\theta_0}[|\log p(X; \theta)|] < \infty$ for all $\theta \in \Theta$, then $Q_0(\theta)$ is uniquely maximized at $\theta = \theta_0$.

Proof: By Jensen's inequality, we know that for any strictly convex function $g(\cdot)$ and any non-degenerate random variable $Y$, $E[g(Y)] > g(E[Y])$. Take $g(y) = -\log(y)$ (identification ensures that the ratio $p(X; \theta)/p(X; \theta_0)$ is non-degenerate, so the strict inequality applies). So, for $\theta \neq \theta_0$,
$$E_{\theta_0}\left[-\log\left(\frac{p(X; \theta)}{p(X; \theta_0)}\right)\right] > -\log\left(E_{\theta_0}\left[\frac{p(X; \theta)}{p(X; \theta_0)}\right]\right)$$
Note that
$$E_{\theta_0}\left[\frac{p(X; \theta)}{p(X; \theta_0)}\right] = \int \frac{p(x; \theta)}{p(x; \theta_0)}\, p(x; \theta_0)\, d\mu(x) = \int p(x; \theta)\, d\mu(x) = 1$$
So, $E_{\theta_0}[-\log(p(X; \theta)/p(X; \theta_0))] > 0$, or
$$Q_0(\theta_0) = E_{\theta_0}[\log p(X; \theta_0)] > E_{\theta_0}[\log p(X; \theta)] = Q_0(\theta)$$
This inequality holds for all $\theta \neq \theta_0$.
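The quantity $E_{\theta_0}[\log(p(X; \theta_0)/p(X; \theta))]$ is the Kullback-Leibler divergence between the truth and $p(\cdot; \theta)$. A Monte Carlo check for the hypothetical exponential running example (again our illustration) shows it is positive away from $\theta_0$ and essentially zero at $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = 2.0
x = rng.exponential(scale=1.0 / theta0, size=200_000)

def logp(x, theta):
    # Exponential(rate theta) log-density: log(theta) - theta * x.
    return np.log(theta) - theta * x

# E_{theta0}[log p(X; theta0) - log p(X; theta)] > 0 for theta != theta0,
# and approximately 0 (up to Monte Carlo noise) at theta = theta0.
for theta in (0.5, 1.0, 2.0, 4.0):
    print(theta, np.mean(logp(x, theta0) - logp(x, theta)))
```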

Consistency

Under technical conditions for the limit of the maximum to be the maximum of the limit, $\hat{\theta}(X_n)$ should converge in probability to $\theta_0$. Sufficient conditions for the maximum of the limit to be the limit of the maximum are that the convergence is uniform and the parameter space is compact. If $Q(\theta; X_n)$ lies in the sleeve $[Q_0(\theta) - \epsilon, Q_0(\theta) + \epsilon]$ for all $\theta$, then $\hat{\theta}(X_n)$ must lie in $[\theta_L(\epsilon), \theta_U(\epsilon)]$, i.e., the MLE must be close to the maximizer of $Q_0(\theta)$, namely $\theta_0$. The estimator should then be consistent as long as $Q_0(\theta)$ is uniquely maximized at the true parameter value $\theta_0$.

Consistency

It is essential for consistency that the limit $Q_0(\theta)$ have a unique maximum at the true parameter value. If there are multiple maxima, then the above argument will only lead to the estimator being close to one of the maxima, which does not guarantee consistency (because one of the maxima will not be the true value of the parameter). Identification guarantees that $Q_0(\theta)$ has a unique maximum at $\theta_0$. The discussion so far only allows for a compact parameter space. In theory, compactness requires that one know bounds on the true parameter value, although this constraint is often ignored in practice. It is possible to drop this assumption if the function $Q(\theta; X_n)$ cannot rise too much as $\theta$ becomes unbounded. We will discuss this later.

Uniform Convergence in Probability

Definition 4.1: $Q(\theta; X_n)$ converges uniformly in probability to $Q_0(\theta)$ if
$$\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| \xrightarrow{P(\theta_0)} 0$$
More precisely, we have that for all $\epsilon > 0$,
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| > \epsilon\right] \to 0$$
Why isn't pointwise convergence enough? Uniform convergence guarantees that for almost all realizations, the paths in $\theta$ are in the sleeve. This ensures that the maximum is close to $\theta_0$. For pointwise convergence, we know that at each $\theta$, most of the realizations are in the sleeve, but there is no guarantee that for another value of $\theta$ the same set of realizations are in the sleeve. Thus, the maximum need not be near $\theta_0$.
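A deterministic toy example (ours, not from the slides) shows how pointwise convergence can fail to control the supremum: $f_n(t) = t^n$ on $[0, 1)$ converges pointwise to $0$, yet $\sup_t f_n(t) = 1$ for every $n$, so the "sleeve" never closes everywhere at once.

```python
import numpy as np

# f_n(t) = t**n on [0, 1): for each fixed t < 1, f_n(t) -> 0 (pointwise),
# but sup_{t in [0,1)} f_n(t) = 1 for every n, so convergence is not uniform.
t = 1.0 - np.logspace(-8, 0, 1_000)       # grid on [0, 1), packed near 1
for n in (1, 100, 10_000):
    print(n, 0.9**n, (t**n).max())        # a fixed point decays; the sup does not
```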

Consistency Theorem

Theorem 4.2: Suppose that $Q(\theta; X_n)$ is continuous in $\theta$ and there exists a function $Q_0(\theta)$ such that

1. $Q_0(\theta)$ is uniquely maximized at $\theta_0$,
2. $\Theta$ is compact,
3. $Q_0(\theta)$ is continuous in $\theta$,
4. $Q(\theta; X_n)$ converges uniformly in probability to $Q_0(\theta)$.

Then $\hat{\theta}(X_n)$, defined as the value of $\theta \in \Theta$ which for each $X_n = x_n$ maximizes the objective function $Q(\theta; X_n)$, satisfies $\hat{\theta}(X_n) \xrightarrow{P} \theta_0$.

Consistency Proof

Proof: We need to show that for all $\epsilon > 0$, $P_{\theta_0}[|\hat{\theta}(X_n) - \theta_0| < \epsilon] \to 1$ as $n \to \infty$. Another way of saying this is as follows. Define an $\epsilon$-neighborhood about $\theta_0$, i.e.,
$$\Theta(\epsilon) = \{\theta : |\theta - \theta_0| < \epsilon\}$$
We want to show that
$$P_{\theta_0}[\hat{\theta}(X_n) \in \Theta(\epsilon)] \to 1 \text{ as } n \to \infty.$$
Since $\Theta(\epsilon)$ is an open set, we know that $\Theta \cap \Theta(\epsilon)^C$ is a compact set (Assumption 2). Since $Q_0(\theta)$ is a continuous function (Assumption 3), $\sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} \{Q_0(\theta)\}$ is achieved for a $\theta$ in the compact set. Denote this value by $\theta^*$. That is, $Q_0(\theta^*) = \sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} \{Q_0(\theta)\}$.

Consistency Proof

We know that $\theta_0$ is the unique maximizer of $Q_0(\theta)$ over all $\theta \in \Theta$ (Assumption 1). Therefore, $Q_0(\theta_0) - Q_0(\theta^*) = \delta > 0$. Let $A_n$ be the event that $Q(\theta_0; X_n) > Q_0(\theta_0) - \delta/2$ and $B_n$ the event that $\sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} |Q(\theta; X_n) - Q_0(\theta)| < \delta/2$. By uniform convergence in probability (Assumption 4), we know that

a. $P_{\theta_0}[A_n] \to 1$
b. $P_{\theta_0}[B_n] \to 1$
c. $P_{\theta_0}[A_n \cap B_n] \to 1$

$B_n$ implies that for all $\theta \in \Theta \cap \Theta(\epsilon)^C$,
$$Q(\theta; X_n) < Q_0(\theta) + \delta/2 \le Q_0(\theta^*) + \delta/2 = Q_0(\theta_0) - (Q_0(\theta_0) - Q_0(\theta^*)) + \delta/2 = Q_0(\theta_0) - \delta/2$$

Consistency Proof

Let $C_n$ be the event that $Q(\theta; X_n) < Q_0(\theta_0) - \delta/2$ for all $\theta \in \Theta \cap \Theta(\epsilon)^C$. Since $B_n \subseteq C_n$, we know that $P_{\theta_0}[C_n] \to 1$. Since $A_n \cap B_n \subseteq A_n \cap C_n$, we know that $P_{\theta_0}[A_n \cap C_n] \to 1$. The event $A_n \cap C_n$ implies that $Q(\theta_0; X_n) > \sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} \{Q(\theta; X_n)\}$. Since $Q(\hat{\theta}(X_n); X_n) \ge Q(\theta_0; X_n)$, we know that $Q(\hat{\theta}(X_n); X_n) > \sup_{\theta \in \Theta \cap \Theta(\epsilon)^C} \{Q(\theta; X_n)\}$. This implies that $\hat{\theta}(X_n) \in \Theta(\epsilon)$. Thus, we know that $P_{\theta_0}[\hat{\theta}(X_n) \in \Theta(\epsilon)] \to 1$.

Lemma 4.3

A key element of the above proof is that $Q(\theta; X_n)$ converges uniformly in probability to $Q_0(\theta)$. This is often difficult to prove. A useful condition is given by the following lemma:

Lemma 4.3: If $X_1, \ldots, X_n$ are i.i.d. with density $p(x; \theta_0) \in \{p(x; \theta) : \theta \in \Theta\}$, $\Theta$ is compact, $\log p(x; \theta)$ is continuous in $\theta$ for all $\theta \in \Theta$ and all $x \in \mathcal{X}$, and if there exists a function $d(x)$ such that $|\log p(x; \theta)| \le d(x)$ for all $\theta \in \Theta$ and $x \in \mathcal{X}$, and $E_{\theta_0}[d(X)] < \infty$, then

i. $Q_0(\theta) = E_{\theta_0}[\log p(X; \theta)]$ is continuous in $\theta$
ii. $\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| \xrightarrow{P} 0$

Proof of Lemma 4.3

Proof: We first prove the continuity of $Q_0(\theta)$. For any $\theta \in \Theta$, choose a sequence $\theta_k \in \Theta$ which converges to $\theta$. By the continuity of $\log p(x; \theta)$, we know that $\log p(x; \theta_k) \to \log p(x; \theta)$. Since $|\log p(x; \theta_k)| \le d(x)$, the dominated convergence theorem tells us that $Q_0(\theta_k) = E_{\theta_0}[\log p(X; \theta_k)] \to E_{\theta_0}[\log p(X; \theta)] = Q_0(\theta)$. This implies that $Q_0(\theta)$ is continuous. Next, we work to establish the uniform convergence in probability. We need to show that for any $\epsilon, \eta > 0$ there exists $N(\epsilon, \eta)$ such that for all $n > N(\epsilon, \eta)$,
$$P\left[\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| > \epsilon\right] < \eta$$

Proof of Lemma 4.3

Since $\log p(x; \theta)$ is continuous in $\theta \in \Theta$ and since $\Theta$ is compact, we know that $\log p(x; \theta)$ is uniformly continuous in $\theta$ (see Rudin, page 90). Uniform continuity says that for all $\epsilon > 0$ there exists a $\delta(\epsilon) > 0$ such that $|\log p(x; \theta_1) - \log p(x; \theta_2)| < \epsilon$ for all $\theta_1, \theta_2 \in \Theta$ for which $|\theta_1 - \theta_2| < \delta(\epsilon)$. This is a property of a function defined on a set of points. In contrast, continuity is defined relative to a particular point. Continuity of a function at a point $\theta^*$ says that for all $\epsilon > 0$, there exists a $\delta(\epsilon, \theta^*)$ such that $|\log p(x; \theta) - \log p(x; \theta^*)| < \epsilon$ for all $\theta \in \Theta$ for which $|\theta - \theta^*| < \delta(\epsilon, \theta^*)$. For continuity, $\delta$ depends on $\epsilon$ and $\theta^*$. For uniform continuity, $\delta$ depends only on $\epsilon$. In general, uniform continuity is stronger than continuity; however, they are equivalent on compact sets.

Aside: Uniform Continuity vs. Continuity

Consider $f(x) = 1/x$ for $x \in (0, 1)$. This function is continuous at each $x \in (0, 1)$. However, it is not uniformly continuous. Suppose this function were uniformly continuous. Then, for any $\epsilon > 0$, we could find a $\delta(\epsilon) > 0$ such that $|1/x_1 - 1/x_2| < \epsilon$ for all $x_1, x_2$ such that $|x_1 - x_2| < \delta(\epsilon)$. Given an $\epsilon > 0$, consider the points $x_1$ and $x_2 = x_1 + \delta(\epsilon)/2$. Then, we know that
$$\left|\frac{1}{x_1} - \frac{1}{x_2}\right| = \left|\frac{1}{x_1} - \frac{1}{x_1 + \delta(\epsilon)/2}\right| = \frac{\delta(\epsilon)/2}{x_1(x_1 + \delta(\epsilon)/2)}$$
Take $x_1$ sufficiently small so that $x_1 < \min\left(\frac{\delta(\epsilon)}{2}, \frac{1}{2\epsilon}\right)$. This implies that
$$\frac{\delta(\epsilon)/2}{x_1(x_1 + \delta(\epsilon)/2)} > \frac{\delta(\epsilon)/2}{x_1\, \delta(\epsilon)} = \frac{1}{2x_1} > \epsilon$$
This is a contradiction.
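A quick numerical view of this failure (our toy check, not from the slides): hold the gap between two points fixed and slide them toward 0; the gap in function values blows up, so no single $\delta(\epsilon)$ can work on all of $(0, 1)$.

```python
# For f(x) = 1/x on (0, 1), points a fixed distance apart map to values
# that are arbitrarily far apart as the points slide toward 0.
gap = 1e-4
for x1 in (0.5, 0.05, 0.005, 0.0005):
    x2 = x1 + gap
    print(x1, abs(1.0 / x1 - 1.0 / x2))   # grows without bound
```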

Proof of Lemma 4.3

Uniform continuity also implies that
$$\Delta(x, \delta) = \sup_{\{(\theta_1, \theta_2) : |\theta_1 - \theta_2| < \delta\}} |\log p(x; \theta_1) - \log p(x; \theta_2)| \to 0$$
as $\delta \to 0$. By the assumption of Lemma 4.3, we know that $\Delta(x, \delta) \le 2d(x)$ for all $\delta$. By the dominated convergence theorem, we know that $E_{\theta_0}[\Delta(X, \delta)] \to 0$ as $\delta \to 0$. Now, consider open balls of radius $\delta$ about each $\theta \in \Theta$, i.e., $B(\theta, \delta) = \{\tilde{\theta} : |\tilde{\theta} - \theta| < \delta\}$. The union of these open balls contains $\Theta$; this union is an open cover of $\Theta$. Since $\Theta$ is a compact set, we know that there exists a finite subcover, which we denote by $\{B(\theta_j, \delta), j = 1, \ldots, J\}$.

Proof of Lemma 4.3

By the triangle inequality, we know that
$$|Q(\theta; X_n) - Q_0(\theta)| \le \underbrace{|Q(\theta; X_n) - Q(\theta_j; X_n)|}_{(1)} + \underbrace{|Q(\theta_j; X_n) - Q_0(\theta_j)|}_{(2)} + \underbrace{|Q_0(\theta_j) - Q_0(\theta)|}_{(3)}$$
Choose $\theta_j$ so that $\theta \in B(\theta_j, \delta)$. Since $|\theta - \theta_j| < \delta$, we know that (1) is equal to
$$\left|\frac{1}{n} \sum_{i=1}^n \{\log p(X_i; \theta) - \log p(X_i; \theta_j)\}\right|$$
and this is less than or equal to
$$\frac{1}{n} \sum_{i=1}^n |\log p(X_i; \theta) - \log p(X_i; \theta_j)| \le \frac{1}{n} \sum_{i=1}^n \Delta(X_i, \delta)$$

Proof of Lemma 4.3

We also know that (3) is less than or equal to
$$\sup_{\{(\theta_1, \theta_2) : |\theta_1 - \theta_2| < \delta\}} |Q_0(\theta_1) - Q_0(\theta_2)|$$
Since $Q_0(\theta)$ is uniformly continuous, we know that this bound can be made arbitrarily small by choosing $\delta$ to be small. That is, this bound can be made less than $\epsilon^*(\delta)$, where $\epsilon^*(\delta) \to 0$ as $\delta \to 0$. We also know that (2) is less than or equal to
$$\max_{j=1,\ldots,J} |Q(\theta_j; X_n) - Q_0(\theta_j)|$$
Putting these results together, we have that
$$\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| \le \frac{1}{n} \sum_{i=1}^n \Delta(X_i, \delta) + \max_{j=1,\ldots,J} |Q(\theta_j; X_n) - Q_0(\theta_j)| + \epsilon^*(\delta)$$

Proof of Lemma 4.3

Choose $\delta$ so that $\epsilon^*(\delta) < \epsilon/3$. Call this $\delta_1$. So for any $\delta < \delta_1$, we know that
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| > \epsilon\right] \le P_{\theta_0}\left[\frac{1}{n} \sum_{i=1}^n \Delta(X_i, \delta) + \max_{j=1,\ldots,J} |Q(\theta_j; X_n) - Q_0(\theta_j)| > 2\epsilon/3\right]$$
$$\le \underbrace{P_{\theta_0}\left[\frac{1}{n} \sum_{i=1}^n \Delta(X_i, \delta) > \epsilon/3\right]}_{(4)} + \underbrace{P_{\theta_0}\left[\max_{j=1,\ldots,J} |Q(\theta_j; X_n) - Q_0(\theta_j)| > \epsilon/3\right]}_{(5)}$$
Now, we can show that we can take $n$ sufficiently large so that (4) and (5) can be made small.

Proof of Lemma 4.3

Note that (4) is equal to
$$P_{\theta_0}\left[\frac{1}{n} \sum_{i=1}^n \{\Delta(X_i, \delta) - E_{\theta_0}[\Delta(X, \delta)]\} + E_{\theta_0}[\Delta(X, \delta)] > \epsilon/3\right]$$
We have already demonstrated that $E_{\theta_0}[\Delta(X, \delta)] \to 0$ as $\delta \to 0$. Choose $\delta$ small enough that $E_{\theta_0}[\Delta(X, \delta)] < \epsilon/6$. Call this number $\delta_2$. Take $\delta < \min(\delta_1, \delta_2)$. Then (4) is less than
$$P_{\theta_0}\left[\frac{1}{n} \sum_{i=1}^n \{\Delta(X_i, \delta) - E_{\theta_0}[\Delta(X, \delta)]\} > \epsilon/6\right]$$
By the WLLN, we know that there exists $N_1(\epsilon, \eta)$ so that for all $n > N_1(\epsilon, \eta)$, the above term is less than $\eta/2$. So, (4) is less than $\eta/2$.

Proof of Lemma 4.3

For the $\delta$ considered so far, find the finite subcover $\{B(\theta_j, \delta), j = 1, \ldots, J\}$. Now, (5) is equal to
$$P_{\theta_0}\left[\bigcup_{j=1}^J \{|Q(\theta_j; X_n) - Q_0(\theta_j)| > \epsilon/3\}\right] \le \sum_{j=1}^J P_{\theta_0}[|Q(\theta_j; X_n) - Q_0(\theta_j)| > \epsilon/3]$$
By the WLLN, we know that for each $\theta_j$ and for any $\epsilon, \eta > 0$, there exists $N_{2j}(\epsilon, \eta)$ so that for all $n > N_{2j}(\epsilon, \eta)$,
$$P_{\theta_0}[|Q(\theta_j; X_n) - Q_0(\theta_j)| > \epsilon/3] \le \eta/(2J)$$
Let $N_2(\epsilon, \eta) = \max_{j=1,\ldots,J} \{N_{2j}\}$. Then, for all $n > N_2(\epsilon, \eta)$, we know that
$$\sum_{j=1}^J P_{\theta_0}[|Q(\theta_j; X_n) - Q_0(\theta_j)| > \epsilon/3] < \eta/2$$
This implies that (5) is less than $\eta/2$.

Proof of Lemma 4.3

Combining the results for (4) and (5), we have demonstrated that there exists an $N(\epsilon, \eta) = \max(N_1(\epsilon, \eta), N_2(\epsilon, \eta))$ so that for all $n > N(\epsilon, \eta)$,
$$P_{\theta_0}\left[\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| > \epsilon\right] < \eta$$
Q.E.D. PHEW!

Consistency

Other conditions that might be useful to establish uniform convergence in probability are given in the lemmas below. Lemma 4.4 may be useful when the data are not independent.

Lemma 4.4: If $\Theta$ is compact, $Q_0(\theta)$ is continuous in $\theta \in \Theta$, $Q(\theta; X_n) \xrightarrow{P} Q_0(\theta)$ for all $\theta \in \Theta$, and there is an $\alpha > 0$ and $C(X_n)$ which is bounded in probability such that for all $\theta, \tilde{\theta} \in \Theta$,
$$|Q(\tilde{\theta}; X_n) - Q(\theta; X_n)| \le C(X_n)\, |\tilde{\theta} - \theta|^\alpha$$
then
$$\sup_{\theta \in \Theta} |Q(\theta; X_n) - Q_0(\theta)| \xrightarrow{P} 0$$

Aside: $C(X_n)$ is bounded in probability if for every $\epsilon > 0$ there exist $N(\epsilon)$ and $\eta(\epsilon) > 0$ such that $P_{\theta_0}[|C(X_n)| > \eta(\epsilon)] < \epsilon$ for all $n > N(\epsilon)$.

Θ is Not Compact

Suppose that we are not willing to assume that $\Theta$ is compact. One way around this is to bound the objective function from above, uniformly in parameters that are far away from the truth. For example, suppose that there is a compact set $D$ such that
$$E_{\theta_0}\left[\sup_{\theta \in \Theta \cap D^C} \{\log p(X; \theta)\}\right] < Q_0(\theta_0) = E_{\theta_0}[\log p(X; \theta_0)]$$
By the law of large numbers, we know that with probability approaching one,
$$\sup_{\theta \in \Theta \cap D^C} \{Q(\theta; X_n)\} \le \frac{1}{n} \sum_{i=1}^n \sup_{\theta \in \Theta \cap D^C} \log p(X_i; \theta) < \frac{1}{n} \sum_{i=1}^n \log p(X_i; \theta_0) = Q(\theta_0; X_n)$$

Θ is Not Compact

Therefore, with probability approaching one, we know that $\hat{\theta}(X_n)$ is in the compact set $D$. Then, we can apply the previous theory to show consistency. The following lemma can be used in cases where the objective function is concave.

Lemma 4.5: If there is a function $Q_0(\theta)$ such that

i. $Q_0(\theta)$ is uniquely maximized at $\theta_0$,
ii. $\theta_0$ is an element of the interior of a convex set $\Theta$ (which does not have to be bounded),
iii. $Q(\theta; x_n)$ is concave in $\theta$ for each $x_n$, and
iv. $Q(\theta; X_n) \xrightarrow{P} Q_0(\theta)$ for all $\theta \in \Theta$,

then $\hat{\theta}(X_n)$ exists with probability approaching one and $\hat{\theta}(X_n) \xrightarrow{P} \theta_0$.

Asymptotic Normality

We assume that $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with common density $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta \subseteq \mathbb{R}^k\}$. We assume that $\theta_0$ is identified in the sense that if $\theta \neq \theta_0$ and $\theta \in \Theta$, then $p(x; \theta) \neq p(x; \theta_0)$ with respect to the dominating measure $\mu$. In order to prove asymptotic normality, we will need certain regularity conditions. Some of these were encountered in the proof of consistency, but we will need some additional assumptions.

Regularity Conditions

i. $\theta_0$ lies in the interior of $\Theta$, which is assumed to be a compact subset of $\mathbb{R}^k$.
ii. $\log p(x; \theta)$ is continuous at each $\theta \in \Theta$ for all $x \in \mathcal{X}$ (a.e. will suffice).
iii. $|\log p(x; \theta)| \le d(x)$ for all $\theta \in \Theta$ and $E_{\theta_0}[d(X)] < \infty$.
iv. $p(x; \theta)$ is twice continuously differentiable and $p(x; \theta) > 0$ in a neighborhood, $\mathcal{N}$, of $\theta_0$.
v. $\|\partial p(x; \theta)/\partial \theta\| \le e(x)$ for all $\theta \in \mathcal{N}$ and $\int e(x)\, d\mu(x) < \infty$.

Regularity Conditions

vi. Defining the score vector
$$\psi(x; \theta) = \left(\frac{\partial \log p(x; \theta)}{\partial \theta_1}, \ldots, \frac{\partial \log p(x; \theta)}{\partial \theta_k}\right)'$$
we assume that $I(\theta_0) = E_{\theta_0}[\psi(X; \theta_0)\, \psi(X; \theta_0)']$ exists and is non-singular.
vii. $\|\partial^2 \log p(x; \theta)/\partial \theta\, \partial \theta'\| \le f(x)$ for all $\theta \in \mathcal{N}$ and $E_{\theta_0}[f(X)] < \infty$.
viii. $\|\partial^2 p(x; \theta)/\partial \theta\, \partial \theta'\| \le g(x)$ for all $\theta \in \mathcal{N}$ and $\int g(x)\, d\mu(x) < \infty$.

Asymptotic Normality

Theorem 4.6: If these 8 regularity conditions hold, then
$$\sqrt{n}(\hat{\theta}(X_n) - \theta_0) \xrightarrow{D(\theta_0)} N(0, I^{-1}(\theta_0))$$
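Before turning to the proof, a simulation sketch (ours, for the hypothetical exponential running example) checks the theorem's prediction. For the exponential(rate $\theta$) model, $\hat{\theta} = 1/\bar{X}$ and $I(\theta_0) = 1/\theta_0^2$, so $\sqrt{n}(\hat{\theta} - \theta_0)$ should be approximately $N(0, \theta_0^2)$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, n, reps = 2.0, 500, 20_000

# Simulate many replicate data sets, compute the MLE 1/mean(x) in each,
# and look at the mean and variance of sqrt(n) * (theta_hat - theta0).
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)
z = np.sqrt(n) * (theta_hat - theta0)
print(z.mean(), z.var())   # roughly 0 and I(theta0)^{-1} = theta0**2 = 4
```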

Asymptotic Normality Proof

Proof: Note that conditions i.-iii. guarantee that the MLE is consistent. Since $\theta_0$ is assumed to lie in the interior of $\Theta$, we know that with sufficiently large probability the MLE will lie in $\mathcal{N}$ and cannot be on the boundary. This implies that the maximum is also a local maximum, which implies that $\partial Q(\hat{\theta}(X_n); X_n)/\partial \theta = 0$, or $\frac{1}{n} \sum_{i=1}^n \psi(X_i; \hat{\theta}(X_n)) = 0$. That is, the MLE is the solution to the score equations. Condition v. guarantees that we can interchange integration and differentiation. Consider the case where $k = 1$. We know that $1 = \int p(x; \theta)\, d\mu(x)$ for all $\theta \in \mathcal{N}$. This implies that $0 = \frac{d}{d\theta} \int p(x; \theta)\, d\mu(x)$. Let's show that $\frac{d}{d\theta} \int p(x; \theta)\, d\mu(x) = \int \frac{d}{d\theta} p(x; \theta)\, d\mu(x)$. Choose a sequence $\theta_n \in \mathcal{N}$ such that $\theta_n \to \theta$. Then, by definition of a derivative, we know that
$$\frac{dp(x; \theta)}{d\theta} = \lim_{n \to \infty} \left\{\frac{p(x; \theta_n) - p(x; \theta)}{\theta_n - \theta}\right\} \text{ for all } x \in \mathcal{X}$$

Asymptotic Normality Proof

By the mean value theorem, we know that
$$p(x; \theta_n) = p(x; \theta) + \frac{dp(x; \theta_n^*)}{d\theta}(\theta_n - \theta)$$
where $\theta_n^*$ lies between $\theta$ and $\theta_n$, so that $\theta_n^* \in \mathcal{N}$. This implies that
$$\left|\frac{p(x; \theta_n) - p(x; \theta)}{\theta_n - \theta}\right| = \left|\frac{dp(x; \theta_n^*)}{d\theta}\right| \le e(x)$$
Since $e(x)$ is integrable, we can employ the dominated convergence theorem. This says that
$$0 = \frac{d}{d\theta} \int p(x; \theta)\, d\mu(x) = \lim_{n \to \infty} \int \left\{\frac{p(x; \theta_n) - p(x; \theta)}{\theta_n - \theta}\right\} d\mu(x) = \int \lim_{n \to \infty} \left\{\frac{p(x; \theta_n) - p(x; \theta)}{\theta_n - \theta}\right\} d\mu(x) = \int \frac{dp(x; \theta)}{d\theta}\, d\mu(x)$$

Asymptotic Normality Proof

This can be generalized to partial derivatives, which can then be used to formally show that $E_\theta[\psi(X; \theta)] = 0$ for $\theta \in \mathcal{N}$. We know that $\int p(x; \theta)\, d\mu(x) = 1$. This implies that $\frac{\partial}{\partial \theta_j} \int p(x; \theta)\, d\mu(x) = 0$. By dominated convergence, we can interchange differentiation and integration so that $\int \frac{\partial p(x; \theta)}{\partial \theta_j}\, d\mu(x) = 0$. Then, we know that
$$\int \frac{\partial p(x; \theta)/\partial \theta_j}{p(x; \theta)}\, p(x; \theta)\, d\mu(x) = 0$$
We can divide by $p(x; \theta)$ since it is greater than zero for all $\theta \in \mathcal{N}$. This implies that $E_\theta[\psi_j(X; \theta)] = 0$.

Asymptotic Normality Proof

The random vectors $\psi(X_1; \theta_0), \ldots, \psi(X_n; \theta_0)$ are i.i.d. mean-zero random vectors with covariance matrix $I(\theta_0)$. By the multivariate central limit theorem, we know that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \theta_0) \xrightarrow{D(\theta_0)} N(0, I(\theta_0))$$
Now we shall study the large sample behavior of the matrix of second partial derivatives of the log-likelihood. Define
$$J_n(\theta) = -\left[\frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log p(X_i; \theta)}{\partial \theta\, \partial \theta'}\right]$$
This is a $k \times k$ random matrix.
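A Monte Carlo sketch (ours) of the two score facts for the hypothetical exponential running example, where $\psi(x; \theta) = 1/\theta - x$ and $I(\theta_0) = \mathrm{Var}(X) = 1/\theta_0^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, n, reps = 2.0, 400, 20_000

# Score of the exponential(rate theta) model:
# psi(x; theta) = d/dtheta [log(theta) - theta * x] = 1/theta - x.
x = rng.exponential(scale=1.0 / theta0, size=(reps, n))
psi = 1.0 / theta0 - x

print(psi.mean())                    # E_{theta0}[psi(X; theta0)] = 0
s = psi.sum(axis=1) / np.sqrt(n)     # (1/sqrt(n)) sum_i psi(X_i; theta0)
print(s.var(), 1.0 / theta0**2)      # variance approx I(theta0) = 1/theta0**2
```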

Asymptotic Normality Proof

We first want to demonstrate that $E_{\theta_0}[J_n(\theta_0)] = I(\theta_0)$. We have already established that $\int \frac{\partial p(x; \theta)}{\partial \theta_j}\, d\mu(x) = 0$ for all $j = 1, \ldots, k$. This implies that
$$\frac{\partial}{\partial \theta_{j'}} \int \frac{\partial p(x; \theta)}{\partial \theta_j}\, d\mu(x) = 0 \text{ for all } j, j' = 1, \ldots, k.$$
Since the norm of the matrix of second partial derivatives of $p(x; \theta)$ is bounded by an integrable function $g(x)$ (see condition viii.), we know that we can interchange integration and differentiation. This implies that
$$\int \frac{\partial^2 p(x; \theta)}{\partial \theta_j\, \partial \theta_{j'}}\, d\mu(x) = 0 \text{ for all } j, j' = 1, \ldots, k.$$
We note, however, that
$$\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_{j'}} = \frac{\partial^2 p(x; \theta)}{\partial \theta_j\, \partial \theta_{j'}} \Big/ p(x; \theta) - \psi_j(x; \theta)\, \psi_{j'}(x; \theta)$$

Asymptotic Normality Proof

This implies that
$$\frac{\partial^2 p(x; \theta)}{\partial \theta_j\, \partial \theta_{j'}} = p(x; \theta)\left[\frac{\partial^2 \log p(x; \theta)}{\partial \theta_j\, \partial \theta_{j'}} + \psi_j(x; \theta)\, \psi_{j'}(x; \theta)\right]$$
So,
$$0 = \int \frac{\partial^2 p(x; \theta_0)}{\partial \theta_j\, \partial \theta_{j'}}\, d\mu(x) = \int p(x; \theta_0)\, \frac{\partial^2 \log p(x; \theta_0)}{\partial \theta_j\, \partial \theta_{j'}}\, d\mu(x) + \int p(x; \theta_0)\, \psi_j(x; \theta_0)\, \psi_{j'}(x; \theta_0)\, d\mu(x)$$
This implies that $I_{jj'}(\theta_0) = -E_{\theta_0}\left[\frac{\partial^2 \log p(X; \theta_0)}{\partial \theta_j\, \partial \theta_{j'}}\right]$. In matrix form, this says that $I(\theta_0) = -E_{\theta_0}\left[\frac{\partial^2 \log p(X; \theta_0)}{\partial \theta\, \partial \theta'}\right]$. So, $E_{\theta_0}[J_n(\theta_0)] = I(\theta_0)$.
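The information identity can be checked numerically. Here is a sketch (ours) for a hypothetical Poisson($\theta$) model, chosen because its second derivative $-x/\theta^2$ actually depends on the data; both sides equal $1/\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(5)
theta0 = 2.0
x = rng.poisson(lam=theta0, size=500_000)

# Poisson(theta) model: log p(x; theta) = x log(theta) - theta - log(x!).
# Score: psi(x; theta) = x/theta - 1; second derivative: -x/theta**2.
psi_sq = (x / theta0 - 1.0) ** 2
neg_hess = x / theta0**2

# Information identity: E[psi^2] = -E[d^2 log p / dtheta^2] (= 1/theta0 here).
print(psi_sq.mean(), neg_hess.mean(), 1.0 / theta0)
```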

Asymptotic Normality Proof

By the weak law of large numbers, we know that
$$J_n(\theta) \xrightarrow{P} -E_{\theta_0}\left[\frac{\partial^2 \log p(X; \theta)}{\partial \theta\, \partial \theta'}\right]$$
Define $I_0^*(\theta) = -E_{\theta_0}\left[\frac{\partial^2 \log p(X; \theta)}{\partial \theta\, \partial \theta'}\right]$. Note that $I_0^*(\theta_0) = I(\theta_0)$. The WLLN enables us to show that $J_n(\theta) \xrightarrow{P} I_0^*(\theta)$ for all $\theta \in \mathcal{N}$. This is pointwise convergence. However, as we will see shortly, this will not suffice to prove our desired results. We will need something stronger. We will have to show that $J_n(\theta)$ converges uniformly in probability to $I_0^*(\theta)$ for $\theta \in \mathcal{N}$. The proof follows in exactly the same way as it did when we proved uniform convergence in probability for consistency.

Asymptotic Normality Proof

We know that $\partial^2 \log p(x; \theta)/\partial \theta\, \partial \theta'$ is continuous in $\theta \in \mathcal{N}$. This follows by assumption iv. By condition vii., we know that $\|\partial^2 \log p(x; \theta)/\partial \theta\, \partial \theta'\| \le f(x)$ for $\theta \in \mathcal{N}$ and $E_{\theta_0}[f(X)] < \infty$. By the same proof as Lemma 4.3, we know that

a. $I_0^*(\theta)$ is continuous for $\theta \in \mathcal{N}$, and
b. $\sup_{\theta \in \mathcal{N}} \|J_n(\theta) - I_0^*(\theta)\| \xrightarrow{P} 0$

Lemma 4.7

This result is important because it allows us to prove the following lemma:

Lemma 4.7: If $\theta_n^*(X_n)$ is a consistent estimator for $\theta_0$, then $J_n(\theta_n^*(X_n)) \xrightarrow{P} I(\theta_0)$.
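This is the result that justifies plugging the MLE itself into the observed information. A sketch (ours) for the hypothetical exponential running example, where $J_n(\theta) = 1/\theta^2$ for any data set:

```python
import numpy as np

rng = np.random.default_rng(6)
theta0 = 2.0

# Exponential(rate theta): d^2 log p / dtheta^2 = -1/theta**2, so
# J_n(theta) = 1/theta**2. Plugging in the consistent MLE 1/mean(x)
# gives J_n(theta_hat), which Lemma 4.7 says converges in probability
# to I(theta0) = 1/theta0**2.
for n in (50, 500, 50_000):
    x = rng.exponential(scale=1.0 / theta0, size=n)
    theta_hat = 1.0 / x.mean()
    print(n, 1.0 / theta_hat**2, 1.0 / theta0**2)
```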

Proof of Lemma 4.7

Proof: By the triangle inequality, we know that
$$\|J_n(\theta_n^*(X_n)) - I(\theta_0)\| \le \|J_n(\theta_n^*(X_n)) - I_0^*(\theta_n^*(X_n))\| + \|I_0^*(\theta_n^*(X_n)) - I(\theta_0)\|$$
So, it suffices to prove that

1. $\|J_n(\theta_n^*(X_n)) - I_0^*(\theta_n^*(X_n))\| \xrightarrow{P} 0$, and
2. $\|I_0^*(\theta_n^*(X_n)) - I(\theta_0)\| \xrightarrow{P} 0$.

Since $I_0^*(\theta)$ is a continuous function of $\theta \in \mathcal{N}$, we know that $I_0^*(\theta_n^*(X_n)) \xrightarrow{P} I(\theta_0)$. This is because a continuous function of a consistent estimator converges in probability to the function of the estimand. So, 2. is true. In order to prove 1., we first note that with arbitrarily large probability and $n$ sufficiently large, $\theta_n^*(X_n) \in \mathcal{N}$. When this happens, we have that
$$\|J_n(\theta_n^*(X_n)) - I_0^*(\theta_n^*(X_n))\| \le \sup_{\theta \in \mathcal{N}} \|J_n(\theta) - I_0^*(\theta)\| \xrightarrow{P} 0$$

Asymptotic Normality Proof

Now, we have all the pieces required to establish asymptotic normality. By the mean value theorem, applied to each element of the score vector, we have that
$$0 = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \hat{\theta}(X_n)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \theta_0) + \{-J_n^*(X_n)\}\, \sqrt{n}(\hat{\theta}(X_n) - \theta_0)$$
Note that $J_n^*(X_n)$ is a $k \times k$ random matrix where the $j$th row of the matrix is the $j$th row of $J_n$ evaluated at $\theta_{jn}^*(X_n)$, where $\theta_{jn}^*(X_n)$ is an intermediate value between $\hat{\theta}(X_n)$ and $\theta_0$. $\theta_{jn}^*(X_n)$ may be different from row to row, but it will be consistent for $\theta_0$. Therefore, $J_n^*(X_n) \xrightarrow{P} I(\theta_0)$.

Asymptotic Normality Proof

By assumption vi., we know that $I(\theta_0)$ is non-singular. Matrix inversion is continuous at any non-singular matrix. Since $J_n^*(X_n) \xrightarrow{P} I(\theta_0)$, we know that $\{J_n^*(X_n)\}^{-1} \xrightarrow{P} I(\theta_0)^{-1}$. This also means that with sufficiently large probability, as $n$ gets large, $J_n^*(X_n)$ is invertible. Therefore, we know that
$$\sqrt{n}(\hat{\theta}(X_n) - \theta_0) = \{J_n^*(X_n)\}^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \theta_0)$$
Employing Slutsky's theorem, we know that
$$\sqrt{n}(\hat{\theta}(X_n) - \theta_0) \xrightarrow{D} N(0, I(\theta_0)^{-1})$$
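One practical use of Theorem 4.6 is the Wald confidence interval $\hat{\theta} \pm z_{1-\alpha/2}\, \sqrt{I(\hat{\theta})^{-1}/n}$, where $I(\hat{\theta})$ is justified as a substitute for $I(\theta_0)$ by Lemma 4.7. A sketch (ours) for the hypothetical exponential running example, where $I(\theta) = 1/\theta^2$:

```python
import numpy as np

rng = np.random.default_rng(7)
theta0, n = 2.0, 500

# Asymptotic 95% Wald interval: theta_hat +/- 1.96 * sqrt(I(theta_hat)^{-1} / n).
x = rng.exponential(scale=1.0 / theta0, size=n)
theta_hat = 1.0 / x.mean()
se = np.sqrt(theta_hat**2 / n)   # sqrt(I(theta_hat)^{-1} / n) for this model
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```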
