Section 9: Generalized method of moments

In this section, we revisit unbiased estimating functions to study a more general framework for estimating parameters. Let $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with density $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta\}$. We assume that $\theta = (\gamma, \lambda)$, $\Theta = \Gamma \times \Lambda$, where $\Gamma \subseteq \mathbb{R}^k$ and $\Lambda$ is some appropriately defined space. We are interested in estimating $\gamma$. Suppose that there exists a $p$-dimensional ($p \geq k$) vector of

estimating functions
$$g(x; \gamma) = \begin{pmatrix} g_1(x; \gamma) \\ g_2(x; \gamma) \\ \vdots \\ g_p(x; \gamma) \end{pmatrix}$$
such that $E_\theta[g(X; \gamma)] = 0$ for all $\theta \in \Theta$.

Economists usually consider situations in which $p > k$; we usually consider $p = k$. A generalized method of moments (GMM) estimator is one that minimizes a squared Euclidean distance of sample moments from their population counterparts. A GMM estimator $\hat\gamma(X_n)$ is the value of $\gamma$ which maximizes
$$Q(\gamma; X_n) = -\left[\frac{1}{n}\sum_{i=1}^n g(X_i; \gamma)\right]' \hat W_n \left[\frac{1}{n}\sum_{i=1}^n g(X_i; \gamma)\right] \qquad (1)$$
where $\hat W_n \stackrel{P}{\to} W$, a non-random, positive semi-definite matrix.
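To make the definition concrete, here is a minimal numerical sketch of maximizing (1), written as minimizing $-Q$; the moment function g, the data X, and the weight matrix W_hat are hypothetical placeholders, not objects defined in these notes.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_objective(gamma, g, X, W_hat):
    # qbar = (1/n) sum_i g(X_i; gamma), the vector of sample moments
    qbar = np.mean([g(x, gamma) for x in X], axis=0)
    # qbar' W_hat qbar = -Q(gamma; X_n) >= 0, so we minimize it
    return qbar @ W_hat @ qbar

def gmm_estimate(g, X, W_hat, gamma_init):
    # Maximizing Q is the same as minimizing the quadratic form above.
    return minimize(gmm_objective, gamma_init, args=(g, X, W_hat)).x
```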

The maximum likelihood estimator is a specific example of a GMM estimator. Under regularity conditions, we know that the MLE, $\hat\gamma(X_n)$, satisfies
$$\frac{1}{n}\sum_{i=1}^n \psi(X_i; \hat\gamma(X_n)) = 0$$
So, if we take $g(x; \gamma) = \psi(x; \gamma)$ (here $p = k$) and $\hat W_n = I$, then
$$Q(\gamma; X_n) = -\left[\frac{1}{n}\sum_{i=1}^n \psi(X_i; \gamma)\right]' \left[\frac{1}{n}\sum_{i=1}^n \psi(X_i; \gamma)\right]$$
has maximum value zero, which is attained at the MLE.

In cases where $p = k$, the GMM estimator can usually be found by solving the $k$-equations-in-$k$-unknowns problem
$$\frac{1}{n}\sum_{i=1}^n g(X_i; \gamma) = 0$$
That is, the quantity $Q(\gamma; X_n)$ (which is less than or equal to zero) can be made identically equal to zero. Such estimators are referred to as M-estimators.
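As a minimal sketch of this $p = k$ case (with the same hypothetical g and X as above), we can hand the sample moment equations directly to a generic root-finder instead of optimizing the quadratic form:

```python
import numpy as np
from scipy.optimize import root

def m_estimate(g, X, gamma_init):
    # Solve the k equations (1/n) sum_i g(X_i; gamma) = 0 in the k unknowns gamma.
    moment_eq = lambda gamma: np.mean([g(x, gamma) for x in X], axis=0)
    return root(moment_eq, gamma_init).x
```

(For non-smooth moment functions, such as the indicator in Example 9.1 below, a generic root-finder may struggle; there the solution is available directly.)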

Example 9.1. Suppose $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with c.d.f. $F_0 \in \mathcal{P} = \{F : F \text{ is continuously differentiable}\}$. Let $\gamma = F^{-1}(0.5)$ and $\gamma_0 = F_0^{-1}(0.5)$. Define the estimating function
$$g(x; \gamma) = I(x \leq \gamma) - 0.5$$
Note that $E_F[g(X; F^{-1}(0.5))] = 0$ for all $F \in \mathcal{P}$. The corresponding M-estimator is the sample median.
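A quick numerical check of this example, using simulated standard-normal data purely for illustration: the sample median drives the sample moment to essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1001)          # odd n, so the median is a data point
gamma_hat = np.median(X)
# (1/n) sum_i {I(X_i <= gamma_hat) - 0.5}; equals 1/(2n) here, essentially zero
moment = np.mean((X <= gamma_hat) - 0.5)
print(gamma_hat, moment)
```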

Example 9.2. Assume that the data are i.i.d. random vectors $(Y_1, X_1), \ldots, (Y_n, X_n)$ with $E_{\theta_0}[Y_i \mid X_i] = \mu(X_i; \gamma_0)$. Let's assume that $Y$ is a $q$-dimensional vector and $X$ is $r$-dimensional. More formally, we are assuming that $(Y_i, X_i) \sim p(y, x; \theta_0) \in \mathcal{P} = \{p(y, x; \theta) : \int y\, p(y \mid x; \theta)\, dy = \mu(x; \gamma)\}$.

Example 9.2a: Linear Regression (take $q = r = 1$ and $k = 2$)
$$E_\theta[Y \mid X] = \gamma_0 + \gamma_1 X$$

Example 9.2b: Multivariate Linear Regression
$$E_\theta[Y \mid X] = \begin{pmatrix} \gamma_{10} + \gamma_{11} X_1 + \cdots + \gamma_{1r} X_r \\ \vdots \\ \gamma_{q0} + \gamma_{q1} X_1 + \cdots + \gamma_{qr} X_r \end{pmatrix}$$
Example 9.2c: Logistic Regression ($Y$ is binary)
$$E_\theta[Y \mid X] = \frac{\exp(\gamma_0 + \gamma_1 X_1 + \cdots + \gamma_r X_r)}{1 + \exp(\gamma_0 + \gamma_1 X_1 + \cdots + \gamma_r X_r)}$$

Consider the following moment function:
$$g(Y_i, X_i; \gamma) = A(X_i; \gamma)(Y_i - \mu(X_i; \gamma))$$
where $A(X_i; \gamma)$ is a $k \times q$ matrix which is a function of $\gamma$ and $X_i$. Now, note that
$$E_\theta[g(Y_i, X_i; \gamma)] = E_\theta[E_\theta[g(Y_i, X_i; \gamma) \mid X_i]] = E_\theta[A(X_i; \gamma)\, E_\theta[Y_i - \mu(X_i; \gamma) \mid X_i]] = 0$$
for all $\theta$. The solution to the following equation:
$$\frac{1}{n}\sum_{i=1}^n A(X_i; \gamma)(Y_i - \mu(X_i; \gamma)) = 0$$
is called the GEE (generalized estimating equations) estimator.
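As an illustration, here is a minimal sketch of a GEE fit for the logistic mean of Example 9.2c with $q = 1$, $r = 1$, using the simple (hypothetical) choice $A(X; \gamma) = (1, X)'$; with this choice the GEE coincides with the logistic regression score equations.

```python
import numpy as np
from scipy.optimize import root

def gee_logistic(Y, X, gamma_init):
    # Solve (1/n) sum_i A(X_i)(Y_i - mu(X_i; gamma)) = 0 with A(X) = (1, X)'.
    def moment_eq(gamma):
        mu = 1.0 / (1.0 + np.exp(-(gamma[0] + gamma[1] * X)))  # logistic mean
        resid = Y - mu
        return np.array([np.mean(resid), np.mean(X * resid)])
    return root(moment_eq, gamma_init).x
```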

Example 9.3. Suppose $X_n = (X_1, \ldots, X_n)$, where the $X_i = (Y_i, W_i, Z_i)$'s are i.i.d. We assume $Y$ is scalar, $W$ is $k$-dimensional, and $Z$ is $p$-dimensional ($p \geq k$). Further, assume that $Y = W'\gamma_0 + \epsilon$, where $Y = (Y_1, \ldots, Y_n)'$, $W = (W_1, \ldots, W_n)$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)'$, the $\epsilon_i$'s are i.i.d. with variance $\sigma_0^2$, and $E_{\theta_0}[\epsilon_i W_i] = 0$. Consider the following $k$-dimensional estimating function:
$$g(X; \gamma) = W(Y - W'\gamma)$$
Note that $g(X; \gamma_0)$ has mean zero because $E_\theta[\epsilon W] = 0$. The corresponding M-estimator is the least squares estimator.
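Here the sample moment equation is linear in $\gamma$, so the M-estimator has the familiar closed form; a one-line sketch:

```python
import numpy as np

def ols_via_moments(Y, W):
    # Solve (1/n) sum_i W_i (Y_i - W_i' gamma) = 0:
    # gamma_hat = (sum_i W_i W_i')^{-1} sum_i W_i Y_i, with W an n x k matrix.
    return np.linalg.solve(W.T @ W, W.T @ Y)
```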

Suppose that $E_\theta[\epsilon_i W_i] \neq 0$, but $E_\theta[\epsilon_i Z_i] = 0$. Then, consider the following $p$-dimensional estimating function:
$$g(X; \gamma) = Z(Y - W'\gamma)$$
Note that $g(X; \gamma_0)$ has mean zero because $E_\theta[\epsilon Z] = 0$. $Z$ is called an instrumental variable. Economists try to find these variables because they often deal with situations where $W$ and $\epsilon$ are correlated. When $p > k$, we estimate $\gamma$ by maximizing the quadratic form given by (1).
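In the just-identified case $p = k$, the instrumental-variables moment equation is also linear in $\gamma$ and has a closed-form solution; a sketch under that assumption:

```python
import numpy as np

def iv_just_identified(Y, W, Z):
    # Solve (1/n) sum_i Z_i (Y_i - W_i' gamma) = 0:
    # gamma_hat = (Z'W)^{-1} Z'Y, with Z (n x k instruments) and W (n x k regressors).
    return np.linalg.solve(Z.T @ W, Z.T @ Y)
```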

As with MLEs, we shall study the large sample properties of GMM estimators. This will include a study of consistency, asymptotic normality, and efficiency. We first note that $Q(\gamma; X_n)$, given by (1), is minus a quadratic form, which is a random function of $\gamma$. It converges in probability to a deterministic function of $\gamma$:
$$Q_0(\gamma) = -q_0(\gamma)' W q_0(\gamma) \leq 0 \qquad (2)$$
where $q_0(\gamma) = E_{\theta_0}[g(X; \gamma)]$. By the WLLN, pointwise convergence for each $\gamma$ is straightforward.

We start with the issue of consistency. The approach we shall take is to try to mimic the proof we used to establish the consistency of the MLE. That is, we will consider $Q(\gamma; X_n)$ to be an objective function which we would like to maximize. Our intuition says that the maximizer of $Q(\gamma; X_n)$ should converge to the maximizer of $Q_0(\gamma)$. Just as we did previously, we would like to establish conditions under which $Q_0(\gamma)$ has a unique maximum at $\gamma_0$. We know that $q_0(\gamma_0) = 0$. Since $Q_0(\gamma)$ is minus a quadratic form, we know that its maximum is achieved at $\gamma_0$. We must establish conditions under which this maximum is unique. In other words, we want to find conditions so that $Q_0(\gamma) \neq 0$ when $\gamma \neq \gamma_0$.

Lemma 9.1: If $W$ is positive semi-definite and $W q_0(\gamma) \neq 0$ for $\gamma \neq \gamma_0$, then $Q_0(\gamma) = -q_0(\gamma)' W q_0(\gamma)$ has a unique maximum at $\gamma_0$.

Proof: Since $W$ is positive semi-definite, we know that there exists a possibly singular matrix $R$ such that $R'R = W$ (see page 257 of Strang, 1980). If $\gamma \neq \gamma_0$, then $W q_0(\gamma) = R'R q_0(\gamma) \neq 0$. This implies that $R q_0(\gamma) \neq 0$. Hence,
$$Q_0(\gamma) = -[R q_0(\gamma)]'[R q_0(\gamma)] < Q_0(\gamma_0) = 0$$

In some cases it may be difficult to show that $Q_0(\gamma)$ has a unique maximum at $\gamma_0$. This is especially true when we are doing M-estimation ($p = k$): there may be many solutions to the $k$ equations in $k$ unknowns, especially when the equations are complicated and non-linear. If $W$ is positive definite, then a unique maximum is guaranteed if $q_0(\gamma) \neq 0$ whenever $\gamma \neq \gamma_0$. By increasing the dimension $p$ of the moment function, we might make this more likely to happen. This must be studied on a case-by-case basis.

Back to Example 9.1:
$$q_0(\gamma) = E_{F_0}[I(X \leq \gamma)] - 0.5 = F_0(\gamma) - 0.5$$
Since $F_0$ is continuous and strictly increasing, $q_0(\gamma) = 0$ if and only if $\gamma = \gamma_0$.

Back to Example 9.3: Suppose $E_{\theta_0}[\epsilon W] = 0$ and $E_{\theta_0}[WW']$ is finite and positive definite. When $g(X; \gamma) = W(Y - W'\gamma)$, we know that
$$q_0(\gamma) = E_{\theta_0}[W(Y - W'\gamma)] = E_{\theta_0}[W(W'\gamma_0 - W'\gamma + \epsilon)] = E_{\theta_0}[WW'](\gamma_0 - \gamma)$$
This equals zero if and only if $\gamma = \gamma_0$.

Suppose $E_{\theta_0}[\epsilon W] \neq 0$, $E_{\theta_0}[\epsilon Z] = 0$, and $E_{\theta_0}[ZW']$ is finite and full rank. When $g(X; \gamma) = Z(Y - W'\gamma)$, we know that
$$q_0(\gamma) = E_{\theta_0}[Z(Y - W'\gamma)] = E_{\theta_0}[Z(W'\gamma_0 - W'\gamma + \epsilon)] = E_{\theta_0}[ZW'](\gamma_0 - \gamma)$$
This equals zero if and only if $\gamma = \gamma_0$.

Theorem 9.2: Suppose that $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with density $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta\}$. Assume that $\theta = (\gamma, \lambda)$, $\Theta = \Gamma \times \Lambda$, where $\Gamma \subseteq \mathbb{R}^k$ and $\Lambda$ is some appropriately defined space. In addition, suppose that
i. $\hat W_n \stackrel{P}{\to} W$;
ii. $W$ is positive definite and $q_0(\gamma) = E_{\theta_0}[g(X; \gamma)] = 0$ only if $\gamma = \gamma_0$ (unique maximum at $\gamma = \gamma_0$);
iii. $\Gamma$ is compact;
iv. $g(x; \gamma)$ is continuous in $\gamma \in \Gamma$, for all $x \in \mathcal{X}$;
v. $\|g(x; \gamma)\| \leq d(x)$ for all $\gamma \in \Gamma$ and $E_{\theta_0}[d(X)] < \infty$.
Then $\hat\gamma(X_n)$, the estimator which maximizes $Q(\gamma; X_n)$ given by (1), converges in probability to $\gamma_0$.

Proof: The proof is almost identical to that used to prove the consistency of the MLE. That is, if we show that the objective

function $Q(\gamma; X_n)$ converges uniformly in probability to $Q_0(\gamma)$ given by (2), then, combined with the fact that $Q_0(\gamma)$ is uniquely maximized at $\gamma_0$, we can use Theorem 8.2 to prove consistency. To use Theorem 8.2, we must show that
a. $Q(\gamma; X_n)$ is continuous in $\gamma$;
b. $Q_0(\gamma)$ is continuous in $\gamma$;
c. $\sup_{\gamma \in \Gamma} |Q(\gamma; X_n) - Q_0(\gamma)| \stackrel{P}{\to} 0$.
Condition a. is implied by the continuity of $g(x; \gamma)$ in $\gamma$ (Condition iv. above). By the continuity of $g(x; \gamma)$ in $\gamma$ and the fact that $g(x; \gamma)$ is dominated uniformly by an integrable function (Conditions iv. and v. above), we can use Lemma 8.3 to show that
d. $q_0(\gamma)$ is continuous in $\gamma$;
e. $\sup_{\gamma \in \Gamma} \|\hat q(\gamma; X_n) - q_0(\gamma)\| \stackrel{P}{\to} 0$,

where $\hat q(\gamma; X_n) = \frac{1}{n}\sum_{i=1}^n g(X_i; \gamma)$. Condition d. implies Condition b. Therefore, we are left to show that Condition c. holds. Adding and subtracting terms, we know that
$$Q(\gamma; X_n) - Q_0(\gamma) = -\hat q(\gamma; X_n)' \hat W_n \hat q(\gamma; X_n) + q_0(\gamma)' W q_0(\gamma)$$
$$= -[\hat q(\gamma; X_n) - q_0(\gamma)]' \hat W_n [\hat q(\gamma; X_n) - q_0(\gamma)] - q_0(\gamma)'[\hat W_n + \hat W_n'][\hat q(\gamma; X_n) - q_0(\gamma)] - q_0(\gamma)'[\hat W_n - W] q_0(\gamma)$$

By the triangle inequality, we know that
$$|Q(\gamma; X_n) - Q_0(\gamma)| \leq |[\hat q(\gamma; X_n) - q_0(\gamma)]' \hat W_n [\hat q(\gamma; X_n) - q_0(\gamma)]| + |q_0(\gamma)'[\hat W_n + \hat W_n'][\hat q(\gamma; X_n) - q_0(\gamma)]| + |q_0(\gamma)'[\hat W_n - W] q_0(\gamma)|$$
$$\leq \|\hat q(\gamma; X_n) - q_0(\gamma)\|^2 \|\hat W_n\| + 2\|q_0(\gamma)\|\, \|\hat W_n\|\, \|\hat q(\gamma; X_n) - q_0(\gamma)\| + \|q_0(\gamma)\|^2 \|\hat W_n - W\|$$
So,
$$\sup_{\gamma \in \Gamma} |Q(\gamma; X_n) - Q_0(\gamma)| \leq \sup_{\gamma \in \Gamma} \|\hat q(\gamma; X_n) - q_0(\gamma)\|^2 \|\hat W_n\| \qquad (3)$$
$$\qquad + 2 \sup_{\gamma \in \Gamma} \|q_0(\gamma)\|\, \|\hat W_n\| \sup_{\gamma \in \Gamma} \|\hat q(\gamma; X_n) - q_0(\gamma)\| \qquad (4)$$
$$\qquad + \sup_{\gamma \in \Gamma} \|q_0(\gamma)\|^2 \|\hat W_n - W\| \qquad (5)$$

Now we can show that each of the terms on the RHS converges in probability to zero, which will complete the proof. First, we know that $\hat W_n \stackrel{P}{\to} W$, which we assume is fixed and bounded. We also know that $\sup_{\gamma \in \Gamma} \|\hat q(\gamma; X_n) - q_0(\gamma)\| \stackrel{P}{\to} 0$. Since $q_0(\gamma)$ is a continuous function on a compact set, we know that $\sup_{\gamma \in \Gamma} \|q_0(\gamma)\|$ is bounded. These facts imply that (3), (4) and (5) converge in probability to zero. Thus, we have uniform convergence in probability of $Q(\gamma; X_n)$ to $Q_0(\gamma)$.

Asymptotic Normality of the GMM Estimator

After establishing the consistency of the GMM estimator, we are now in a position to prove asymptotic normality. The proof will follow that given for the MLE. Before we embark on this proof, we need some preliminaries. First, we will take $\gamma_0$ to be in the interior of a compact set $\Gamma$. Since $\hat\gamma(X_n)$ is consistent for $\gamma_0$, the maximum of $Q(\gamma; X_n)$ will, with probability approaching one, be a local maximum. That is,
$$\frac{\partial Q(\hat\gamma(X_n); X_n)}{\partial \gamma} = 0$$

Since $Q(\gamma; X_n) = -\hat q(\gamma; X_n)' \hat W_n \hat q(\gamma; X_n)$, we know that
$$\frac{\partial Q(\gamma; X_n)}{\partial \gamma} = -2\, \frac{\partial \hat q(\gamma; X_n)'}{\partial \gamma}\, \hat W_n\, \hat q(\gamma; X_n) = -2\, \hat D(\gamma; X_n)'\, \hat W_n\, \hat q(\gamma; X_n)$$
where
$$\hat D(\gamma; X_n) = \frac{\partial \hat q(\gamma; X_n)}{\partial \gamma'} = \frac{1}{n}\sum_{i=1}^n \frac{\partial g(X_i; \gamma)}{\partial \gamma'}$$
So,
$$-2\, \hat D(\hat\gamma(X_n); X_n)'\, \hat W_n\, \hat q(\hat\gamma(X_n); X_n) = 0$$

Theorem 9.3: Suppose that $X_n = (X_1, \ldots, X_n)$, where the $X_i$'s are i.i.d. with density $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta\}$. Assume that $\theta = (\gamma, \lambda)$, $\Theta = \Gamma \times \Lambda$, where $\Gamma \subseteq \mathbb{R}^k$ and $\Lambda$ is some appropriately defined space. In addition, suppose that the following 8 regularity conditions hold:
i. $\hat W_n \stackrel{P}{\to} W$;
ii. $\gamma_0$ is in the interior of $\Gamma$, which is assumed to be compact;
iii. $W$ is positive definite and $q_0(\gamma) = E_{\theta_0}[g(X; \gamma)] = 0$ only if $\gamma = \gamma_0$;
iv. $g(x; \gamma)$ is continuous in $\gamma \in \Gamma$, for all $x \in \mathcal{X}$;
v. $\|g(x; \gamma)\| \leq d(x)$ for all $\gamma \in \Gamma$ and $E_{\theta_0}[d(X)] < \infty$;

vi. $g(x; \gamma)$ is continuously differentiable in a neighborhood $N$ of $\gamma_0$;
vii. $\|\partial g(x; \gamma)/\partial \gamma'\| \leq f(x)$ for all $\gamma \in N$ and $E_{\theta_0}[f(X)] < \infty$;
viii. $D'WD$ is non-singular, where $D = E_{\theta_0}[\partial g(X; \gamma_0)/\partial \gamma']$.
Then
$$\sqrt{n}(\hat\gamma(X_n) - \gamma_0) \stackrel{D}{\to} N(0, (D'WD)^{-1} D'W \Omega W D (D'WD)^{-1})$$
where $\Omega = E_{\theta_0}[g(X; \gamma_0) g(X; \gamma_0)']$.

Proof: We know that, with probability approaching one, the GMM estimator $\hat\gamma(X_n)$ satisfies the equation
$$\hat D(\hat\gamma(X_n); X_n)' \hat W_n \hat q(\hat\gamma(X_n); X_n) = 0 \qquad (6)$$
Expanding $\hat q(\hat\gamma(X_n); X_n)$ about $\gamma_0$ yields
$$\hat q(\hat\gamma(X_n); X_n) = \hat q(\gamma_0; X_n) + D_n^*(X_n)(\hat\gamma(X_n) - \gamma_0) \qquad (7)$$
where $D_n^*(X_n)$ is a $p \times k$ random matrix whose $j$th row is the $j$th row of $\hat D(\gamma; X_n)$ evaluated at some intermediate value $\gamma_{jn}^*$ between $\hat\gamma(X_n)$ and $\gamma_0$. $\gamma_{jn}^*$ may be different from row to row, but it is still consistent for $\gamma_0$. By Conditions vi. and vii., we can invoke Lemma 8.3 to show
$$\sup_{\gamma \in N} \|\hat D(\gamma; X_n) - D_0(\gamma)\| \stackrel{P}{\to} 0$$

where $D_0(\gamma) = E_{\theta_0}[\partial g(X; \gamma)/\partial \gamma']$ and $D_0(\gamma_0) = D$. Since $\gamma_{jn}^*$ will be in $N$ with probability approaching one, we know that $D_n^*(X_n) \stackrel{P}{\to} D$. Plugging (7) into (6), we have that
$$\hat D(\hat\gamma(X_n); X_n)' \hat W_n \{\hat q(\gamma_0; X_n) + D_n^*(X_n)(\hat\gamma(X_n) - \gamma_0)\} = 0$$
or
$$\hat D(\hat\gamma(X_n); X_n)' \hat W_n \hat q(\gamma_0; X_n) + \hat D(\hat\gamma(X_n); X_n)' \hat W_n D_n^*(X_n)(\hat\gamma(X_n) - \gamma_0) = 0$$
This implies that
$$\sqrt{n}(\hat\gamma(X_n) - \gamma_0) = -\{\hat D(\hat\gamma(X_n); X_n)' \hat W_n D_n^*(X_n)\}^{-1} \hat D(\hat\gamma(X_n); X_n)' \hat W_n \sqrt{n}\, \hat q(\gamma_0; X_n)$$

Note that
$$\sqrt{n}\, \hat q(\gamma_0; X_n) \stackrel{D}{\to} N(0, \Omega)$$
$$\hat D(\hat\gamma(X_n); X_n)' \hat W_n D_n^*(X_n) \stackrel{P}{\to} D'WD$$
$$\{\hat D(\hat\gamma(X_n); X_n)' \hat W_n D_n^*(X_n)\}^{-1} \stackrel{P}{\to} (D'WD)^{-1}$$
$$\hat D(\hat\gamma(X_n); X_n)' \hat W_n \stackrel{P}{\to} D'W$$
Using Slutsky's theorem, these results imply that
$$\sqrt{n}(\hat\gamma(X_n) - \gamma_0) \stackrel{D}{\to} N(0, (D'WD)^{-1} D'W \Omega W D (D'WD)^{-1})$$

Estimating the Asymptotic Variance

Use $\hat W_n$ for $W$, $\hat D(\hat\gamma(X_n); X_n)$ for $D$, and $\frac{1}{n}\sum_{i=1}^n g(X_i; \hat\gamma(X_n)) g(X_i; \hat\gamma(X_n))'$ for $\Omega$.

Special Case: $p = k$. In the special case where the dimension of $g(x; \gamma)$ is equal to $k$ and $D$ is a full-rank square matrix, the asymptotic variance simplifies to $D^{-1} \Omega D'^{-1}$, or equivalently $(D' \Omega^{-1} D)^{-1}$.
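A sketch of these plug-in substitutions, assuming hypothetical callables g (returning a length-$p$ moment vector) and g_jac (returning its $p \times k$ Jacobian in $\gamma$):

```python
import numpy as np

def gmm_avar(g, g_jac, X, gamma_hat, W_hat):
    G = np.array([g(x, gamma_hat) for x in X])                  # n x p
    Omega_hat = G.T @ G / len(X)                                # (1/n) sum g g'
    D_hat = np.mean([g_jac(x, gamma_hat) for x in X], axis=0)   # p x k
    bread = np.linalg.inv(D_hat.T @ W_hat @ D_hat)
    meat = D_hat.T @ W_hat @ Omega_hat @ W_hat @ D_hat
    return bread @ meat @ bread  # divide by n to approximate Var(gamma_hat)
```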

Back to Example 9.2:
$$g(Y, X; \gamma) = A(X; \gamma)(Y - \mu(X; \gamma))$$
Here,
$$D = E_{\theta_0}\left[-A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'} + \frac{\partial A(X; \gamma_0)}{\partial \gamma'}(Y - \mu(X; \gamma_0))\right] = -E_{\theta_0}\left[A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'}\right]$$
$$\Omega = \mathrm{Var}_{\theta_0}[E_{\theta_0}[g(Y, X; \gamma_0) \mid X]] + E_{\theta_0}[\mathrm{Var}_{\theta_0}[g(Y, X; \gamma_0) \mid X]] = E_{\theta_0}[A(X; \gamma_0)\, \mathrm{Var}_{\theta_0}[Y \mid X]\, A(X; \gamma_0)']$$
So, the asymptotic variance of the GEE estimator is given by
$$E_{\theta_0}\left[A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'}\right]^{-1} E_{\theta_0}[A(X; \gamma_0)\, \mathrm{Var}_{\theta_0}[Y \mid X]\, A(X; \gamma_0)'] \left\{E_{\theta_0}\left[A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'}\right]'\right\}^{-1}$$

An estimator for the asymptotic variance can be obtained by replacing the expectations by their empirical counterparts. That is, we replace $E_{\theta_0}[A(X; \gamma_0)\, \partial \mu(X; \gamma_0)/\partial \gamma']$ by
$$\frac{1}{n}\sum_{i=1}^n A(X_i; \hat\gamma(X_n)) \frac{\partial \mu(X_i; \hat\gamma(X_n))}{\partial \gamma'}$$
and $E_{\theta_0}[A(X; \gamma_0)\, \mathrm{Var}_{\theta_0}[Y \mid X]\, A(X; \gamma_0)']$ by
$$\frac{1}{n}\sum_{i=1}^n A(X_i; \hat\gamma(X_n))(Y_i - \mu(X_i; \hat\gamma(X_n)))(Y_i - \mu(X_i; \hat\gamma(X_n)))' A(X_i; \hat\gamma(X_n))'$$
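A sketch of this empirical sandwich for the GEE case, with hypothetical callables A, mu, and mu_jac standing in for the working weight matrix, the mean function, and its $q \times k$ Jacobian:

```python
import numpy as np

def gee_avar(A, mu, mu_jac, Y, X, gamma_hat):
    B = np.mean([A(x, gamma_hat) @ mu_jac(x, gamma_hat) for x in X], axis=0)
    resid = [y - mu(x, gamma_hat) for y, x in zip(Y, X)]
    V = np.mean([A(x, gamma_hat) @ np.outer(r, r) @ A(x, gamma_hat).T
                 for x, r in zip(X, resid)], axis=0)
    Binv = np.linalg.inv(B)
    return Binv @ V @ Binv.T  # divide by n to approximate Var(gamma_hat)
```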

Why is this latter quantity a consistent estimator? We know that
$$\frac{1}{n}\sum_{i=1}^n A(X_i; \gamma_0)(Y_i - \mu(X_i; \gamma_0))(Y_i - \mu(X_i; \gamma_0))' A(X_i; \gamma_0)' \qquad (8)$$
converges to $E_{\theta_0}[A(X; \gamma_0)\, \mathrm{Var}_{\theta_0}[Y \mid X]\, A(X; \gamma_0)']$ by the WLLN. Under additional smoothness conditions, together with uniform bounding by an integrable function (to establish uniform convergence in probability), we can plug a consistent estimator of $\gamma_0$ into (8) without altering the resulting probability limit.

Efficiency and GMM Estimators

Let us first consider the problem where $p > k$. For a given moment function, we want to find the optimal choice of $W$. That is, what $p \times p$ matrix $W$ will minimize the asymptotic variance of $\sqrt{n}(\hat\gamma(X_n) - \gamma_0)$? Note that for $p = k$, the choice of $W$ is irrelevant. This is clear because the asymptotic variance does not involve $W$.

Theorem 9.4: The optimal choice of $W$ is to take $W = \Omega^{-1}$, where $\Omega$ is the covariance matrix of $g$.

Proof: Let $Z$ be a random vector with mean zero and covariance matrix $\Omega$. Note that $(D'WD)^{-1} D'W Z$ has covariance matrix
$$(D'WD)^{-1} D'W \Omega W D (D'WD)^{-1}$$
which corresponds to the asymptotic variance of the GMM estimator. Also, note that
$$(D'WD)^{-1} D'W Z = (D'\Omega^{-1}D)^{-1} D'\Omega^{-1} Z + \{(D'WD)^{-1} D'W - (D'\Omega^{-1}D)^{-1} D'\Omega^{-1}\} Z$$
Let $A_1 = (D'WD)^{-1} D'W$ and $A_0 = (D'\Omega^{-1}D)^{-1} D'\Omega^{-1}$. Then, we know that
$$A_1 Z = A_0 Z + (A_1 - A_0) Z$$
This implies that
$$\mathrm{Var}[A_1 Z] = \mathrm{Var}[A_0 Z] + \mathrm{Var}[(A_1 - A_0) Z] + A_0 \mathrm{Var}[Z](A_1 - A_0)' + (A_1 - A_0) \mathrm{Var}[Z] A_0' = \mathrm{Var}[A_0 Z] + \mathrm{Var}[(A_1 - A_0) Z]$$
The cross terms vanish because $A_0 \mathrm{Var}[Z](A_1 - A_0)' = (D'\Omega^{-1}D)^{-1} D'(A_1 - A_0)' = (D'\Omega^{-1}D)^{-1}[(A_1 - A_0)D]' = 0$, since $A_1 D = A_0 D = I$.

This implies that $\mathrm{Var}[A_1 Z] \geq \mathrm{Var}[A_0 Z]$, or
$$(D'WD)^{-1} D'W \Omega W D (D'WD)^{-1} \geq (D'\Omega^{-1}D)^{-1}$$
Remember that when we are comparing covariance matrices, $\geq$ means that the difference between the matrices is positive semi-definite. Say, for example, that $\hat\gamma_n(W)$ corresponds to a GMM estimator with weight matrix which converges in probability to $W$. Suppose that we are interested in estimating $h(\gamma)$, where $h(\cdot)$ maps from $\mathbb{R}^k$ to $\mathbb{R}^1$. By the multivariate delta method, we know that
$$\sqrt{n}(h(\hat\gamma_n(W)) - h(\gamma_0)) \stackrel{D}{\to} N\left(0, \frac{\partial h(\gamma_0)}{\partial \gamma}' \Sigma(W) \frac{\partial h(\gamma_0)}{\partial \gamma}\right)$$
where $\Sigma(W) = (D'WD)^{-1} D'W \Omega W D (D'WD)^{-1}$.

Since $\Sigma(W) \geq \Sigma(\Omega^{-1})$, we know that
$$\frac{\partial h(\gamma_0)}{\partial \gamma}' (\Sigma(W) - \Sigma(\Omega^{-1})) \frac{\partial h(\gamma_0)}{\partial \gamma} \geq 0$$
This implies that the asymptotic variance of any real-valued function of the parameters is minimized by choosing $\hat\gamma_n(\Omega^{-1})$.

For given $g$, the best we can do is to work with the objective function
$$Q(\gamma; X_n) = -\hat q(\gamma; X_n)' \hat\Omega^{-1} \hat q(\gamma; X_n)$$
where $\hat\Omega \stackrel{P}{\to} \Omega$. We already know that $\hat\Omega = \frac{1}{n}\sum_{i=1}^n g(X_i; \hat\gamma(X_n)) g(X_i; \hat\gamma(X_n))'$ is a consistent estimator of $\Omega$, where $\hat\gamma(X_n)$ is a consistent estimator of $\gamma_0$. This suggests the following procedure for estimating $\hat\gamma_n(\Omega^{-1})$:
1. Find a naive estimator of $\gamma_0$, say $\hat\gamma_n(I)$.
2. Compute $\hat\Omega$ using this naive estimator.
3. Compute $\hat\gamma_n(\hat\Omega^{-1})$, i.e., re-estimate using the weight matrix $\hat\Omega^{-1}$.
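A sketch of this three-step recipe, with g again a hypothetical moment function returning a length-$p$ vector per observation:

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g, X, gamma_init):
    qbar = lambda gamma: np.mean([g(x, gamma) for x in X], axis=0)
    # Step 1: naive estimator with W = I.
    gamma_1 = minimize(lambda gm: qbar(gm) @ qbar(gm), gamma_init).x
    # Step 2: estimate Omega at the naive estimator.
    G = np.array([g(x, gamma_1) for x in X])
    Omega_inv = np.linalg.inv(G.T @ G / len(X))
    # Step 3: re-estimate with the estimated optimal weight.
    return minimize(lambda gm: qbar(gm) @ Omega_inv @ qbar(gm), gamma_1).x
```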

Global Efficiency of GMM Estimators

Suppose that there are no nuisance parameters. For given $g$, we know the best $W$ to use. But what is the best $g$? First, we will demonstrate that the smallest asymptotic variance for $\sqrt{n}(\hat\gamma(X_n) - \gamma_0)$ is $I^{-1}(\gamma_0)$, where $I(\gamma_0)$ is the Fisher information matrix. That is, the MLE achieves the lowest variance. Recall that the MLE can be viewed as a GMM estimator by letting $g(x; \gamma) = \psi(x; \gamma)$.

The moment function $g(x; \gamma)$ is assumed to have mean zero, i.e.,
$$E_\gamma[g(X; \gamma)] = \int g(x; \gamma) p(x; \gamma)\, d\mu(x) = 0 \quad \text{for all } \gamma \in \Gamma$$
Under suitable regularity conditions, we can interchange differentiation and integration. Therefore, we know that
$$\frac{\partial}{\partial \gamma'} \int g(x; \gamma) p(x; \gamma)\, d\mu(x) = \int \frac{\partial}{\partial \gamma'} \{g(x; \gamma) p(x; \gamma)\}\, d\mu(x) = 0$$
This implies that
$$\int \frac{\partial g(x; \gamma)}{\partial \gamma'} p(x; \gamma)\, d\mu(x) + \int g(x; \gamma) \frac{\partial p(x; \gamma)}{\partial \gamma'}\, d\mu(x) = 0$$
and
$$E_\gamma\left[\frac{\partial g(X; \gamma)}{\partial \gamma'}\right] = -E_\gamma[g(X; \gamma) \psi(X; \gamma)'] \equiv D(\gamma)$$

Consider the following vector:
$$g(X; \gamma_0) + D I^{-1}(\gamma_0) \psi(X; \gamma_0)$$
Element by element, we can show that this is the residual from the projection of $g(X; \gamma_0)$ onto the space spanned by the elements of the score vector. This random vector has covariance matrix
$$E_{\gamma_0}[(g(X; \gamma_0) + D I^{-1}(\gamma_0) \psi(X; \gamma_0))(g(X; \gamma_0) + D I^{-1}(\gamma_0) \psi(X; \gamma_0))']$$
which is equal to
$$\Omega - D I^{-1}(\gamma_0) D'$$
Since $\Omega - D I^{-1}(\gamma_0) D'$ is a covariance matrix, it must be positive semi-definite.

Note that
$$(D'WD)^{-1} D'W \Omega W D (D'WD)^{-1} - I^{-1}(\gamma_0) = (D'WD)^{-1} D'W \{\Omega - D I^{-1}(\gamma_0) D'\} W D (D'WD)^{-1}$$
This matrix must be positive semi-definite. Since the difference between the asymptotic variance of the GMM estimator and $I^{-1}(\gamma_0)$ is guaranteed to be positive semi-definite, we know that no GMM estimator can have smaller asymptotic variance than $I^{-1}(\gamma_0)$.

Efficiency Results for GEEs

For the class of GEE estimators, efficiency depends on the choice of the matrix $A(X; \gamma)$. Remember that the asymptotic variance of a GEE estimator is
$$E_{\theta_0}\left[A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'}\right]^{-1} E_{\theta_0}[A(X; \gamma_0)\, \mathrm{Var}_{\theta_0}[Y \mid X]\, A(X; \gamma_0)'] \left\{E_{\theta_0}\left[A(X; \gamma_0) \frac{\partial \mu(X; \gamma_0)}{\partial \gamma'}\right]'\right\}^{-1}$$
What choice of the $A$ matrix will minimize this asymptotic variance?

Theorem 9.5: The optimal choice of $A(X; \gamma)$ is $M(\gamma)'\, \mathrm{Var}_{\theta_0}[Y \mid X]^{-1}$, where $M(\gamma) = \partial \mu(X; \gamma)/\partial \gamma'$.

Proof: The proof is the asymptotic version of the Gauss-Markov theorem. For simplicity, we suppress notation which is in parentheses or subscripted. Let $H_A = \{E[AM]\}^{-1} A$. Then,
$$\mathrm{Var}[H_A(Y - \mu)] = E[\mathrm{Var}[H_A(Y - \mu) \mid X]] + \mathrm{Var}[E[H_A(Y - \mu) \mid X]] = E[H_A\, \mathrm{Var}[Y \mid X]\, H_A'] = \{E[AM]\}^{-1} E[A\, \mathrm{Var}[Y \mid X]\, A'] \{E[AM]'\}^{-1}$$
This is the asymptotic variance of the GEE estimator. The claim is that $A_{opt} = M'\, \mathrm{Var}[Y \mid X]^{-1}$, in which case
$$H_{opt} = \{E[M'\, \mathrm{Var}[Y \mid X]^{-1} M]\}^{-1} M'\, \mathrm{Var}[Y \mid X]^{-1}$$
and the asymptotic variance is equal to
$$\{E[M'\, \mathrm{Var}[Y \mid X]^{-1} M]\}^{-1}$$

Note that
$$\mathrm{Var}[H_A(Y - \mu)] = \mathrm{Var}[(H_A - H_{opt})(Y - \mu) + H_{opt}(Y - \mu)]$$
$$= \mathrm{Var}[(H_A - H_{opt})(Y - \mu)] + \mathrm{Var}[H_{opt}(Y - \mu)] + E[(H_A - H_{opt})(Y - \mu)(Y - \mu)' H_{opt}'] + E[H_{opt}(Y - \mu)(Y - \mu)'(H_A - H_{opt})']$$
$$= \mathrm{Var}[(H_A - H_{opt})(Y - \mu)] + \mathrm{Var}[H_{opt}(Y - \mu)]$$
The cross terms vanish: conditioning on $X$,
$$E[(H_A - H_{opt})(Y - \mu)(Y - \mu)' H_{opt}'] = E[(H_A - H_{opt})\, \mathrm{Var}[Y \mid X]\, \mathrm{Var}[Y \mid X]^{-1} M] \{E[M'\, \mathrm{Var}[Y \mid X]^{-1} M]\}^{-1} = E[(H_A - H_{opt}) M] \{E[M'\, \mathrm{Var}[Y \mid X]^{-1} M]\}^{-1} = 0$$
since $E[H_A M] = E[H_{opt} M] = I$. This implies that $\mathrm{Var}[H_A(Y - \mu)] - \mathrm{Var}[H_{opt}(Y - \mu)]$ is positive semi-definite, which gives the desired result.
