Section 10: Role of influence functions in characterizing large sample efficiency

1. Recall that large sample efficiency (of the MLE) can refer only to a class of regular estimators.
2. To show this, we will first give a characterization of the class of regular estimators through their influence functions.
3. To prove this characterization, we will use the concept of contiguity.

Section 10.1 Asymptotically linear estimators

Let $X^n = (X_1, X_2, \ldots, X_n)$, where the $X_i$'s are i.i.d. $p(x; \theta_0) \in \mathcal{P} = \{p(x; \theta) : \theta \in \Theta\}$. Suppose that we are interested in estimating $\gamma(\theta)$, where $\gamma(\cdot) : \Theta \to R^k$ and $\Theta \subset R^q$. Most estimators that we will consider are asymptotically linear. That is, there exists a random vector $\varphi(X)$ such that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) + o_P(1)$$
with $E_{\theta_0}[\varphi(X)] = 0$ and $E_{\theta_0}[\varphi(X)\varphi(X)^\top]$ finite and non-singular.
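As a quick sanity check of this definition, here is a minimal simulation sketch (the exponential-rate model and all variable names are my own illustration, not part of the notes): the MLE $\hat{\gamma} = 1/\bar{X}$ of an exponential rate $\gamma_0$ is asymptotically linear with influence function $\varphi(x) = \gamma_0 - \gamma_0^2 x$, so the remainder term should be $o_P(1)$.

```python
# A minimal sketch (model chosen purely for illustration): check that
# sqrt(n)(gamma_hat - gamma_0) - n^{-1/2} sum_i phi(X_i) shrinks to zero
# for the exponential-rate MLE, whose influence function is
# phi(x) = gamma_0 - gamma_0^2 * x.
import numpy as np

rng = np.random.default_rng(0)
gamma0 = 2.0  # true rate; X_i ~ Exponential(rate = gamma0)

for n in [100, 1_000, 10_000]:
    remainders = np.empty(2_000)
    for r in range(remainders.size):
        x = rng.exponential(scale=1.0 / gamma0, size=n)
        gamma_hat = 1.0 / x.mean()            # MLE of the rate
        phi = gamma0 - gamma0**2 * x          # influence function values
        remainders[r] = np.sqrt(n) * (gamma_hat - gamma0) - phi.sum() / np.sqrt(n)
    print(n, np.mean(np.abs(remainders)))     # decreases toward 0 with n
```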

The function $\varphi(X)$ is referred to as an influence function. The phrase influence function was used by Hampel (JASA, 1974) and is motivated by the fact that, to first order, $\varphi(x)$ is the influence of a single observation on the estimator $\hat{\gamma}(X^n)$. Consider $\hat{\gamma}(x_1, \ldots, x_n, x)$ as a function of $x$. Since
$$\sqrt{n+1}(\hat{\gamma}(x_1, \ldots, x_n, x) - \gamma_0) \approx \frac{1}{\sqrt{n+1}} \left( \sum_{i=1}^n \varphi(x_i) + \varphi(x) \right),$$
we see that
$$\hat{\gamma}(x_1, \ldots, x_n, x) \approx \gamma_0 + \frac{1}{n+1} \sum_{i=1}^n \varphi(x_i) + \frac{1}{n+1} \varphi(x),$$
so the contribution of the observation $x$ to the estimator is $\varphi(x)/(n+1)$.

The asymptotic properties of an asymptotically linear estimator $\hat{\gamma}(X^n)$ can be summarized by considering only its influence function. Since $\varphi(X)$ has mean zero, the CLT tells us that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) \xrightarrow{D} N(0, E_{\theta_0}[\varphi(X)\varphi(X)^\top]).$$
By Slutsky's theorem, we know that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) \xrightarrow{D} N(0, E_{\theta_0}[\varphi(X)\varphi(X)^\top]).$$
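Continuing the exponential-rate sketch above (still my own illustration), this conclusion can be checked by comparing the Monte Carlo variance of $\sqrt{n}(\hat{\gamma} - \gamma_0)$ with $E_{\theta_0}[\varphi(X)^2] = \gamma_0^4 \, \mathrm{Var}(X) = \gamma_0^2$:

```python
# Sketch: the Monte Carlo variance of sqrt(n)(gamma_hat - gamma_0) should be
# close to E[phi(X)^2] = gamma_0^2 (= 4 here).
import numpy as np

rng = np.random.default_rng(1)
gamma0, n, reps = 2.0, 2_000, 2_000
xbar = rng.exponential(1.0 / gamma0, (reps, n)).mean(axis=1)
draws = np.sqrt(n) * (1.0 / xbar - gamma0)
print(draws.var(), gamma0**2)   # both close to 4
```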

Influence Function for GMM Estimators

From the previous section, we know that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = -\{\hat{D}(\hat{\gamma}(X^n); X^n)^\top \hat{W}_n \hat{D}(\hat{\gamma}(X^n); X^n)\}^{-1} \hat{D}(\hat{\gamma}(X^n); X^n)^\top \hat{W}_n \frac{1}{\sqrt{n}} \sum_{i=1}^n g(X_i; \gamma_0) + o_P(1)$$
$$= -\{D^\top W D\}^{-1} D^\top W \frac{1}{\sqrt{n}} \sum_{i=1}^n g(X_i; \gamma_0) + o_P(1).$$
Hence, the influence function is
$$\varphi(x) = -\{D^\top W D\}^{-1} D^\top W g(x; \gamma_0).$$
When the dimension of $g$ is the same as that of $\gamma$ and $D$ and $W$ are of full rank, the influence function reduces to
$$\varphi(x) = -D^{-1} g(x; \gamma_0).$$
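The following sketch (an over-identified toy example of my own, with weight matrix $W = I$) checks this linearization numerically: the rate $\gamma$ of an exponential is estimated from the two moment conditions $g(x; \gamma) = (x - 1/\gamma, \ x^2 - 2/\gamma^2)^\top$, for which $D = (1/\gamma^2, \ 4/\gamma^3)^\top$.

```python
# A hedged sketch (model and moments are my own illustration): over-identified
# GMM for an exponential rate gamma with W = I; check that
# sqrt(n)(gamma_hat - gamma_0) matches the linearization with
# phi(x) = -(D'WD)^{-1} D'W g(x; gamma_0).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
gamma0, n = 2.0, 20_000
x = rng.exponential(1.0 / gamma0, n)

def gbar(gamma):
    return np.array([x.mean() - 1.0 / gamma, (x**2).mean() - 2.0 / gamma**2])

res = minimize_scalar(lambda g: gbar(g) @ gbar(g), bounds=(0.1, 10.0), method="bounded")
gamma_hat = res.x

D = np.array([1.0 / gamma0**2, 4.0 / gamma0**3])             # E[dg/dgamma]
g0 = np.vstack([x - 1.0 / gamma0, x**2 - 2.0 / gamma0**2])   # 2 x n moment values
phi = -(D @ g0) / (D @ D)                                    # influence values, W = I
print(np.sqrt(n) * (gamma_hat - gamma0), phi.sum() / np.sqrt(n))  # approximately equal
```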

Example: GEE

The GEE is the solution to
$$\sum_{i=1}^n A(X_i; \gamma)(Y_i - \mu(X_i; \gamma)) = 0.$$
As a special case of a GMM estimator, we showed that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n E_{\theta_0}[A(X; \gamma_0) M(X; \gamma_0)]^{-1} A(X_i; \gamma_0)(Y_i - \mu(X_i; \gamma_0)) + o_P(1).$$
Therefore, the influence function is
$$\varphi(y, x) = E_{\theta_0}[A(X; \gamma_0) M(X; \gamma_0)]^{-1} A(x; \gamma_0)(y - \mu(x; \gamma_0)).$$
The influence function associated with the optimal GEE is
$$\varphi(y, x) = E_{\theta_0}[M(X; \gamma_0)^\top \mathrm{Var}_{\theta_0}[Y \mid X]^{-1} M(X; \gamma_0)]^{-1} M(x; \gamma_0)^\top \mathrm{Var}_{\theta_0}[Y \mid x]^{-1} (y - \mu(x; \gamma_0)).$$
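As a check (a simple linear-model GEE of my own construction, not from the notes): with $\mu(x; \gamma) = \gamma x$ and $A(x) = x$, the GEE is least squares, $M(x; \gamma) = x$, and the influence function above reduces to $\varphi(y, x) = E[X^2]^{-1} x (y - \gamma_0 x)$.

```python
# Sketch: verify the GEE linearization for mu(x; gamma) = gamma * x, A(x) = x,
# where phi(y, x) = E[X^2]^{-1} x (y - gamma_0 x) and E[X^2] = 1 here.
import numpy as np

rng = np.random.default_rng(3)
gamma0, n, reps = 1.5, 2_000, 2_000
z = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    y = gamma0 * x + rng.normal(size=n) * (1.0 + 0.5 * np.abs(x))  # heteroskedastic errors
    gamma_hat = (x * y).sum() / (x * x).sum()   # GEE / least squares solution
    phi = x * (y - gamma0 * x)                  # influence values (E[X^2] = 1)
    z[r] = np.sqrt(n) * (gamma_hat - gamma0) - phi.sum() / np.sqrt(n)
print(np.mean(np.abs(z)))   # small: the linearization holds
```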


Lemma: An asymptotically linear estimator must have a unique (a.s.) influence function.

Proof: Suppose the influence function is not unique. That is, there exists another influence function $\varphi^*(x)$ such that $E_{\theta_0}[\varphi^*(X)] = 0$ and
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi^*(X_i) + o_P(1).$$
Since
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) + o_P(1),$$
we know that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \{\varphi(X_i) - \varphi^*(X_i)\} = o_P(1).$$

By the CLT, we know that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n \{\varphi(X_i) - \varphi^*(X_i)\} \xrightarrow{D} N(0, \mathrm{Var}_{\theta_0}[\varphi(X) - \varphi^*(X)]).$$
In order for this limiting distribution to be $o_P(1)$, we must have $\mathrm{Var}_{\theta_0}[\varphi(X) - \varphi^*(X)] = 0$, or $\varphi(X) - \varphi^*(X) = 0$ a.e.

In parametric models (i.e., $\Theta$ is finite-dimensional), if an estimator is regular and asymptotically linear (RAL), then we can establish some useful results regarding the geometry of the influence function, as well as precisely define efficiency. We assume that $\gamma(\theta)$ is continuously differentiable, at least in a neighborhood of the truth $\theta_0$. Let $\gamma_0 = \gamma(\theta_0)$.

Recall the definition of a regular estimator (Section 8).

Definition: Consider a local data generating process (LDGP) where, for each $n$, the data are distributed according to $\theta_n$, with $\sqrt{n}(\theta_n - \theta_0) \to \tau$. An estimator $\hat{\gamma}(X^n)$ is said to be regular if, for each $\theta_0$, $\sqrt{n}(\hat{\gamma}(X^n) - \gamma(\theta_n))$ has a limiting distribution that does not depend on the LDGP. So, if
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) \xrightarrow{D(\theta_0)} N(0, \Sigma_0),$$
then
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma(\theta_n)) \xrightarrow{D(\theta_n)} N(0, \Sigma_0).$$
We say that convergence is uniform in a shrinking neighborhood of the true parameter value. Regularity rules out superefficient estimators (see the example due to Hodges).
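To see what regularity rules out, here is a simulation sketch of Hodges' estimator (a standard example; the simulation details are my own): under the LDGP $\theta_n = \tau/\sqrt{n}$, the limiting behavior of $\sqrt{n}(\hat{\theta}_H - \theta_n)$ depends on $\tau$, so the estimator is not regular.

```python
# Sketch of Hodges' superefficient estimator: for X_i ~ N(theta, 1), take
# theta_hat = Xbar if |Xbar| > n^{-1/4}, else 0.  Under theta_n = tau/sqrt(n),
# the distribution of sqrt(n)(theta_hat - theta_n) drifts with tau.
import numpy as np

rng = np.random.default_rng(4)
n, reps = 10_000, 20_000
for tau in [0.0, 1.0, 3.0]:
    theta_n = tau / np.sqrt(n)
    xbar = theta_n + rng.normal(size=reps) / np.sqrt(n)   # distribution of Xbar
    hodges = np.where(np.abs(xbar) > n**-0.25, xbar, 0.0)
    z = np.sqrt(n) * (hodges - theta_n)
    print(tau, z.mean(), z.var())   # limit depends on tau: not regular
```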

Section 10.2 Contiguity

Most density functions in parametric models satisfying the kind of regularity conditions we have imposed have the property that the sequence of distributions $P_n(X^n; \theta_n)$ is contiguous to the sequence of distributions $P_n(X^n; \theta_0)$, where $\sqrt{n}(\theta_n - \theta_0) \to \tau$. More formally, consider the following definition:

Definition: Consider a sequence of probability measures $\{P_n, Q_n\}$ on the sample spaces $\mathcal{X}_n$ with respect to common dominating measures $\mu_n$, defined on the $\sigma$-algebras $\mathcal{F}_n$. Let $\{p_n, q_n\}$ be the associated sequence of densities. If for any sequence of events $\{A_n\}$ with $A_n \in \mathcal{F}_n$,
$$P_n(A_n) \to 0 \implies Q_n(A_n) \to 0,$$
then we say that the densities $\{q_n\}$ are contiguous to the densities $\{p_n\}$.

This is especially useful since contiguity implies that any sequence of random variables converging to zero in $P_n$-probability also converges to zero in $Q_n$-probability. To see this, suppose that $Z_n \xrightarrow{P_n} 0$. This means that for all $\epsilon > 0$, $P_n[|Z_n| > \epsilon] \to 0$ as $n \to \infty$. Contiguity implies that $Q_n[|Z_n| > \epsilon] \to 0$ as $n \to \infty$. Thus $Z_n \xrightarrow{Q_n} 0$.

A useful sufficient condition for identifying when a sequence of probability densities is contiguous is given by LeCam's first lemma.

LeCam's First Lemma

First, we give some preliminary notation and remind you of the results of the Neyman-Pearson lemma (see Section 8.3 of Casella and Berger) for finding most powerful tests. For a fixed sample size $n$, consider $p_n(x^n)$ to be the fixed null hypothesis and $q_n(x^n)$ to be the fixed alternative. The Neyman-Pearson lemma states the following:

For any level $0 \le \alpha_n \le 1$, the most powerful (randomized) test is given by the test function $\phi_n$ defined as
$$\phi_n = \begin{cases} 0 & q_n < k_n p_n \quad \text{(do not reject the null)} \\ \xi_n & q_n = k_n p_n \quad \text{(reject the null with probability } \xi_n) \\ 1 & q_n > k_n p_n \quad \text{(reject the null)} \end{cases}$$
Such a $\phi_n$ can always be defined so that the level is equal to $\alpha_n$, i.e.,
$$\alpha_n = \int \phi_n \, dP_n.$$
In particular, for any sequence of events $\{A_n\} \in \mathcal{F}_n$, a corresponding sequence of test functions can be constructed so that
$$\alpha_n = P_n(A_n) = \int \phi_n \, dP_n.$$
Clearly, another sequence of level-$\alpha_n$ test functions is given by $\phi_n^*(x^n) = I(x^n \in A_n)$, since $\alpha_n = P_n(A_n) = \int \phi_n^* \, dP_n$.

However, the N-P lemma tells us that the most powerful test is given by the likelihood ratio test $\phi_n(x^n)$. Therefore,
$$\text{Power of } \phi_n = \int \phi_n \, dQ_n \ge \int \phi_n^* \, dQ_n = \text{Power of } \phi_n^* = Q_n[A_n].$$
This implies that to show contiguity of $\{q_n\}$ to $\{p_n\}$, it suffices to show that for the test functions $\phi_n$ defined above,
$$\int \phi_n \, dP_n = P_n[A_n] \to 0 \implies \int \phi_n \, dQ_n \to 0.$$

Next we introduce a sequence of likelihood ratio random variables:
$$L_n(x^n) = \frac{q_n(x^n)}{p_n(x^n)}.$$
More specifically, we define
$$L_n(x^n) = \begin{cases} q_n(x^n)/p_n(x^n) & p_n(x^n) > 0 \\ 1 & p_n(x^n) = q_n(x^n) = 0 \\ \infty & p_n(x^n) = 0, \ q_n(x^n) > 0 \end{cases}$$
Let $F_n$ be the distribution of $L_n$ under $P_n$, i.e., $F_n(u) = P_n[L_n(X^n) \le u]$.

Lemma 10.1 (LeCam's First Lemma): Assume that $F_n$ converges to a distribution function $F$ such that $\int_0^\infty u \, dF(u) = 1$. Then the densities $\{q_n\}$ are contiguous to the densities $\{p_n\}$.

Proof: Take a sequence of test functions $\phi_n$ such that $\int \phi_n \, dP_n \to 0$. For any $y > 0$, note that
$$\begin{aligned}
\int \phi_n \, dQ_n &= \int_{L_n \le y} \phi_n \, dQ_n + \int_{L_n > y} \phi_n \, dQ_n \\
&\le \int_{L_n \le y} \phi_n \frac{q_n}{p_n} p_n \, d\mu_n + \int_{L_n > y} dQ_n \\
&= \int_{L_n \le y} \phi_n L_n \, dP_n + \int_{L_n > y} dQ_n \\
&\le y \int \phi_n \, dP_n + 1 - \int_{L_n \le y} dQ_n \\
&\le y \int \phi_n \, dP_n + 1 - \int_{L_n \le y} L_n \, dP_n \\
&= y \int \phi_n \, dP_n + 1 - \int I(L_n \le y) L_n \, dP_n \\
&= y \int \phi_n \, dP_n + 1 - E_{P_n}[I(L_n \le y) L_n] \\
&= y \int \phi_n \, dP_n + 1 - \int_0^y u \, dF_n(u),
\end{aligned}$$
where the last inequality uses $\int_{L_n \le y} dQ_n \ge \int_{L_n \le y} L_n \, dP_n$.

Now, choose $y$ (a continuity point of $F$) such that $\int_y^\infty u \, dF(u) < \epsilon/4$. Since $F_n \to F$, we know that $\int_0^y u \, dF_n(u) \to \int_0^y u \, dF(u)$. Therefore, since $\int_0^\infty u \, dF(u) = 1$, there is some $N_0$ such that for all $n > N_0$, $1 - \int_0^y u \, dF_n(u) < \epsilon/2$ (use the triangle inequality). Furthermore, since $\int \phi_n \, dP_n \to 0$, we know that there exists $N_1$ so that for all $n > N_1$, $y \int \phi_n \, dP_n < \epsilon/2$. Therefore, $\int \phi_n \, dQ_n < \epsilon$ for $n > \max(N_0, N_1)$. Thus, we have shown that $\int \phi_n \, dQ_n \to 0$, and hence the contiguity of the sequence $\{q_n\}$ to the sequence $\{p_n\}$.

Corollary 10.2: If $\log L_n \xrightarrow{D(P_n)} N(-\sigma^2/2, \sigma^2)$, then the densities $\{q_n\}$ are contiguous to the densities $\{p_n\}$.

Proof: We see that $L_n$ converges in distribution to a log-normal random variable. Verify that the expectation of a log-normal random variable whose logarithm has mean $-\sigma^2/2$ and variance $\sigma^2$ is equal to 1: if $Z \sim N(-\sigma^2/2, \sigma^2)$, then $E[e^Z] = \exp(-\sigma^2/2 + \sigma^2/2) = 1$, so Lemma 10.1 applies.

We can now apply Corollary 10.2 to our i.i.d. situation. We know that $p_n(x^n) = \prod_{i=1}^n p(x_i; \theta_0)$ and $q_n(x^n) = \prod_{i=1}^n p(x_i; \theta_n)$, where $\sqrt{n}(\theta_n - \theta_0) \to \tau$. Now,
$$\log L_n(X^n) = \log \left\{ \frac{q_n(X^n)}{p_n(X^n)} \right\} = \sum_{i=1}^n \{\log p(X_i; \theta_n) - \log p(X_i; \theta_0)\}.$$
By a mean value expansion, we see that
$$\sum_{i=1}^n \log p(X_i; \theta_n) = \sum_{i=1}^n \log p(X_i; \theta_0) + (\theta_n - \theta_0)^\top \sum_{i=1}^n \frac{\partial \log p(X_i; \theta_0)}{\partial \theta} + \frac{1}{2} (\theta_n - \theta_0)^\top \sum_{i=1}^n \frac{\partial^2 \log p(X_i; \theta_n^*)}{\partial \theta \, \partial \theta^\top} (\theta_n - \theta_0).$$

This implies that
$$\log L_n(X^n) = \left\{ \frac{1}{\sqrt{n}} \sum_{i=1}^n \psi(X_i; \theta_0) \right\}^\top \sqrt{n}(\theta_n - \theta_0) + \frac{1}{2} \sqrt{n}(\theta_n - \theta_0)^\top \left\{ \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 \log p(X_i; \theta_n^*)}{\partial \theta \, \partial \theta^\top} \right\} \sqrt{n}(\theta_n - \theta_0)$$
$$\xrightarrow{D} N\left( -\tfrac{1}{2} \tau^\top I(\theta_0) \tau, \ \tau^\top I(\theta_0) \tau \right).$$
This satisfies the corollary with $\sigma^2 = \tau^\top I(\theta_0) \tau$, thus proving that the densities $\{\prod_{i=1}^n p(x_i; \theta_n)\}$ are contiguous to $\{\prod_{i=1}^n p(x_i; \theta_0)\}$.
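A quick simulation sketch (assuming the $N(\theta, 1)$ model; my own check, not from the notes): with $\theta_n = \theta_0 + \tau/\sqrt{n}$, the log likelihood ratio should be approximately $N(-\tau^2 I(\theta_0)/2, \ \tau^2 I(\theta_0))$ under $\theta_0$, where $I(\theta_0) = 1$.

```python
# Sketch: log L_n for N(theta, 1) under the LDGP theta_n = theta_0 + tau/sqrt(n)
# should have mean near -tau^2/2 and variance near tau^2 under theta_0.
import numpy as np

rng = np.random.default_rng(5)
theta0, tau, n, reps = 0.0, 1.5, 1_000, 5_000
theta_n = theta0 + tau / np.sqrt(n)
x = theta0 + rng.normal(size=(reps, n))     # data generated under theta_0
logLn = (-0.5 * (x - theta_n) ** 2 + 0.5 * (x - theta0) ** 2).sum(axis=1)
print(logLn.mean(), -tau**2 / 2)   # both near -1.125
print(logLn.var(), tau**2)         # both near 2.25
```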

Section 10.3 Characterization of regular asymptotically linear estimators

Theorem 10.3: Suppose that $\hat{\gamma}(X^n)$ is an asymptotically linear estimator with influence function $\varphi(X)$, and $E_\theta[\varphi(X)\varphi(X)^\top]$ exists and is continuous in a neighborhood of $\theta_0$. Then $\hat{\gamma}(X^n)$ is regular if and only if
$$\frac{\partial \gamma(\theta_0)}{\partial \theta^\top} = E_{\theta_0}[\varphi(X) \psi(X; \theta_0)^\top].$$

Proof: By the definition of an influence function, we know that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) + o_{P(\theta_0)}(1).$$

That is,
$$P_{\theta_0}\left[ \left\| \sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) - \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) \right\| > \epsilon \right] \to 0.$$
By contiguity, we also know that
$$P_{\theta_n}\left[ \left\| \sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) - \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) \right\| > \epsilon \right] \to 0,$$
or $\left[ \sqrt{n}(\hat{\gamma}(X^n) - \gamma_0) - \frac{1}{\sqrt{n}} \sum_{i=1}^n \varphi(X_i) \right] \xrightarrow{P(\theta_n)} 0$. Adding and subtracting similar terms, we find that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_n) - \frac{1}{\sqrt{n}} \sum_{i=1}^n (\varphi(X_i) - E_{\theta_n}[\varphi(X)]) + \sqrt{n}(\gamma_n - \gamma_0) - \sqrt{n}\, E_{\theta_n}[\varphi(X)] \xrightarrow{P(\theta_n)} 0,$$

or
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_n) = \underbrace{\frac{1}{\sqrt{n}} \sum_{i=1}^n (\varphi(X_i) - E_{\theta_n}[\varphi(X)])}_{(1)} \ - \ \underbrace{\sqrt{n}(\gamma_n - \gamma_0)}_{(2)} \ + \ \underbrace{\sqrt{n}\, E_{\theta_n}[\varphi(X)]}_{(3)} \ + \ o_{P(\theta_n)}(1).$$

We can invoke the CLT for triangular arrays to show that (1) converges in distribution to a $N(0, E_{\theta_0}[\varphi(X)\varphi(X)^\top])$ random vector. For (2), we start by using the mean value theorem to show that
$$\gamma_n = \gamma(\theta_n) = \gamma_0 + \frac{\partial \gamma(\theta_n^*)}{\partial \theta^\top} (\theta_n - \theta_0).$$
This implies that $\sqrt{n}(\gamma_n - \gamma_0) = \frac{\partial \gamma(\theta_n^*)}{\partial \theta^\top} \sqrt{n}(\theta_n - \theta_0) \to \frac{\partial \gamma(\theta_0)}{\partial \theta^\top} \tau$. For (3), we know that
$$\begin{aligned}
\sqrt{n}\, E_{\theta_n}[\varphi(X)] &= \sqrt{n} \int \varphi(x)\, p(x; \theta_n) \, d\mu(x) \\
&= \sqrt{n} \int \varphi(x) \left\{ p(x; \theta_0) + \frac{\partial p(x; \theta_n^*)}{\partial \theta^\top} (\theta_n - \theta_0) \right\} d\mu(x) \\
&= \sqrt{n} \left\{ \int \varphi(x)\, p(x; \theta_0) \, d\mu(x) + \int \varphi(x) \frac{\partial p(x; \theta_n^*)/\partial \theta^\top}{p(x; \theta_0)}\, p(x; \theta_0)\, (\theta_n - \theta_0) \, d\mu(x) \right\} \\
&= \int \varphi(x) \frac{\partial p(x; \theta_n^*)/\partial \theta^\top}{p(x; \theta_0)}\, p(x; \theta_0)\, \sqrt{n}(\theta_n - \theta_0) \, d\mu(x),
\end{aligned}$$
since $\int \varphi(x)\, p(x; \theta_0)\, d\mu(x) = 0$.

Thus,
$$\sqrt{n}\, E_{\theta_n}[\varphi(X)] \to \int \varphi(x) \psi(x; \theta_0)^\top \tau \, p(x; \theta_0) \, d\mu(x) = E_{\theta_0}[\varphi(X) \psi(X; \theta_0)^\top]\, \tau.$$
Putting all of these results together, we find that
$$\sqrt{n}(\hat{\gamma}(X^n) - \gamma_n) \xrightarrow{D(\theta_n)} N\left( E_{\theta_0}[\varphi(X)\psi(X;\theta_0)^\top]\tau - \frac{\partial \gamma(\theta_0)}{\partial \theta^\top}\tau, \ E_{\theta_0}[\varphi(X)\varphi(X)^\top] \right).$$
In order for $\hat{\gamma}(X^n)$ to be regular, the limit must not depend on $\tau$, so we must have
$$E_{\theta_0}[\varphi(X)\psi(X;\theta_0)^\top] - \frac{\partial \gamma(\theta_0)}{\partial \theta^\top} = 0,$$
which gives the desired result. The other direction is obviously true.

Section 10.4 Geometry and Efficiency of influence functions among regular asymptotically linear estimators

Suppose that we parameterize our problem so that $\theta = (\gamma^\top, \lambda^\top)^\top$. In this case,
$$\frac{\partial \gamma(\theta_0)}{\partial \theta^\top} = [\, I \ \ 0 \,] = \left[\, E_{\theta_0}[\varphi(X)\psi_\gamma(X;\theta_0)^\top] \ \ E_{\theta_0}[\varphi(X)\psi_\lambda(X;\theta_0)^\top] \,\right].$$
Consider the Hilbert space $\mathcal{H}$ of $k$-dimensional functions of $X$ with mean zero and finite second moments, equipped with inner product $\langle h_1, h_2 \rangle = E_{\theta_0}[h_1(X)^\top h_2(X)]$. The nuisance tangent space is the linear subspace
$$\Lambda = \{B \psi_\lambda(X; \theta_0) : B \text{ an arbitrary } k \times (q - k) \text{ matrix of real numbers}\}.$$
The theorem tells us that the influence function $\varphi(X)$ of a RAL estimator $\hat{\gamma}(X^n)$ must be orthogonal to every element in $\Lambda$. That is, $\varphi(X) \in \Lambda^\perp = \{h(X) \in \mathcal{H} : E_{\theta_0}[h(X)^\top B \psi_\lambda(X; \theta_0)] = 0 \text{ for all } B\}$. In addition, $E_{\theta_0}[\varphi(X)\psi_\gamma(X;\theta_0)^\top] = I$.
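A numeric illustration (a normal-model example of my own, not from the notes): for $X \sim N(\gamma, \lambda)$, the sample mean is RAL for $\gamma$ with $\varphi(x) = x - \gamma$; the two conditions above then read $E_{\theta_0}[\varphi(X)\psi_\gamma(X;\theta_0)] = 1$ and $E_{\theta_0}[\varphi(X)\psi_\lambda(X;\theta_0)] = 0$.

```python
# Sketch: check the orthogonality conditions for the sample mean in the
# N(gamma, lambda) model, where psi_gamma = (x - gamma)/lambda and
# psi_lambda = ((x - gamma)^2 - lambda) / (2 lambda^2).
import numpy as np

rng = np.random.default_rng(6)
gamma0, lam0 = 1.0, 4.0
x = rng.normal(gamma0, np.sqrt(lam0), 1_000_000)

phi = x - gamma0                                          # influence function
psi_gamma = (x - gamma0) / lam0                           # score for the mean
psi_lambda = ((x - gamma0) ** 2 - lam0) / (2 * lam0**2)   # score for the variance
print(np.mean(phi * psi_gamma))   # approximately 1
print(np.mean(phi * psi_lambda))  # approximately 0
```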

Can we show how to take elements from $\Lambda^\perp$ and construct an estimating equation whose solution is regular and asymptotically linear? As we will see, the random vectors in $\Lambda^\perp$ depend on $\theta_0$. Consider any particular element in $\Lambda^\perp$, say $h(X; \theta_0)$. Now, we know that
$$E_{\gamma_0, \lambda_0}[h(X; \gamma_0, \lambda_0)] = 0$$
whatever be $\gamma_0 \in \Gamma$ and $\lambda_0$ in the parameter space for $\lambda$. For given $\gamma$, let $\hat{\lambda}_n(\gamma)$ be a profile estimator of $\lambda$. Assume that $\hat{\lambda}_n(\gamma)$ is a continuously differentiable function of $\gamma$ in a neighborhood of $\gamma_0$ and that $\partial \hat{\lambda}_n(\gamma_0)/\partial \gamma^\top$ converges in probability to $d(\theta_0)$. Also assume that, for some $\epsilon > 0$, $n^{1/4+\epsilon}(\hat{\lambda}_n(\gamma_0) - \lambda_0)$ is bounded in probability.

For example, suppose that for fixed $\gamma$, $\hat{\lambda}_n(\gamma)$ solves $\sum_{i=1}^n m(X_i, \gamma, \lambda) = 0$, where $m(x; \gamma, \lambda)$ is an estimating function of the same dimension as $\lambda$ and $E_{\theta_0}[m(X; \gamma_0, \lambda_0)] = 0$. It can be shown that $\sqrt{n}(\hat{\lambda}_n(\gamma_0) - \lambda_0)$ is bounded in probability. By the implicit function theorem, we know that
$$0 = \frac{\partial}{\partial \gamma^\top} \left\{ \sum_{i=1}^n m(X_i, \gamma_0, \hat{\lambda}_n(\gamma_0)) \right\} = \sum_{i=1}^n \frac{\partial m(X_i, \gamma_0, \hat{\lambda}_n(\gamma_0))}{\partial \gamma^\top} + \sum_{i=1}^n \frac{\partial m(X_i, \gamma_0, \hat{\lambda}_n(\gamma_0))}{\partial \lambda^\top} \frac{\partial \hat{\lambda}_n(\gamma_0)}{\partial \gamma^\top}.$$
This implies that
$$\frac{\partial \hat{\lambda}_n(\gamma_0)}{\partial \gamma^\top} = -\left[ \sum_{i=1}^n \frac{\partial m(X_i, \gamma_0, \hat{\lambda}_n(\gamma_0))}{\partial \lambda^\top} \right]^{-1} \sum_{i=1}^n \frac{\partial m(X_i, \gamma_0, \hat{\lambda}_n(\gamma_0))}{\partial \gamma^\top} \xrightarrow{P} -E_{\theta_0}\left[ \frac{\partial m(X; \gamma_0, \lambda_0)}{\partial \lambda^\top} \right]^{-1} E_{\theta_0}\left[ \frac{\partial m(X; \gamma_0, \lambda_0)}{\partial \gamma^\top} \right] \equiv d(\theta_0).$$

Now, estimate $\gamma_0$ as the solution, $\hat{\gamma}_n$, to
$$\sum_{i=1}^n h(X_i; \gamma, \hat{\lambda}_n(\gamma)) = 0.$$
Under regularity conditions, it can be shown that $\hat{\gamma}_n$ is a consistent estimator of $\gamma_0$. What is the influence function for $\hat{\gamma}_n$? By mean value expansions, we know that
$$0 = \frac{1}{\sqrt{n}} \sum_{i=1}^n h(X_i; \hat{\gamma}_n, \hat{\lambda}_n(\hat{\gamma}_n)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n h(X_i; \gamma_0, \hat{\lambda}_n(\gamma_0)) + \left[ \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \gamma^\top} + \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \lambda^\top} \frac{\partial \hat{\lambda}_n(\gamma_n^*)}{\partial \gamma^\top} \right] \sqrt{n}(\hat{\gamma}_n - \gamma_0),

so that
$$\sqrt{n}(\hat{\gamma}_n - \gamma_0) = -\left[ \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \gamma^\top} + \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \lambda^\top} \frac{\partial \hat{\lambda}_n(\gamma_n^*)}{\partial \gamma^\top} \right]^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^n h(X_i; \gamma_0, \hat{\lambda}_n(\gamma_0)). \quad (4)$$
Now, for each component of $h(X; \gamma_0, \hat{\lambda}_n(\gamma_0))$, we know that
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n h_j(X_i; \gamma_0, \hat{\lambda}_n(\gamma_0)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n h_j(X_i; \gamma_0, \lambda_0) + \left\{ \frac{1}{n^{3/4+\epsilon}} \sum_{i=1}^n \frac{\partial h_j(X_i; \gamma_0, \lambda_0)}{\partial \lambda^\top} \right\} n^{1/4+\epsilon} (\hat{\lambda}_n(\gamma_0) - \lambda_0)$$
$$+ \frac{1}{2}\, n^{1/4}(\hat{\lambda}_n(\gamma_0) - \lambda_0)^\top \left\{ \frac{1}{n} \sum_{i=1}^n \frac{\partial^2 h_j(X_i; \gamma_0, \lambda_n^*)}{\partial \lambda \, \partial \lambda^\top} \right\} n^{1/4}(\hat{\lambda}_n(\gamma_0) - \lambda_0).$$
Both remainder terms are $o_P(1)$: the first because $\frac{1}{n^{3/4+\epsilon}} \sum_i \partial h_j/\partial \lambda^\top = O_P(n^{-1/4-\epsilon})$ (its summands have mean zero since $h \in \Lambda^\perp$), and the second because $n^{1/4}(\hat{\lambda}_n(\gamma_0) - \lambda_0) = o_P(1)$. Thus,
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n h_j(X_i; \gamma_0, \hat{\lambda}_n(\gamma_0)) = \frac{1}{\sqrt{n}} \sum_{i=1}^n h_j(X_i; \gamma_0, \lambda_0) + o_P(1). \quad (5)$$

Plugging (5) into (4), we get
$$\sqrt{n}(\hat{\gamma}_n - \gamma_0) = -\left[ \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \gamma^\top} + \frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \lambda^\top} \frac{\partial \hat{\lambda}_n(\gamma_n^*)}{\partial \gamma^\top} \right]^{-1} \left\{ \frac{1}{\sqrt{n}} \sum_{i=1}^n h(X_i; \gamma_0, \lambda_0) + o_P(1) \right\}. \quad (6)$$
Since we can show that
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \gamma^\top} \xrightarrow{P} E_{\theta_0}\left[ \frac{\partial h(X; \gamma_0, \lambda_0)}{\partial \gamma^\top} \right] = -E_{\theta_0}[h(X; \theta_0) \psi_\gamma(X; \theta_0)^\top],$$
$$\frac{1}{n} \sum_{i=1}^n \frac{\partial h(X_i; \gamma_n^*, \hat{\lambda}_n(\gamma_n^*))}{\partial \lambda^\top} \xrightarrow{P} E_{\theta_0}\left[ \frac{\partial h(X; \gamma_0, \lambda_0)}{\partial \lambda^\top} \right] = -E_{\theta_0}[h(X; \theta_0) \psi_\lambda(X; \theta_0)^\top] = 0,$$
$$\frac{\partial \hat{\lambda}_n(\gamma_n^*)}{\partial \gamma^\top} \xrightarrow{P} d(\theta_0),$$

and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^n h(X_i; \gamma_0, \lambda_0) + o_P(1) \xrightarrow{D} N(0, E_{\theta_0}[h(X; \gamma_0, \lambda_0) h(X; \gamma_0, \lambda_0)^\top]),$$
we can conclude that
$$\sqrt{n}(\hat{\gamma}_n - \gamma_0) = -\frac{1}{\sqrt{n}} \sum_{i=1}^n E_{\theta_0}\left[ \frac{\partial h(X; \gamma_0, \lambda_0)}{\partial \gamma^\top} \right]^{-1} h(X_i; \gamma_0, \lambda_0) + o_P(1).$$

Thus,
$$\sqrt{n}(\hat{\gamma}_n - \gamma_0) \xrightarrow{D} N\left( 0, \ E_{\theta_0}\left[ \frac{\partial h(X; \gamma_0, \lambda_0)}{\partial \gamma^\top} \right]^{-1} E_{\theta_0}[h(X; \gamma_0, \lambda_0) h(X; \gamma_0, \lambda_0)^\top]\, E_{\theta_0}\left[ \frac{\partial h(X; \gamma_0, \lambda_0)}{\partial \gamma^\top} \right]^{-\top} \right).$$
So, the influence function for $\hat{\gamma}_n$ is $-E_{\theta_0}[\partial h(X; \gamma_0, \lambda_0)/\partial \gamma^\top]^{-1} h(x; \gamma_0, \lambda_0)$, which is the influence function that we would have derived if we had pretended that $\lambda_0$ were known.

Algorithm

1. Select an element from the orthogonal complement of the nuisance tangent space.
2. Find a (profile) estimator for the nuisance parameter.
3. The influence function for the resulting estimator will be the same as if the nuisance parameter had been known (see the simulation sketch below).
4. This procedure carries over to the case where the nuisance parameter is infinite-dimensional!
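Here is a simulation sketch of the algorithm (the two-group common-mean model is my own illustration, not from the notes): take $h(y, x; \gamma, \lambda) = (y - \gamma)/\lambda_x$ with group variances $\lambda = (v_0, v_1)$ as the nuisance; plugging in profiled variance estimates should give the same asymptotic variance as using the true variances.

```python
# Sketch: estimate a common mean gamma with group-specific nuisance variances.
# The profiled and oracle estimators should have the same asymptotic variance,
# near 1/E[1/v_X] = 1.8 here.
import numpy as np

rng = np.random.default_rng(7)
gamma0, v = 0.0, np.array([1.0, 9.0])
n, reps = 2_000, 3_000
est_profiled, est_oracle = np.empty(reps), np.empty(reps)

for r in range(reps):
    grp = rng.integers(0, 2, n)
    y = gamma0 + rng.normal(size=n) * np.sqrt(v[grp])
    # oracle: weights from the true variances
    w = 1.0 / v[grp]
    est_oracle[r] = (w * y).sum() / w.sum()
    # profiled: alternate between gamma and the nuisance variances
    g = y.mean()
    for _ in range(10):
        lam = np.array([np.mean((y[grp == j] - g) ** 2) for j in (0, 1)])
        wn = 1.0 / lam[grp]
        g = (wn * y).sum() / wn.sum()
    est_profiled[r] = g

print(n * est_oracle.var(), n * est_profiled.var())  # both near 1.8
```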

Efficiency

A geometric interpretation can be useful in helping us define the most efficient estimator among the class of regular, asymptotically linear (RAL) estimators. Since the asymptotic variance of a RAL estimator is the variance of its influence function, we can now look for the most efficient estimator by finding the influence function with the smallest variance. If we have two influence functions, say $\varphi_1(X)$ and $\varphi_2(X)$, we say that $\mathrm{Var}_{\theta_0}[\varphi_1(X)] \le \mathrm{Var}_{\theta_0}[\varphi_2(X)]$ if and only if
$$\mathrm{Var}_{\theta_0}[a^\top \varphi_1(X)] \le \mathrm{Var}_{\theta_0}[a^\top \varphi_2(X)] \quad \text{for all } a \in R^k.$$

Or equivalently,
$$a^\top E_{\theta_0}[\varphi_1(X)\varphi_1(X)^\top] a \le a^\top E_{\theta_0}[\varphi_2(X)\varphi_2(X)^\top] a,$$
which means that $E_{\theta_0}[\varphi_2(X)\varphi_2(X)^\top] - E_{\theta_0}[\varphi_1(X)\varphi_1(X)^\top]$ is positive semi-definite.

Efficient Score and Efficient Influence Function

Definition: The efficient score for $\gamma$ is defined as the residual from the projection of $\psi_\gamma(X; \theta_0)$ onto $\Lambda$. That is,
$$\psi_\gamma^{\mathrm{eff}}(X; \theta_0) = \psi_\gamma(X; \theta_0) - \Pi[\psi_\gamma(X; \theta_0) \mid \Lambda],$$
where
$$\Pi[\psi_\gamma(X; \theta_0) \mid \Lambda] = E_{\theta_0}[\psi_\gamma(X;\theta_0) \psi_\lambda(X;\theta_0)^\top]\, E_{\theta_0}[\psi_\lambda(X;\theta_0) \psi_\lambda(X;\theta_0)^\top]^{-1} \psi_\lambda(X; \theta_0).$$
By construction, $\psi_\gamma^{\mathrm{eff}}(X; \theta_0) \in \Lambda^\perp$. By normalizing it so that its inner product with $\psi_\gamma(X; \theta_0)$ is the identity, we then define the efficient influence function
$$\varphi^{\mathrm{eff}}(X) = E_{\theta_0}[\psi_\gamma^{\mathrm{eff}}(X;\theta_0) \psi_\gamma(X;\theta_0)^\top]^{-1} \psi_\gamma^{\mathrm{eff}}(X; \theta_0) = E_{\theta_0}[\psi_\gamma^{\mathrm{eff}}(X;\theta_0) \psi_\gamma^{\mathrm{eff}}(X;\theta_0)^\top]^{-1} \psi_\gamma^{\mathrm{eff}}(X; \theta_0).$$
We are now in a position to prove that $\varphi^{\mathrm{eff}}(X)$ is the most efficient influence function.
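A Monte Carlo sketch (a Gamma-model example of my own, using scipy's digamma/polygamma): compute the efficient score for the shape parameter with the rate as nuisance, and check that the variance of $\varphi^{\mathrm{eff}}$ matches $[I_{\gamma\gamma} - I_{\gamma\lambda} I_{\lambda\lambda}^{-1} I_{\lambda\gamma}]^{-1}$.

```python
# Sketch: efficient score for the Gamma shape g0 (rate l0 as nuisance).
# psi_g = log(l0) - digamma(g0) + log(x), psi_l = g0/l0 - x, and the
# information blocks are I_gg = trigamma(g0), I_gl = -1/l0, I_ll = g0/l0^2.
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(8)
g0, l0 = 3.0, 2.0
x = rng.gamma(g0, 1.0 / l0, 2_000_000)

psi_g = np.log(l0) - digamma(g0) + np.log(x)   # score for the shape
psi_l = g0 / l0 - x                            # score for the rate

b = np.mean(psi_g * psi_l) / np.mean(psi_l**2)
psi_eff = psi_g - b * psi_l                    # efficient score (residual)
phi_eff = psi_eff / np.mean(psi_eff * psi_g)   # efficient influence function

I_gg, I_gl, I_ll = polygamma(1, g0), -1.0 / l0, g0 / l0**2
print(np.mean(phi_eff**2), 1.0 / (I_gg - I_gl**2 / I_ll))   # approximately equal
```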

Claim: All influence functions can be written as $\varphi^{\mathrm{eff}}(X) + l(X)$, where $l(X) \in \Lambda^\perp$ and $l(X) \perp \{B \varphi^{\mathrm{eff}}(X) : \text{for all } B\}$.

Proof: If $\varphi(X)$ is an influence function, then $\varphi(X) \in \Lambda^\perp$. In addition, $\varphi^{\mathrm{eff}}(X) \in \Lambda^\perp$. Therefore, $l(X) = \varphi(X) - \varphi^{\mathrm{eff}}(X) \in \Lambda^\perp$. Since $l(X) \in \Lambda^\perp$, we know that
$$E_{\theta_0}[l(X) \psi_\gamma^{\mathrm{eff}}(X;\theta_0)^\top] = E_{\theta_0}[l(X) \psi_\gamma(X;\theta_0)^\top] = E_{\theta_0}[(\varphi(X) - \varphi^{\mathrm{eff}}(X)) \psi_\gamma(X;\theta_0)^\top] = 0,$$
where the last equality holds because $\varphi$ and $\varphi^{\mathrm{eff}}$ are both influence functions, so each has inner product with $\psi_\gamma$ equal to the identity. Since the space spanned by $\psi_\gamma^{\mathrm{eff}}(X;\theta_0)$ is equal to $\{B \varphi^{\mathrm{eff}}(X) : \text{for all } B\}$, we have that $l(X) \perp \{B \varphi^{\mathrm{eff}}(X) : \text{for all } B\}$. In other words,
$$l(X) \in \Lambda^\perp \cap \{B \varphi^{\mathrm{eff}}(X) : \text{for all } B\}^\perp = \left[ \Lambda \oplus \{B \varphi^{\mathrm{eff}}(X) : \text{for all } B\} \right]^\perp.$$

From this, it is now easy to show that $\varphi^{\mathrm{eff}}(X)$ is the influence function with the smallest variance. Note that
$$E_{\theta_0}[\varphi(X)\varphi(X)^\top] = E_{\theta_0}[(\varphi^{\mathrm{eff}}(X) + l(X))(\varphi^{\mathrm{eff}}(X) + l(X))^\top] = E_{\theta_0}[\varphi^{\mathrm{eff}}(X)\varphi^{\mathrm{eff}}(X)^\top] + E_{\theta_0}[l(X) l(X)^\top].$$
Hence, $E_{\theta_0}[\varphi(X)\varphi(X)^\top] - E_{\theta_0}[\varphi^{\mathrm{eff}}(X)\varphi^{\mathrm{eff}}(X)^\top]$ is positive semi-definite. This implies that $\varphi^{\mathrm{eff}}(X)$ is the most efficient influence function. The variance of the efficient influence function is $E_{\theta_0}[\psi_\gamma^{\mathrm{eff}}(X;\theta_0) \psi_\gamma^{\mathrm{eff}}(X;\theta_0)^\top]^{-1}$, the inverse of the variance of the efficient score. We then get the well-known result that the minimum variance for the most efficient regular estimator is
$$\left[ I_{\gamma\gamma}(\theta_0) - I_{\gamma\lambda}(\theta_0)\, I_{\lambda\lambda}(\theta_0)^{-1} I_{\gamma\lambda}(\theta_0)^\top \right]^{-1}.$$
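This efficiency bound is just the $(\gamma, \gamma)$ block of $I(\theta_0)^{-1}$, by the partitioned-matrix inverse; a quick check with generic numbers (not from the notes):

```python
# Sketch: the (gamma, gamma) block of I(theta_0)^{-1} equals
# [I_gg - I_gl I_ll^{-1} I_lg]^{-1}, the efficiency bound above.
import numpy as np

rng = np.random.default_rng(9)
k, q = 2, 5
A = rng.normal(size=(q, q))
I = A @ A.T + q * np.eye(q)             # a positive-definite "information" matrix
I_gg, I_gl, I_lg, I_ll = I[:k, :k], I[:k, k:], I[k:, :k], I[k:, k:]

bound = np.linalg.inv(I_gg - I_gl @ np.linalg.inv(I_ll) @ I_lg)
print(np.allclose(np.linalg.inv(I)[:k, :k], bound))   # True
```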
