GMM Estimation and Testing
Whitney Newey
July 2007
Idea: Estimate parameters by setting sample moments close to their population counterparts.

Definitions: $\beta$: $p \times 1$ parameter vector, with true value $\beta_0$. $g_i(\beta) = g(w_i,\beta)$: $m \times 1$ vector of functions of the $i$th data observation $w_i$ and the parameters.

Model (or moment restriction): $E[g_i(\beta_0)] = 0$.

Definitions: $\hat g(\beta) \stackrel{def}{=} \sum_{i=1}^n g_i(\beta)/n$: sample averages. $\hat A$: $m \times m$ positive semi-definite matrix.

GMM ESTIMATOR: $\hat\beta = \arg\min_\beta \hat g(\beta)'\hat A\,\hat g(\beta)$.
Interpretation: Choosing $\hat\beta$ so sample moments are close to zero. For $\|g\|_{\hat A} = \sqrt{g'\hat A g}$, this is the same as minimizing $\|\hat g(\beta) - 0\|_{\hat A}$. When $m = p$, the $\hat\beta$ with $\hat g(\hat\beta) = 0$ will be the GMM estimator for any $\hat A$. When $m > p$, then $\hat A$ matters.

Method of Moments is a special case:
Moments: $E[y^j] = h_j(\beta_0)$, $(1 \le j \le p)$.
Specify moment functions: $g_i(\beta) = (y_i - h_1(\beta),\ldots,y_i^p - h_p(\beta))'$.
Estimator: $\hat g(\hat\beta) = 0$, same as $\sum_{i=1}^n y_i^j/n = h_j(\hat\beta)$, $(1 \le j \le p)$.
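A minimal sketch of the exactly identified ($m = p$) method-of-moments case: matching the first two moments of a scalar $y$ to $h_1(\beta) = \mu$ and $h_2(\beta) = \mu^2 + \sigma^2$. The simulated normal data and the function names are assumptions of the illustration, not part of the notes.

```python
# Method of moments as GMM with m = p: solve ghat(beta) = 0 exactly.
# Illustrative assumption: y is drawn i.i.d. normal with unknown mean/variance.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)

def g_bar(beta):
    """Sample moments ghat(beta) = sum_i g_i(beta)/n for beta = (mu, sigma2)."""
    mu, sigma2 = beta
    g1 = y - mu                      # y_i - h_1(beta)
    g2 = y**2 - (mu**2 + sigma2)     # y_i^2 - h_2(beta)
    return np.array([g1.mean(), g2.mean()])

# With m = p the GMM estimator solves ghat(beta) = 0 regardless of A_hat.
beta_hat = root(g_bar, x0=np.array([0.0, 1.0])).x
print(beta_hat)                      # close to (2.0, 1.5**2)
```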
Two-Stage Least Squares as GMM:
Model: $y_i = X_i'\beta + \varepsilon_i$, $E[Z_i\varepsilon_i] = 0$, $(i = 1,\ldots,n)$.
Specify moment functions: $g_i(\beta) = Z_i(y_i - X_i'\beta)$.
Sample moments are: $\hat g(\beta) = \sum_{i=1}^n Z_i(y_i - X_i'\beta)/n = Z'(y - X\beta)/n$, where $Z = [Z_1,\ldots,Z_n]'$, $X = [X_1,\ldots,X_n]'$, $y = (y_1,\ldots,y_n)'$.
Specify distance (weighting) matrix: $\hat A = (Z'Z/n)^{-1}$.
GMM estimator is 2SLS: $\hat\beta = \arg\min_\beta [(y - X\beta)'Z/n](Z'Z/n)^{-1}Z'(y - X\beta)/n = \arg\min_\beta (y - X\beta)'Z(Z'Z)^{-1}Z'(y - X\beta)$.
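A minimal numpy sketch of the closed form this minimization implies, $\hat\beta = (X'Z(Z'Z)^{-1}Z'X)^{-1}X'Z(Z'Z)^{-1}Z'y$. The simulated design (one endogenous regressor, two excluded instruments, a constant) is an illustrative assumption.

```python
# 2SLS as GMM with A_hat = (Z'Z/n)^{-1}; simulated endogenous design (assumption).
import numpy as np

rng = np.random.default_rng(0)
n = 1000
Zx = rng.normal(size=(n, 2))                             # excluded instruments
u = rng.normal(size=n)                                   # common shock creating endogeneity
x = Zx @ np.array([1.0, -0.5]) + u + rng.normal(size=n)  # endogenous regressor
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), Zx])
eps = u + rng.normal(size=n)                             # correlated with x, not with Z
y = X @ np.array([0.5, 2.0]) + eps

PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)                   # projection onto the columns of Z
beta_2sls = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)  # closed-form minimizer
print(beta_2sls)                                          # close to (0.5, 2.0)
```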
Intertemporal CAPM: Nonlinear in parameters. $c_i$ is consumption at time $i$, $R_i$ is the asset return between $i$ and $i+1$, $\alpha_0$ is the time discount factor, $u(c,\gamma_0)$ is the utility function, and $I_i$ is the set of variables observed at time $i$. The agent maximizes $E[\sum_{j=0}^{\infty}\alpha_0^j u(c_{i+j},\gamma_0)]$ subject to an intertemporal budget constraint.
The first-order conditions are $E[R_i\,\alpha_0\,u_c(c_{i+1},\gamma_0)/u_c(c_i,\gamma_0) \mid I_i] = 1$: the marginal rate of substitution between $i$ and $i+1$ equals the rate of return (in expected value). Let $Z_i$ denote an $m \times 1$ vector of variables observable at time $i$ (like lagged consumption, lagged returns, and nonlinear functions of these). The moment function is $g_i(\beta) = Z_i\{R_i\,\alpha\,u_c(c_{i+1},\gamma)/u_c(c_i,\gamma) - 1\}$ for $\beta = (\alpha,\gamma')'$. Here GMM is nonlinear instrumental variables. Empirical example: Hansen and Singleton (1982, Econometrica).
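A sketch of this moment function under the CRRA specification $u(c,\gamma) = c^{1-\gamma}/(1-\gamma)$, so $u_c(c,\gamma) = c^{-\gamma}$, as in Hansen and Singleton. The function name, argument layout, and array shapes are assumptions of the illustration.

```python
# Euler-equation moment function g_i(beta) = Z_i { R_i * alpha * (c_{i+1}/c_i)^(-gamma) - 1 }
# under CRRA utility (an assumed functional form for the illustration).
import numpy as np

def euler_moments(beta, c, R, Z):
    """
    beta = (alpha, gamma): discount factor and risk-aversion parameter.
    c : (T+1,) consumption series; R : (T,) gross return between t and t+1;
    Z : (T, m) instruments observable at time t (e.g. a constant, lagged
        consumption growth, lagged returns).
    Returns the (T, m) array whose rows are g_i(beta).
    """
    alpha, gamma = beta
    mrs = alpha * (c[1:] / c[:-1]) ** (-gamma)   # alpha * u_c(c_{t+1}) / u_c(c_t)
    resid = R * mrs - 1.0                         # Euler-equation residual
    return Z * resid[:, None]                     # instrument-by-residual products

# ghat(beta) = euler_moments(beta, c, R, Z).mean(axis=0) then feeds the
# nonlinear GMM objective ghat(beta)' A_hat ghat(beta).
```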
IDENTIFICATION: Identification precedes estimation. If a parameter is not identified then no consistent estimator exists.

Identification from moment functions: Let $\bar g(\beta) = E[g_i(\beta)]$. The identification condition is that $\beta_0$ IS THE ONLY SOLUTION TO $\bar g(\beta) = 0$.

A necessary condition for identification is that $m \ge p$. When $m < p$, i.e. there are fewer equations to solve than parameters, there will typically be multiple solutions to the moment conditions. In IV, one needs at least as many instrumental variables as right-hand side variables.
Rank condition for identification: Let $G = E[\partial g_i(\beta_0)/\partial\beta]$. The rank condition is $\mathrm{rank}(G) = p$. This is necessary and sufficient for identification when $g_i(\beta)$ is linear in $\beta$: $0 = \bar g(\beta) = \bar g(\beta_0) + G(\beta - \beta_0) = G(\beta - \beta_0)$ has a unique solution at $\beta_0$ if and only if the rank of $G$ equals the number of columns of $G$. For IV, $G = -E[Z_iX_i']$, so that $\mathrm{rank}(G) = p$ is the usual rank condition.
In the general nonlinear case it is difficult to specify conditions for uniqueness of the solution to $\bar g(\beta) = 0$. $\mathrm{rank}(G) = p$ will be sufficient for local identification: there exists a neighborhood of $\beta_0$ such that $\beta_0$ is the unique solution to $\bar g(\beta) = 0$ for all $\beta$ in that neighborhood.

Exact identification is $m = p$. For IV, as many instruments as right-hand side variables. Here the GMM estimator will satisfy $\hat g(\hat\beta) = 0$ asymptotically; see notes.

Overidentification is $m > p$.
TWO-STEP OPTIMAL GMM ESTIMATOR: When $m > p$ the GMM estimator depends on the weighting matrix $\hat A$. The optimal $\hat A$ minimizes the asymptotic variance of $\hat\beta$. Let $\Omega$ be the asymptotic variance of $\sqrt{n}\,\hat g(\beta_0) = \sum_{i=1}^n g_i(\beta_0)/\sqrt{n}$, i.e. $\sqrt{n}\,\hat g(\beta_0) \to_d N(0,\Omega)$. In general, the central limit theorem for time series gives $\Omega = \lim_{n\to\infty} E[n\,\hat g(\beta_0)\hat g(\beta_0)']$. The optimal $\hat A$ satisfies $\hat A \to_p \Omega^{-1}$. Take $\hat A = \hat\Omega^{-1}$ for $\hat\Omega$ a consistent estimator of $\Omega$.
Estimating $\Omega$: Let $\tilde\beta$ be a preliminary GMM estimator (with a known $\hat A$). Let $\tilde g_i = g_i(\tilde\beta) - \hat g(\tilde\beta)$. If $E[g_i(\beta_0)g_{i+\ell}(\beta_0)'] = 0$ for all $\ell \neq 0$ (no autocorrelation in the moment conditions), then one can use $\hat\Omega = \sum_{i=1}^n \tilde g_i\tilde g_i'/n$. An example is the CAPM model above.
If there is autocorrelation in the moment conditions, then use
$\hat\Omega = \hat\Lambda_0 + \sum_{\ell=1}^{L} w_{\ell L}(\hat\Lambda_\ell + \hat\Lambda_\ell')$, $\qquad \hat\Lambda_\ell = \sum_{i=1}^{n-\ell}\tilde g_i\tilde g_{i+\ell}'/n$.
The weights $w_{\ell L}$ make $\hat\Omega$ positive semi-definite. Bartlett weights $w_{\ell L} = 1 - \ell/(L+1)$, as in Newey and West (1987). Choice of $L$.
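A numpy sketch of this estimator with Bartlett weights. The function name and the convention that the rows of `g` are the (not yet recentered) $g_i(\tilde\beta)$ are assumptions of the illustration.

```python
# Newey-West (1987) estimator of Omega with Bartlett weights w_{lL} = 1 - l/(L+1),
# applied to recentered moments gtilde_i = g_i(betatilde) - ghat(betatilde).
import numpy as np

def newey_west_omega(g, L):
    """
    g : (n, m) array whose rows are g_i(betatilde); L : number of lags.
    Returns Omega_hat = Lambda_0 + sum_{l=1}^L w_{lL} (Lambda_l + Lambda_l').
    """
    n, m = g.shape
    g = g - g.mean(axis=0)                 # recenter: gtilde_i
    omega = g.T @ g / n                    # Lambda_0
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1.0)            # Bartlett weight
        lam = g[l:].T @ g[:-l] / n         # sum_i gtilde_{i+l} gtilde_i' / n
        omega += w * (lam + lam.T)         # add lag-l term and its transpose
    return omega
```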
Two-step optimal GMM: $\hat\beta = \arg\min_\beta \hat g(\beta)'\hat\Omega^{-1}\hat g(\beta)$.

Two-step optimal instrumental variables: for $\hat g(\beta) = Z'(y - X\beta)/n$,
$\hat\beta = (X'Z\hat\Omega^{-1}Z'X)^{-1}X'Z\hat\Omega^{-1}Z'y$.
CONSISTENT ASYMPTOTIC VARIANCE ESTIMATION: The optimal two-step GMM estimator has $\sqrt{n}(\hat\beta - \beta_0) \to_d N(0,V)$, $V = (G'\Omega^{-1}G)^{-1}$. To form asymptotic t-statistics and confidence intervals we need a consistent estimator $\hat V$ of $V$. Let $\hat G = \partial\hat g(\hat\beta)/\partial\beta$. A consistent estimator of $V$ is $\hat V = (\hat G'\hat\Omega^{-1}\hat G)^{-1}$. One could also update $\hat\Omega$.
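A sketch combining the two steps for the linear IV case, ending with $\hat V = (\hat G'\hat\Omega^{-1}\hat G)^{-1}$ and standard errors $\sqrt{\hat V_{jj}/n}$. Using 2SLS as the preliminary estimator and treating the moments as serially uncorrelated are assumptions of the illustration.

```python
# Two-step optimal IV estimator with its variance estimate; serially uncorrelated
# moments and a 2SLS first step are assumptions of this sketch.
import numpy as np

def two_step_iv(y, X, Z):
    n = len(y)
    # Step 1: preliminary estimator (2SLS, i.e. A_hat = (Z'Z/n)^{-1}).
    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    b1 = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
    # Estimate Omega from recentered moments gtilde_i = Z_i (y_i - X_i'b1) - ghat.
    g = Z * (y - X @ b1)[:, None]
    g = g - g.mean(axis=0)
    Omega = g.T @ g / n
    # Step 2: beta_hat = (X'Z Omega^{-1} Z'X)^{-1} X'Z Omega^{-1} Z'y.
    W = np.linalg.inv(Omega)
    XZ, Zy = X.T @ Z, Z.T @ y
    b2 = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ Zy)
    # Variance: V_hat = (G_hat' Omega_hat^{-1} G_hat)^{-1} with G_hat = -Z'X/n.
    G = -Z.T @ X / n
    V = np.linalg.inv(G.T @ W @ G)
    se = np.sqrt(np.diag(V) / n)          # asymptotic standard errors for beta_hat
    return b2, se
```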
ASYMPTOTIC THEORY FOR GMM: Consistency for i.i.d. data. If the data are i.i.d. and i) $E[g_i(\beta)] = 0$ if and only if $\beta = \beta_0$ (identification); ii) the GMM minimization takes place over a compact set $B$ containing $\beta_0$; iii) $g_i(\beta)$ is continuous at each $\beta$ with probability one and $E[\sup_{\beta\in B}\|g_i(\beta)\|]$ is finite; iv) $\hat A \to_p A$ positive definite; then $\hat\beta \to_p \beta_0$. Proof in Newey and McFadden (1994).

Idea: $\hat g(\beta_0) \to_p 0$ by the law of large numbers, so $\hat g(\beta_0)'\hat A\hat g(\beta_0) \to_p 0$. Also $\hat g(\hat\beta)'\hat A\hat g(\hat\beta) \le \hat g(\beta_0)'\hat A\hat g(\beta_0)$ because $\hat\beta$ minimizes the objective, so $\hat g(\hat\beta)'\hat A\hat g(\hat\beta) \to_p 0$. The only way this can happen is if $\hat\beta \to_p \beta_0$.
Asymptotic normality: If the data are i.i.d., $\hat\beta \to_p \beta_0$ and i) $\beta_0$ is in the interior of the parameter set over which minimization occurs; ii) $g_i(\beta)$ is continuously differentiable on a neighborhood $N$ of $\beta_0$; iii) $E[\sup_{\beta\in N}\|\partial g_i(\beta)/\partial\beta\|]$ is finite; iv) $\hat A \to_p A$ and $G'AG$ is nonsingular, for $G = E[\partial g_i(\beta_0)/\partial\beta]$; v) $\Omega = E[g_i(\beta_0)g_i(\beta_0)']$ exists; then
$\sqrt{n}(\hat\beta - \beta_0) \to_d N(0,V)$, $\qquad V = (G'AG)^{-1}G'A\Omega AG(G'AG)^{-1}$.
See Newey and McFadden (1994) for the proof. We can give a derivation:
Let $\hat G = \partial\hat g(\hat\beta)/\partial\beta$. The first-order conditions are $0 = \hat G'\hat A\hat g(\hat\beta)$. Expand $\hat g(\hat\beta)$ around $\beta_0$ to obtain $\hat g(\hat\beta) = \hat g(\beta_0) + \bar G(\hat\beta - \beta_0)$, where $\bar G = \partial\hat g(\bar\beta)/\partial\beta$ and $\bar\beta$ lies on the line joining $\hat\beta$ and $\beta_0$ (and actually differs from row to row of $\bar G$). Substitute this back into the first-order conditions to get $0 = \hat G'\hat A\hat g(\beta_0) + \hat G'\hat A\bar G(\hat\beta - \beta_0)$. Solve for $\hat\beta - \beta_0$ and multiply through by $\sqrt{n}$ to get
$\sqrt{n}(\hat\beta - \beta_0) = -\left(\hat G'\hat A\bar G\right)^{-1}\hat G'\hat A\sqrt{n}\,\hat g(\beta_0)$.
Recall $\sqrt{n}(\hat\beta - \beta_0) = -\left(\hat G'\hat A\bar G\right)^{-1}\hat G'\hat A\sqrt{n}\,\hat g(\beta_0)$. By the central limit theorem $\sqrt{n}\,\hat g(\beta_0) \to_d N(0,\Omega)$. Also $\hat A \to_p A$, $\hat G \to_p G$, and $\bar G \to_p G$, so by the continuous mapping theorem $\left(\hat G'\hat A\bar G\right)^{-1}\hat G'\hat A \to_p \left(G'AG\right)^{-1}G'A$. Then by the Slutsky lemma,
$\sqrt{n}(\hat\beta - \beta_0) \to_d -\left(G'AG\right)^{-1}G'A\,N(0,\Omega) = N(0,V)$.
The fact that $A = \Omega^{-1}$ minimizes the asymptotic variance follows from the Gauss-Markov theorem. Consider a linear model $E[Y] = G\delta$, $\mathrm{Var}(Y) = \Omega$. $(G'\Omega^{-1}G)^{-1}$ is the variance of generalized least squares (GLS) in this model. $V = (G'AG)^{-1}G'A\Omega AG(G'AG)^{-1}$ is the variance of the linear estimator $\hat\delta = (G'AG)^{-1}G'AY$. GLS has the smallest variance by the Gauss-Markov theorem, so $V - (G'\Omega^{-1}G)^{-1}$ is p.s.d.
TESTING IN GMM: The overidentification test statistic is $n\hat g(\hat\beta)'\hat\Omega^{-1}\hat g(\hat\beta)$. Its limiting distribution is chi-squared with $m - p$ degrees of freedom.

Tests of $H_0: s(\beta_0) = 0$.
Wald test statistic: $n\,s(\hat\beta)'\left[\frac{\partial s(\hat\beta)}{\partial\beta}(\hat G'\hat\Omega^{-1}\hat G)^{-1}\frac{\partial s(\hat\beta)}{\partial\beta}{}'\right]^{-1}s(\hat\beta)$.
Likelihood ratio (sum of squared residuals) test statistic: $\min_{\beta: s(\beta)=0} n\hat g(\beta)'\hat\Omega^{-1}\hat g(\beta) - n\hat g(\hat\beta)'\hat\Omega^{-1}\hat g(\hat\beta)$.
Lagrange multiplier test statistic.
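A sketch of the overidentification statistic and its chi-squared $(m - p)$ p-value. The inputs (the matrix of $g_i(\hat\beta)$ and $\hat\Omega$) are assumed to come from a two-step procedure like the one above.

```python
# Overidentification (J) test: n * ghat(beta_hat)' Omega_hat^{-1} ghat(beta_hat)
# compared with chi-squared(m - p); inputs are assumed given.
import numpy as np
from scipy.stats import chi2

def j_test(g_values, Omega_hat, p):
    """
    g_values : (n, m) array of g_i(beta_hat); Omega_hat : (m, m); p : dim(beta).
    Returns the J statistic and its asymptotic p-value.
    """
    n, m = g_values.shape
    gbar = g_values.mean(axis=0)
    J = n * gbar @ np.linalg.solve(Omega_hat, gbar)
    return J, chi2.sf(J, df=m - p)
```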
USING MANY MOMENTS IN GMM: Using more moments increases the asymptotic efficiency of optimal two-step GMM. Suppose that $g_i(\beta) = (g_i^1(\beta)', g_i^2(\beta)')'$. The optimal GMM estimator for just the first set of moment conditions $g_i^1(\beta)$ uses
$\hat A = \begin{pmatrix} (\hat\Omega^1)^{-1} & 0 \\ 0 & 0 \end{pmatrix}$.
This $\hat A$ is not generally optimal for the entire moment function vector $g_i(\beta)$. The optimal GMM estimator with more moments is asymptotically more efficient, but it gives more bias for two-step GMM.
To see the bias we look at the expectation of the GMM objective function in the i.i.d. case; the time series case is similar. The idea is that if the expectation is not minimized at the truth, then there is bias, because the estimator gets close to the minimizer of the expectation. Assume $A$ is not random. Let $\bar g(\beta) = E[g_i(\beta)]$ and $\Omega(\beta) = E[g_i(\beta)g_i(\beta)']$. Note that
$E[g_i(\beta)'Ag_i(\beta)] = E[\mathrm{tr}\{g_i(\beta)'Ag_i(\beta)\}] = E[\mathrm{tr}\{Ag_i(\beta)g_i(\beta)'\}] = \mathrm{tr}\{A\,E[g_i(\beta)g_i(\beta)']\} = \mathrm{tr}\{A\Omega(\beta)\}$.
The expectation of the GMM objective function $\hat Q(\beta) = \hat g(\beta)'A\hat g(\beta)$ is
$E[\hat Q(\beta)] = \sum_{i,j=1}^n E[g_i(\beta)'Ag_j(\beta)]/n^2 = \sum_{i\neq j}\bar g(\beta)'A\bar g(\beta)/n^2 + \sum_{i=1}^n E[g_i(\beta)'Ag_i(\beta)]/n^2 = (1 - n^{-1})\bar g(\beta)'A\bar g(\beta) + \mathrm{tr}(A\Omega(\beta))/n$.
The first term $(1 - n^{-1})\bar g(\beta)'A\bar g(\beta)$ is minimized at $\beta_0$. However, the second term is NOT generally minimized at $\beta_0$, and hence neither is $E[\hat Q(\beta)]$. This is the Han and Phillips (2005) noise term.
$E[\hat Q(\beta)] = (1 - n^{-1})\bar g(\beta)'A\bar g(\beta) + \mathrm{tr}(A\Omega(\beta))/n$. The second, "bias" term shrinks at rate $n^{-1}$, so GMM is consistent. The bias in GMM behaves like the size of the second term, $m/n$. Higher-order asymptotics can be used to show that this does indeed happen. There are two known approaches to reducing the bias of GMM from this source: A) subtract from the GMM objective function an estimate of the bias; B) make the bias not depend on $\beta$ by the right choice of $A$.
It turns out that B) is better. Choose $A = \Omega(\beta)^{-1}$. Then
$E[\hat Q(\beta)] = (1 - n^{-1})\bar g(\beta)'\Omega(\beta)^{-1}\bar g(\beta) + \mathrm{tr}(\Omega(\beta)^{-1}\Omega(\beta))/n = (1 - n^{-1})\bar g(\beta)'\Omega(\beta)^{-1}\bar g(\beta) + m/n$.
The bias term does not depend on $\beta$, so the objective function will be minimized at the true value. This estimator is not feasible, because $\Omega(\beta)$ is not known. A feasible version can be obtained by replacing $\Omega(\beta)$ with its estimator $\hat\Omega(\beta)$. The estimator that minimizes the resulting objective function is
$\arg\min_\beta \hat g(\beta)'\hat\Omega(\beta)^{-1}\hat g(\beta)$.
This is the continuous updating estimator of Hansen, Heaton, and Yaron (1996). Time series extension: replace $\hat\Omega(\beta)$ with an autocorrelation consistent estimator like that described above.
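A sketch of the continuous updating objective for i.i.d. data, minimized numerically. The user-supplied `moments` function (returning an $n \times m$ array of $g_i(\beta)$), the derivative-free optimizer, and the starting value are assumptions of the illustration.

```python
# Continuous updating estimator: minimize ghat(beta)' Omega_hat(beta)^{-1} ghat(beta),
# with Omega_hat(beta) = sum_i g_i(beta) g_i(beta)'/n (i.i.d. case; an assumption here).
import numpy as np
from scipy.optimize import minimize

def cue_objective(beta, moments):
    g = moments(beta)                      # (n, m) array of g_i(beta), user-supplied
    n = g.shape[0]
    gbar = g.mean(axis=0)                  # ghat(beta)
    Omega = g.T @ g / n                    # Omega_hat(beta)
    return gbar @ np.linalg.solve(Omega, gbar)

def cue(moments, beta0):
    res = minimize(lambda b: cue_objective(b, moments), x0=beta0, method="Nelder-Mead")
    return res.x
```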
GENERALIZED EMPIRICAL LIKELIHOOD (GEL): These estimators share the low bias property of the continuous updating estimator. Let $\rho(u)$ be a concave function of a scalar argument $u$. The estimator is
$\hat\beta = \arg\min_\beta \max_\lambda \sum_{i=1}^n \rho(\lambda'g_i(\beta))$.
This is the continuous updating estimator for $\rho(u)$ quadratic. For $\rho(u) = \ln(1 - u)$ it is empirical likelihood. For $\rho(u) = -e^u$ it is exponential tilting. Time series version, Kitamura (1997): replace $g_i(\beta)$ with the smoothed moments $g_i^L(\beta) = \sum_{\ell=-L}^{L} w_{\ell L}\,g_{i+\ell}(\beta)$.
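A sketch of the saddle-point problem for the empirical likelihood case $\rho(u) = \ln(1 - u)$, with the inner maximization over $\lambda$ done numerically. The `moments` function and the derivative-free optimizers are assumptions of the illustration, not a recommended implementation.

```python
# GEL saddle point: beta_hat = argmin_beta max_lambda sum_i rho(lambda' g_i(beta)),
# here with rho(u) = log(1 - u) (empirical likelihood); crude numerical sketch.
import numpy as np
from scipy.optimize import minimize

def el_inner(beta, moments):
    """max over lambda of sum_i log(1 - lambda' g_i(beta)), via minimizing the negative."""
    g = moments(beta)                                  # (n, m) array of g_i(beta)
    def neg_obj(lam):
        u = g @ lam
        if np.any(u >= 1.0):                           # keep log(1 - u) defined
            return np.inf
        return -np.sum(np.log1p(-u))
    res = minimize(neg_obj, x0=np.zeros(g.shape[1]), method="Nelder-Mead")
    return -res.fun                                    # inner maximized value

def gel_el(moments, beta0):
    """Outer minimization over beta of the inner maximized objective."""
    res = minimize(lambda b: el_inner(b, moments), x0=beta0, method="Nelder-Mead")
    return res.x
```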
GEL has small bias like continuous updating because $\hat\lambda$ is close to zero, so $\rho(u)$ is approximately quadratic, i.e. the estimator behaves like continuous updating. This can also be seen from the first-order conditions. Focus on empirical likelihood with i.i.d. data. The GMM first-order conditions are $\hat G'\hat\Omega(\tilde\beta)^{-1}\hat g(\hat\beta) = 0$. The first-order conditions for empirical likelihood are, for $\hat g_i = g_i(\hat\beta)$:
$\lambda$: $\;0 = \sum_{i=1}^n \frac{1}{1 - \hat\lambda'\hat g_i}\hat g_i = n\hat g(\hat\beta) + \sum_{i=1}^n \frac{1}{1 - \hat\lambda'\hat g_i}\hat g_i\hat g_i'\hat\lambda$,
$\beta$: $\;0 = \sum_{i=1}^n \frac{1}{1 - \hat\lambda'\hat g_i}\,[\partial g_i(\hat\beta)/\partial\beta]'\hat\lambda$.

Define
$\tilde G = \sum_{i=1}^n \frac{1}{1 - \hat\lambda'\hat g_i}\,\partial g_i(\hat\beta)/\partial\beta$, $\qquad \tilde\Omega = \sum_{i=1}^n \frac{1}{1 - \hat\lambda'\hat g_i}\,\hat g_i\hat g_i'$.

Solving gives $\tilde G'\tilde\Omega^{-1}\hat g(\hat\beta) = 0$.

This differs from GMM in replacing the sample averages $\hat G$ and $\hat\Omega(\tilde\beta)$ by the weighted averages $\tilde G$ and $\tilde\Omega$. This replacement removes a source of bias because the weights are the ones that make the moment conditions sum to zero (from the first-order conditions for $\hat\lambda$). Newey and Smith (2004) show that among bias-corrected estimators the empirical likelihood estimator is higher-order efficient with i.i.d. data. It inherits the famous higher-order efficiency properties of maximum likelihood.