Lecture 3: Consistency of Extremum Estimators¹

This lecture shows how one can obtain consistency of extremum estimators. It also shows how one can find the probability limit of extremum estimators in cases where they are not consistent. The parameter space of interest is Θ ⊂ R^d.

Extremum estimators (EE) {θ̂_n : n ≥ 1} are defined to be random elements of Θ that approximately minimize a stochastic criterion function Q̂_n(θ). That is, θ̂_n is defined to satisfy

Definition EE: θ̂_n ∈ Θ and Q̂_n(θ̂_n) ≤ inf_{θ∈Θ} Q̂_n(θ) + o_p(1).

Here are some examples of extremum estimators:

(1) Maximum Likelihood (ML) Estimator: Suppose the data {W_i : i ≤ n} are iid with density f(w, θ) (with respect to some measure µ that does not depend on θ). The likelihood function is Π_{i=1}^n f(W_i, θ). The log-likelihood function is Σ_{i=1}^n log f(W_i, θ). The ML estimator θ̂_n maximizes the likelihood function or the log-likelihood function over Θ. Equivalently, the ML estimator θ̂_n minimizes (at least up to o_p(1)), over θ ∈ Θ,

Q̂_n(θ) = −(1/n) Σ_{i=1}^n log f(W_i, θ).   (1)

In most cases in econometrics, the vector W_i can be decomposed into two subvectors: W_i = (Y_i′, X_i′)′, where Y_i is a vector of outcome (or response) variables and X_i is a vector of explanatory variables. Suppose the conditional density of Y_i given X_i = x is f(y|x, θ) and the marginal density of X_i is some density g(x) that does not depend on θ. Then, f(w, θ) = f(y|x, θ)g(x) and

Q̂_n(θ) = −(1/n) Σ_{i=1}^n log f(W_i, θ) = −(1/n) Σ_{i=1}^n log f(Y_i|X_i, θ) − (1/n) Σ_{i=1}^n log g(X_i).   (2)

Since the second term on the right-hand side does not depend on θ, it can be deleted without affecting the definition of the ML estimator. That is, one can define the ML estimator to minimize minus the conditional log-likelihood

Q̂_n(θ) = −(1/n) Σ_{i=1}^n log f(Y_i|X_i, θ),   (3)

¹The note for this lecture is largely adapted from the note of Donald Andrews on the same topic. I am grateful for Professor Andrews' generosity and elegant exposition. All errors are mine.

Xiaoxia Shi Page: 1
which does not require that one specifies the marginal distribution of the explanatory variables X_i.

In the case of dependent data, we write the likelihood function as the product of conditional densities. Suppose W_i = (Y_i′, X_i′)′, where Y_i and X_i are as above. Let f(y_i|x_i, x_{i−1}, ..., x_1, y_{i−1}, y_{i−2}, ..., y_1, θ) denote the conditional density of Y_i given X_i = x_i, X_{i−1} = x_{i−1}, ..., X_1 = x_1 and Y_{i−1} = y_{i−1}, Y_{i−2} = y_{i−2}, ..., Y_1 = y_1. Let g(x_i|x_{i−1}, ..., x_1, y_{i−1}, ..., y_1) denote the conditional density of X_i given X_{i−1} = x_{i−1}, ..., X_1 = x_1 and Y_{i−1} = y_{i−1}, ..., Y_1 = y_1. If g(·|·) does not depend on θ, then the explanatory variables {X_1, ..., X_n} are said to be weakly exogenous. If g(·|·) does not depend on θ or on y_{i−1}, ..., y_1, then the explanatory variables are said to be strongly exogenous.

Suppose the explanatory variables are weakly or strongly exogenous. Then, the likelihood function can be written as

g(X_1)f(Y_1|X_1, θ) · g(X_2|X_1, Y_1)f(Y_2|X_2, X_1, Y_1, θ) · ... · g(X_n|X_{n−1}, ..., X_1, Y_{n−1}, ..., Y_1)f(Y_n|X_n, ..., X_1, Y_{n−1}, ..., Y_1, θ)
= Π_{i=1}^n f(Y_i|X_i, ..., X_1, Y_{i−1}, ..., Y_1, θ) g(X_i|X_{i−1}, ..., X_1, Y_{i−1}, ..., Y_1).   (4)

The log-likelihood function is

Σ_{i=1}^n log f(Y_i|X_i, ..., X_1, Y_{i−1}, ..., Y_1, θ) + Σ_{i=1}^n log g(X_i|X_{i−1}, ..., X_1, Y_{i−1}, ..., Y_1).   (5)

In this case, the ML estimator minimizes

Q̂_n(θ) = −(1/n) Σ_{i=1}^n log f(Y_i|X_i, ..., X_1, Y_{i−1}, ..., Y_1, θ).   (6)

If {(Y_i, X_i) : i ≥ 1} is Markov of order κ, then f(y_i|x_i, ..., x_1, y_{i−1}, ..., y_1, θ) = f(y_i|x_i, ..., x_{i−κ}, y_{i−1}, ..., y_{i−κ}, θ) and we take

Q̂_n(θ) = −(1/n) Σ_{i=1}^n log f(Y_i|X_i, ..., X_{i−κ}, Y_{i−1}, ..., Y_{i−κ}, θ).   (7)

(2) Least Squares (LS) Estimator for Nonlinear Regression: Suppose the data W_i = (Y_i, X_i′)′ for i ≤ n are iid and satisfy the nonlinear regression model

Y_i = g(X_i, θ_0) + ε_i,   (8)
where E(ε_i|X_i) = 0 a.s. The LS estimator θ̂_n minimizes

Q̂_n(θ) = (1/n) Σ_{i=1}^n (Y_i − g(X_i, θ))²/2   (9)

over θ ∈ Θ. (The scale factor 1/2 is used because it is convenient for the asymptotic normality results given below. It has no effect on the consistency results.)

(3) Generalized Method of Moments (GMM) Estimator: Let θ_0 denote the true value. Suppose the data {W_i : i ≤ n} are iid and the following moment conditions hold:

E g(W_i, θ) = 0 if θ = θ_0, and E g(W_i, θ) ≠ 0 if θ ≠ θ_0,   (10)

where g(w, θ) ∈ R^k for some k ≥ d. Let A_n be a k × k random (weight) matrix. Then, the GMM estimator θ̂_n minimizes

Q̂_n(θ) = ||A_n (1/n) Σ_{i=1}^n g(W_i, θ)||²/2   (11)

over θ ∈ Θ, where ||·|| is the Euclidean norm.

(4) Minimum Distance (MD) Estimator: Let π̂_n be a consistent unrestricted estimator of a k-vector parameter π_0. Suppose π_0 is known to be a function of a d-vector parameter θ_0, where d ≤ k:

π_0 = g(θ_0).   (12)

Let A_n be a k × k random weight matrix. Then, the MD estimator θ̂_n minimizes

Q̂_n(θ) = ||A_n(π̂_n − g(θ))||²/2   (13)

over θ ∈ Θ. For example, in the correlated random effects panel data model, π̂_n is the OLS (or FGLS) estimator of π_0.²

(5) Two-step (TS) Estimator: Suppose the data {W_i : i ≤ n} are iid, τ̂_n is a preliminary consistent estimator of a parameter τ_0, G_n(θ, τ) is a random k-vector that should be close to 0 when θ = θ_0, τ = τ_0, and n is large (e.g., in the GMM case, G_n(θ, τ) = (1/n) Σ_{i=1}^n g(W_i, θ, τ), and in the MD case, G_n(θ, τ) = π̂(τ) − g(θ, τ)), and A_n is a k × k random weight matrix. Then, the TS estimator θ̂_n minimizes

Q̂_n(θ) = ||A_n G_n(θ, τ̂_n)||²/2   (17)

over θ ∈ Θ.

The criterion function Q̂_n(θ) is assumed to satisfy:

Assumption U-WCON: sup_{θ∈Θ} |Q̂_n(θ) − Q(θ)| →_p 0 for some nonstochastic function Q(θ) on Θ.

For a given estimator of interest, one has to verify this condition. What is the function Q(θ) in our examples? It is given by:

(1) ML Estimator: Q(θ) = −E log f(W_i, θ), where E denotes expectation under the true distribution generating the data.

(2) LS Estimator: Q(θ) = E(Y_i − g(X_i, θ))²/2.

(3) GMM Estimator: Q(θ) = ||A E g(W_i, θ)||²/2, where A_n →_p A.

(4) MD Estimator: Q(θ) = ||A(π_0 − g(θ))||²/2, where A_n →_p A and π̂_n →_p π_0.

(5) TS Estimator: Q(θ) = ||A G(θ, τ_0)||²/2, where A_n →_p A, τ̂_n →_p τ_0, and G_n(θ, τ) →_p G(θ, τ).

The next assumption is that of identifiable uniqueness of some non-stochastic element θ_0 of Θ. It is to this parameter value that the estimators {θ̂_n : n ≥ 1} converge as n → ∞. Let B(θ, ε) denote an open ball in Θ of radius ε centered at θ.

Assumption ID: There exists θ_0 ∈ Θ such that ∀ε > 0, inf_{θ∉B(θ_0,ε)} Q(θ) > Q(θ_0).

A necessary condition for Assumption ID is that θ_0 uniquely minimizes Q(θ) over Θ. What values uniquely minimize Q(θ) in the examples? The answer is as follows:

²(Wooldridge (2002), p. 445) A correlated random effects panel data model is of the form:

y_it = ψ + x_it β + c_i + v_it,   (14)

where x_it is a 1 × K row vector, x_i = (x_i1, ..., x_iT), c_i = Σ_{t=1}^T x_it λ_t, E(v_it) = 0 and E(x_i′ v_it) = 0, t = 1, ..., T. The structural parameter of the model is θ = (ψ, λ_1′, ..., λ_T′, β′)′. To apply the MD approach, one first estimates (by OLS)

y_it = π_t0 + x_i π_t + v_it, t = 1, ..., T.   (15)

Let π = (π_10, π_1′, ..., π_T0, π_T′)′. Then one forms the MD problem using the restriction

π = Hθ,   (16)

where

H = [ 1  0     0
      0  I_KT  E_1
      1  0     0
      0  I_KT  E_2
      ⋮  ⋮     ⋮
      1  0     0
      0  I_KT  E_T ],

E_t = e_t ⊗ I_K, and e_t is a T-vector whose t-th element is 1 and whose other elements are 0. The MD objective function is as defined in (13) with g(θ) = Hθ and some user-chosen weight matrix A_n.
(1) ML Estimator (with iid observations): If the true density f(w, θ_0) of the data is in the parametric family {f(w, θ) : θ ∈ Θ}, then θ_0 minimizes Q(θ):

Q(θ_0) − Q(θ) = E_{θ_0} log(f(W_i, θ)/f(W_i, θ_0))
≤ log E_{θ_0} [f(W_i, θ)/f(W_i, θ_0)]
= log ∫ [f(w, θ)/f(w, θ_0)] f(w, θ_0) dµ(w) = 0,

where the inequality holds by Jensen's inequality because log(·) is a concave function. The inequality is strict, and so θ_0 uniquely minimizes Q(θ) over Θ, if and only if P(f(W_i, θ) ≠ f(W_i, θ_0)) > 0 ∀θ ∈ Θ with θ ≠ θ_0.

Suppose the true density f(·) of W_i is not in the parametric family {f(·, θ) : θ ∈ Θ}. In this case, we proceed as follows. The Kullback–Leibler Information Criterion (KLIC) between f(·) and f(·, θ) is defined by

KLIC(f, f(·, θ)) = E_f log f(W_i) − E_f log f(W_i, θ),

where E_f denotes expectation when W_i has density f. Note that KLIC(f, f(·, θ)) = E_f log f(W_i) + Q(θ). In consequence, the ML estimator under misspecification (often called the quasi-ML estimator) converges in probability to the parameter value θ_0 that uniquely minimizes the KLIC between the true density f and the densities in the parametric family {f(·, θ) : θ ∈ Θ} (provided such a unique value exists).

(2) LS Estimator: The true value θ_0 minimizes Q(θ) if the regression model is correctly specified (i.e., if there is a θ_0 ∈ Θ such that E(Y_i|X_i) = g(X_i, θ_0) a.s.). Writing U_i ≡ ε_i = Y_i − g(X_i, θ_0),

Q(θ) − Q(θ_0) = E(U_i + g(X_i, θ_0) − g(X_i, θ))²/2 − E U_i²/2
= E(g(X_i, θ_0) − g(X_i, θ))²/2 + E U_i(g(X_i, θ_0) − g(X_i, θ))
= E(g(X_i, θ_0) − g(X_i, θ))²/2 ≥ 0.

The inequality is strict, and so θ_0 uniquely minimizes Q(θ) over Θ, if and only if P(g(X_i, θ) ≠ g(X_i, θ_0)) > 0 ∀θ ∈ Θ with θ ≠ θ_0.
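The Jensen's-inequality argument above can be checked numerically. The sketch below is my own toy illustration (not part of the note): it takes the N(θ, 1) family with true value θ_0 = 0, approximates Q(θ) = −E log f(W_i, θ) by a sample average, and verifies that KLIC(f, f(·, θ)) = Q(θ) − Q(θ_0), which here equals θ²/2, is nonnegative and zero only at θ_0.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=200_000)          # truth: W_i ~ N(0, 1), so theta0 = 0

def Q(theta):
    """Q(theta) = -E log f(W_i, theta) for the N(theta, 1) family,
    approximated by a sample average of minus the log density."""
    return np.mean(0.5 * np.log(2.0 * np.pi) + 0.5 * (W - theta) ** 2)

# KLIC between the truth and f(., theta): Q(theta) - Q(theta0) = theta^2 / 2
thetas = np.linspace(-2.0, 2.0, 9)
klic = np.array([Q(t) - Q(0.0) for t in thetas])
```

Up to Monte Carlo error in the sample averages, klic traces out θ²/2, so the quasi-ML pseudo-true value in this correctly specified toy case is θ_0 = 0.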
Suppose the nonlinear regression model is not correctly specified. Let g(x) = E(Y_i|X_i = x). Then,

Q(θ) = E(Y_i − g(X_i) + g(X_i) − g(X_i, θ))²/2 = E(Y_i − g(X_i))²/2 + E(g(X_i) − g(X_i, θ))²/2,

and a value θ_0 uniquely minimizes Q(θ) over Θ if it uniquely minimizes E(g(X_i) − g(X_i, θ))² over θ ∈ Θ. That is, the LS estimator converges to the point θ_0 that gives the best mean squared approximation in the family {g(·, θ) : θ ∈ Θ} to the conditional mean of Y_i given X_i (provided the best approximation is unique).

(3) GMM Estimator: If A is nonsingular and there exists a unique value θ_0 ∈ Θ such that E g(W_i, θ_0) = 0, then θ_0 uniquely minimizes Q(θ) over Θ. Suppose the moment conditions are misspecified and no value θ ∈ Θ is such that E g(W_i, θ) = 0. Then, θ_0 is the value that uniquely minimizes Q(θ) = E g(W_i, θ)′A′A E g(W_i, θ)/2, if such a value exists.

(4) MD Estimator: If A is nonsingular and there exists a unique value θ_0 ∈ Θ such that π_0 = g(θ_0), then θ_0 uniquely minimizes Q(θ) over Θ. Suppose the restrictions are misspecified and no value θ ∈ Θ is such that π_0 = g(θ). Then, θ_0 is the value that uniquely minimizes Q(θ) = (π_0 − g(θ))′A′A(π_0 − g(θ))/2, if such a value exists.

(5) TS Estimator: If A is nonsingular and there exists a unique value θ_0 ∈ Θ such that G(θ_0, τ_0) = 0, then θ_0 uniquely minimizes Q(θ) over Θ.

The following theorem gives sufficient conditions for θ̂_n →_p θ_0.

Theorem 3.1 (Consistency): EE, ID, & U-WCON ⟹ θ̂_n →_p θ_0.

Proof of Theorem 3.1: Given ε > 0, ∃δ > 0 such that θ ∉ B(θ_0, ε) ⟹ Q(θ) − Q(θ_0) ≥ δ > 0. Thus,

P(θ̂_n ∉ B(θ_0, ε)) ≤ P(Q(θ̂_n) − Q(θ_0) ≥ δ)
= P(Q(θ̂_n) − Q̂_n(θ̂_n) + Q̂_n(θ̂_n) − Q(θ_0) ≥ δ)
≤ P(Q(θ̂_n) − Q̂_n(θ̂_n) + Q̂_n(θ_0) + o_p(1) − Q(θ_0) ≥ δ)
≤ P(2 sup_{θ∈Θ} |Q̂_n(θ) − Q(θ)| + o_p(1) ≥ δ) → 0.   (18)  □
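Theorem 3.1 can be illustrated by simulation. The sketch below is my own toy example (the regression function, noise scale, and all names are assumptions of the illustration, not from the note): it computes the LS extremum estimator of the nonlinear regression example with g(x, θ) = exp(θx) by grid search over a compact Θ = [0, 5], at a small and a large sample size.

```python
import numpy as np

def ls_estimate(n, seed, theta0=1.5):
    """LS extremum estimator (9) for the nonlinear regression model (8)
    with g(x, theta) = exp(theta * x), by grid search over Theta = [0, 5]."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=n)
    eps = 0.5 * rng.normal(size=n)               # E(eps | X) = 0
    Y = np.exp(theta0 * X) + eps
    grid = np.linspace(0.0, 5.0, 501)            # discretized Theta
    Qhat = [0.5 * np.mean((Y - np.exp(t * X)) ** 2) for t in grid]
    return grid[int(np.argmin(Qhat))]

err_small = abs(ls_estimate(200, seed=0) - 1.5)      # n = 200
err_large = abs(ls_estimate(20_000, seed=0) - 1.5)   # n = 20,000
```

Grid search keeps the example dependency-free; any minimizer of Q̂_n over Θ that satisfies Definition EE would serve equally well.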
Primitive Sufficient Conditions for the Identification Condition

The following assumption is sufficient for Assumption ID.

Assumption ID1: (i) Θ is compact. (ii) Q(θ) is continuous on Θ. (iii) θ_0 uniquely minimizes Q(θ) over θ ∈ Θ.

Lemma 3.1: ID1 ⟹ ID.

Proof of Lemma 3.1: (Problem Set Question.) (Hint: Use a proof by contradiction.)

Note that none of the three conditions of Assumption ID1 is redundant: if Θ is non-compact, Q(θ) is not continuous, or Q(θ) is not uniquely minimized at θ_0, then Assumption ID can fail. On the other hand, only Assumption ID1(iii) is a necessary condition for Assumption ID.

When does Assumption ID1 hold in the examples? Obviously ID1(i) holds in each example if Θ is compact. We have already given conditions under which ID1(iii) holds in each example. To verify ID1(ii) in the examples, the following lemma is useful.

Lemma 3.2: Suppose m(W_i, θ) is continuous in θ at each θ ∈ Θ with probability one and E sup_{θ∈Θ} |m(W_i, θ)| < ∞. Then, E m(W_i, θ) is continuous in θ on Θ.

For example, when Q̂_n(θ) is of the form Q̂_n(θ) = n^{−1} Σ_{i=1}^n m(W_i, θ), then Q(θ) = E m(W_i, θ). This occurs in the ML and nonlinear regression examples. In the GMM example, Q̂_n(θ) is of the form Q̂_n(θ) = ||A_n n^{−1} Σ_{i=1}^n m(W_i, θ)||²/2 and Q(θ) = ||A E m(W_i, θ)||²/2. In this case too, continuity of E m(W_i, θ) yields continuity of Q(θ).

Proof of Lemma 3.2: The result holds by the dominated convergence theorem with the dominating function given by sup_{θ∈Θ} |m(W_i, θ)|, since

lim_{θ*→θ} m(W_i, θ*) = m(W_i, θ) a.s. and   (19)

E sup_{θ∈Θ} |m(W_i, θ)| < ∞.   (20)  □

Using Lemma 3.2, we see that ID1(ii) holds for the

(1) ML estimator if f(w, θ) is continuous in θ at each θ ∈ Θ with probability one and E sup_{θ∈Θ} |log f(W_i, θ)| < ∞,

(2) LS estimator if g(x, θ) is continuous in θ at each θ ∈ Θ with probability one and E sup_{θ∈Θ} (Y_i − g(X_i, θ))² < ∞ (or, equivalently, E ε_i² < ∞ and E sup_{θ∈Θ} (g(X_i, θ_0) − g(X_i, θ))² < ∞), and

(3) GMM estimator if g(w, θ) is continuous in θ at each θ ∈ Θ with probability one and E sup_{θ∈Θ} ||g(W_i, θ)|| < ∞.
Assumption ID1(ii) holds for the MD and TS estimators if g(θ) and G(θ, τ_0), respectively, are continuous in θ on Θ.

Consistency without Point Identification

In recent years, there has been considerable interest in partially identified models, in which the parameters cannot be pinned down to a unique point even with an infinite amount of data. In the context of extremum estimation, partial identification amounts to the failure of Assumption ID. Typically, Assumption ID is replaced by a set identification condition:

Assumption SetID: There exists Θ_0 ⊂ Θ such that sup_{θ∈Θ_0} Q(θ) = inf_{θ∈Θ} Q(θ) and ∀ε > 0, inf_{θ∉B(Θ_0,ε)} Q(θ) > inf_{θ∈Θ} Q(θ), where B(Θ_0, ε) denotes the union of the open balls B(θ_0, ε) over θ_0 ∈ Θ_0.

The set Θ_0 is called the identified set. Assumption EE is replaced with a set estimator Θ̂_n. Consistency of the set estimator typically means consistency in the Hausdorff distance, where the Hausdorff distance is a distance measure between two sets:

d_H(A, B) = sup_{a∈A} inf_{b∈B} ||a − b|| + sup_{b∈B} inf_{a∈A} ||a − b||.   (21)

A natural set estimator is the argmin set of Q̂_n(θ). However, under Assumptions U-WCON and SetID, this set estimator is not necessarily consistent in the Hausdorff distance. It is not difficult to show (problem set question) that it is half-Hausdorff consistent under these two conditions, i.e.,

sup_{θ∈Θ̂_n} inf_{θ_0∈Θ_0} ||θ − θ_0|| →_p 0.   (22)

That is, the argmin set approaches the identified set, but it might not include enough points to approach every point of the identified set. There have been different proposals for solving this problem:

1. Hausdorff consistency of the argmin set can be proved in the special case where Q̂_n(θ) is degenerate on the interior of Θ_0 and the closure of the interior equals Θ_0, i.e., sup_{θ∈Θ_0^ε} Q̂_n(θ) = 0 = inf_{θ∈Θ} Q̂_n(θ) with probability approaching 1 for any ε > 0, where Θ_0^ε is the set of points of Θ_0 that are at least ε away from the complement of Θ_0 and lim_{ε→0} d_H(Θ_0, Θ_0^ε) = 0. This is a special case of Chernozhukov, Hong and Tamer (2007, Condition C.3).

2. Use a bigger set: let Θ̂_n be the set of all points such that Q̂_n(θ) ≤ inf_{θ∈Θ} Q̂_n(θ) + τ_n, where τ_n is a positive sequence that converges to zero at an appropriate rate. A stronger (convergence rate) assumption is needed in place of U-WCON for such estimators to be consistent. (Chernozhukov, Hong and Tamer (2007, Econometrica), Theorem 3.1)

3. Abandon consistent estimation altogether and focus on confidence sets. The argument for this proposal is that the identified set is not the true parameter and thus is not the parameter of
interest anyway, while a confidence set for the true parameter can be constructed and interpreted in the conventional sense. (Andrews and Soares (2010, Econometrica), Andrews and Guggenberger (2010, Econometrica), etc.)

4. Make assumptions so that certain features of the identified set (like the upper left corner) are point identified, and focus on such features. (Pakes, Porter, Ho and Ishii (2012, Econometrica))

5. Bayesian analysis, where there is no notion of partial identification. (Liao and Jiang (2010, Annals of Statistics))

Below I describe Chernozhukov, Hong and Tamer (2007)'s results regarding solutions 1 and 2. But before doing so, here is a motivating example of a partially identified model.

Example (Interval Outcome in a Regression Model). Suppose that one has a regression model:

Y = X′β_0 + ε, E(εX) = 0,   (23)

where Y is the outcome variable and X is the vector of regressors. Assume that the variables in X are all nonnegative. Suppose that Y is unobserved; only an interval [Y_l, Y_u] that contains Y is observed. No additional information on how [Y_l, Y_u] is related to Y is available. Then the model implies (and is implied by) the following moment inequalities:

E[XY_u − XX′β_0] ≥ 0 and E[XX′β_0 − XY_l] ≥ 0.   (24)

One criterion function used for estimating the identified set of β_0 is

Q̂_n(β) = ||A_n [m̄_n(β)]_−||²,   (25)

where A_n is a weighting matrix, [x]_− = min{x, 0} elementwise, and

m̄_n(β) = ((n^{−1} Σ_{i=1}^n X_i(Y_{u,i} − X_i′β))′, (n^{−1} Σ_{i=1}^n X_i(X_i′β − Y_{l,i}))′)′.   (26)

The function Q̂_n(β) converges (under appropriate conditions) uniformly to the function Q(β) = ||A[m(β)]_−||², where

m(β) = ((E[XY_u − XX′β])′, (E[XX′β − XY_l])′)′.   (27)

The function Q(β) does not have a unique minimum.

Proposition 3.1. Under U-WCON, SetID and the degeneracy condition described in solution 1 above, the argmin set Θ̂_n = {θ ∈ Θ : Q̂_n(θ) = inf_{θ̃∈Θ} Q̂_n(θ̃)} is consistent in the Hausdorff distance.
Proof. The Hausdorff consistency is implied by Equation (22) and

sup_{θ_0∈Θ_0} inf_{θ∈Θ̂_n} ||θ − θ_0|| →_p 0.   (28)

The proof of Equation (22) is left as a problem set question. Here we prove (28). Consider the following derivation: for any ε > 0,

Pr(sup_{θ_0∈Θ_0} inf_{θ∈Θ̂_n} ||θ − θ_0|| > ε) ≤ Pr(sup_{θ_0∈Θ_0} inf_{θ∈Θ_0^δ} ||θ − θ_0|| > ε) + Pr(Θ_0^δ ⊄ Θ̂_n).   (29)

For any fixed δ > 0, the second term on the RHS of the above display converges to zero as n → ∞ by the degeneracy condition. The first term can be made arbitrarily small by choosing δ small enough. Therefore the limit of the LHS is zero as n → ∞. □

The problem with solution 1 is that the degeneracy condition often is not satisfied and is certainly tricky to verify. The second solution needs more assumptions on the convergence of the sample criterion function. The additional assumption is given below as Assumption RT, where RT stands for "rate of convergence". Part (a) of this assumption is without loss of generality, because one can always recenter the criterion functions by their infima and make this assumption hold. In part (b), the rate n can be replaced with any other rate, as long as the choice of τ_n in the proposition below adjusts with this rate accordingly.

Assumption RT: (a) inf_{θ∈Θ} Q̂_n(θ) = inf_{θ∈Θ} Q(θ) = 0, and (b) n sup_{θ∈Θ_0} |Q̂_n(θ) − Q(θ)| = O_p(1).

Proposition 3.2. Under Assumptions U-WCON, RT and SetID, if nτ_n → ∞, then the augmented argmin set defined in solution 2 above is consistent in the Hausdorff distance.

Proof. As in the previous proposition, we only prove (28). It suffices to show that Θ_0 ⊂ Θ̂_n with probability approaching 1. Consider the following derivation:

Pr(Θ_0 ⊄ Θ̂_n) ≤ Pr(sup_{θ∈Θ_0} Q̂_n(θ) > inf_{θ∈Θ} Q̂_n(θ) + τ_n)
= Pr(sup_{θ∈Θ_0} Q̂_n(θ) − inf_{θ∈Θ} Q(θ) > τ_n)
≤ Pr(sup_{θ∈Θ_0} |Q̂_n(θ) − Q(θ)| > τ_n)
= Pr(n sup_{θ∈Θ_0} |Q̂_n(θ) − Q(θ)| > nτ_n).   (30)
The last line of the above display goes to zero as n → ∞ due to Assumption RT(b) and nτ_n → ∞. □

The assumptions required for solution 2 are innocuous enough, but the approach leaves the choice of τ_n open. Other than a loose rate requirement, there is no practical way to choose τ_n. As a result, one researcher's set estimator can be ten times bigger than another's simply because of different choices of τ_n.
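To make Propositions 3.1 and 3.2 concrete, here is a self-contained numerical sketch of the interval-outcome example (my own toy implementation; the scalar design, A_n = I, and all names are assumptions of the illustration). It evaluates the criterion (25)–(26) on a grid, forms both the argmin set and the τ_n-augmented set, and measures the Hausdorff distance (21) (the sum of the two directed distances, as defined in the note) to the population identified set [β_0 − EX/EX², β_0 + EX/EX²].

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance (21): the sum of the two directed distances
    between finite subsets of R, given as 1-d arrays."""
    D = np.abs(A[:, None] - B[None, :])      # pairwise |a - b|
    return D.min(axis=1).max() + D.min(axis=0).max()

rng = np.random.default_rng(0)
n = 100_000
X = rng.uniform(0.5, 1.5, size=n)            # nonnegative scalar regressor
Y = 2.0 * X + rng.normal(size=n)             # latent outcome, beta0 = 2
Yl, Yu = Y - 1.0, Y + 1.0                    # observed interval containing Y

# sample moments; m_bar_n(b) of (26) is linear in b, so precompute
a_u, a_l, c = np.mean(X * Yu), np.mean(X * Yl), np.mean(X * X)

def Qhat(b):
    """Criterion (25)-(26) with A_n = I and [x]_- = min(x, 0) elementwise."""
    m = np.array([a_u - c * b, c * b - a_l])
    return np.sum(np.minimum(m, 0.0) ** 2)

grid = np.linspace(0.0, 4.0, 401)            # discretized parameter space
Q = np.array([Qhat(b) for b in grid])

# argmin set: Qhat is exactly zero on the interior of the identified set,
# so the degeneracy condition of solution 1 holds in this example
argmin_set = grid[Q <= Q.min() + 1e-12]

# tau_n-augmented set of solution 2, with tau_n -> 0 and n * tau_n -> infinity
tau_n = np.log(n) / n
augmented_set = grid[Q <= Q.min() + tau_n]

# population identified set: beta0 +/- E[X]/E[X^2] = 2 +/- 12/13
ident_set = np.linspace(2.0 - 12.0 / 13.0, 2.0 + 12.0 / 13.0, 401)
```

In this regular example both set estimates are close to the identified set in Hausdorff distance; in problems where the degeneracy condition fails, only the augmented set carries the guarantee, which is the point of Proposition 3.2.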