Econ 2110, Fall 2016, Part IVb
Asymptotic Theory: δ-method and M-estimation
Maximilian Kasy
Department of Economics, Harvard University
Example
Suppose we estimate the average effect of class size on student exam grades, using the Project STAR data.
- What is the variance of our estimator?
- Can we form a confidence set for the size of the effect?
- Can we reject the null hypothesis of a zero average effect?
- Can we do so even if exam scores are not normally distributed?
Example
Suppose we estimate the top 1% income share using data on the number of individuals in different tax brackets, assuming that top incomes are Pareto distributed.
- Suppose we calculate the implied optimal top tax rate.
- Can we form a 95% confidence interval for this optimal tax rate?
Takeaways for this part of class
- How we get our formulas for standard deviations in many settings.
- When and why we can expect asymptotic normality for many estimators (and what that means).
- When we might expect problems to arise for asymptotic approximations.
Roadmap
- Part IVa: types of convergence; laws of large numbers (LLN) and central limit theorems (CLT)
- Part IVb: the delta method; M- and Z-estimators; special M-estimators: ordinary least squares (OLS) and maximum likelihood estimation (MLE); confidence sets
Part IVb
- The delta method
- M- and Z-estimators: consistency, asymptotic normality
- Special M-estimators: least squares, maximum likelihood
- Confidence sets
The delta method
Suppose we know the asymptotic behavior of a sequence $X_n$, we are interested in $Y_n = g(X_n)$, and $g$ is smooth. Often a Taylor expansion of $g$ around the probability limit of $X_n$ yields the answer, where we can ignore higher-order terms in the limit:
$$Y_n = g(\beta) + g'(\beta) \cdot (X_n - \beta) + o(\|X_n - \beta\|).$$
This idea is called the delta method.
Theorem (Delta method)
Assume that $r_n (X_n - \beta) \to_d X$ for some sequence $r_n \to \infty$ and some random variable $X$. Let $Y_n = g(X_n)$ for a function $g$ which is differentiable at $\beta$. Then
$$r_n (Y_n - g(\beta)) \to_d g'(\beta) \cdot X.$$
Proof:
By differentiability of $g$,
$$Y_n = g(\beta) + g'(\beta) \cdot (X_n - \beta) + o(\|X_n - \beta\|).$$
Rearranging gives
$$r_n (Y_n - g(\beta)) = r_n\, g'(\beta) \cdot (X_n - \beta) + r_n \cdot o(\|X_n - \beta\|).$$
The second term vanishes asymptotically, since $r_n (X_n - \beta)$ converges in distribution. The continuous mapping theorem, applied to matrix multiplication by $g'(\beta)$, now yields the claim.
Leading special case
Let $X_n$ be a sequence of random variables such that
$$\sqrt{n}(X_n - b) \to_d N(0, \sigma^2).$$
Let $g: \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $b$. Then
$$\sqrt{n}(g(X_n) - g(b)) \to_d N(0, (g'(b))^2 \sigma^2).$$
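To see the special case at work, here is a minimal simulation sketch (my illustration, not from the slides) for $g(x) = x^2$ at $b = 1$, where the delta method predicts an asymptotic variance of $(g'(b))^2 \sigma^2 = 4\sigma^2$:

```python
# Simulation check of the delta method for g(x) = x^2 at b = 1:
# sqrt(n) * (g(X_n) - g(b)) should be approximately N(0, (2b)^2 * sigma^2).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
b, sigma2 = 1.0, 2.0

# X_n is the sample mean of n iid draws with mean b and variance sigma^2
X_n = rng.normal(b, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (X_n**2 - b**2)

print("simulated variance:   ", stat.var())
print("delta-method variance:", (2 * b) ** 2 * sigma2)  # = 8
```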
Attention
There are important cases where the delta method provides poor approximations. Examples: near $\beta = 0$, for
1. $g(X) = X^2$
2. $g(X) = 1/X$
3. $g(X) = |X|$
Relevant for:
1. weak instruments
2. inference under partial identification / moment inequalities
Practice problem
Suppose the $X_i$ are iid with mean 1 and variance 2, and $n = 25$. Let $Y = \bar{X}^2$, the squared sample mean.
- Provide an approximation for the distribution of $Y$.
- Now suppose the $X_i$ have mean 0 and variance 2. Provide an approximation for the distribution of $Y$.
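A simulation sketch of this practice problem (my illustration, reading $Y$ as the squared sample mean): with mean 1 the delta method works well, while with mean 0 the first-order term vanishes and $n\bar{X}^2 \to_d 2\chi^2_1$, so the normal approximation fails:

```python
# Compare the simulated variance of Y = Xbar^2 with the delta-method
# prediction (2*mu)^2 * sigma^2 / n; the prediction degenerates to 0 at mu = 0.
import numpy as np

rng = np.random.default_rng(1)
n, reps, sigma2 = 25, 100_000, 2.0

for mu in (1.0, 0.0):
    X = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    Y = X.mean(axis=1) ** 2
    print(f"mu = {mu}: simulated Var(Y) = {Y.var():.4f},",
          f"delta-method Var(Y) = {(2 * mu) ** 2 * sigma2 / n:.4f}")
```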
M- and Z-estimators
Many interesting objects $\beta$ can be written in the form
$$\beta_0 = \operatorname{argmax}_\beta E[m(\beta, X)]. \quad (1)$$
This defines a mapping from the probability distribution of $X$ to a parameter $\beta$. In our decision-theory notation: $\beta_0 = \beta(\theta)$.
Example: least squares
The coefficients $\beta_0$ of the best linear predictor $\hat{Y} = X'\beta_0$ minimize the average squared prediction error,
$$\beta_0 = \operatorname{argmin}_\beta E[(Y - X'\beta)^2].$$
Thus $m(\beta, X, Y) = -(Y - X'\beta)^2$.
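For completeness (a standard step not spelled out on the slide, assuming $E[XX']$ is invertible), the first-order condition pins down $\beta_0$ in closed form:
$$0 = \nabla_\beta E[(Y - X'\beta)^2]\big|_{\beta = \beta_0} = -2\, E[X(Y - X'\beta_0)] \quad\Longrightarrow\quad \beta_0 = E[XX']^{-1} E[XY].$$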
Example: maximum likelihood
Suppose $Y$ is distributed according to the density $f(y, \beta_0)$. Then $\beta_0$ maximizes the expected log likelihood,
$$\beta_0 = \operatorname{argmax}_\beta E[\log(f(Y, \beta))].$$
(We will show this later.) Thus $m(\beta, Y) = \log(f(Y, \beta))$.
M-estimator
Use $E_n$ to denote the sample average, e.g.
$$E_n[X] = \frac{1}{n} \sum_{i=1}^n X_i.$$
We can define an estimator for $\beta$ which solves the analogous condition, replacing the population expectation by a sample average, that is,
$$\hat\beta = \operatorname{argmax}_\beta E_n[m(\beta, X)]. \quad (2)$$
Such an estimator is called an M-estimator (for "maximizer").
Examples continued
1. Least squares: the ordinary least squares (OLS) estimator
$$\hat\beta = \operatorname{argmin}_\beta E_n[(Y - X'\beta)^2]$$
2. Maximum likelihood: the maximum likelihood estimator (MLE)
$$\hat\beta = \operatorname{argmax}_\beta E_n[\log(f(Y, \beta))]$$
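As a concrete numerical sketch (my illustration, not from the slides), here is an M-estimator computed by maximizing $E_n[m(\beta, X)]$ numerically, using the log likelihood of an Exponential model with rate $\beta$; the closed-form MLE $1/\bar{X}$ is available as a check:

```python
# M-estimator as a numerical maximizer of the sample average E_n[m(beta, X)].
# Here m is the log likelihood of f(x, beta) = beta * exp(-beta * x).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
X = rng.exponential(scale=1 / 2.0, size=500)  # true rate beta_0 = 2

def neg_avg_m(beta):
    # minus E_n[log f(X, beta)]; we minimize this to maximize E_n[m]
    return -np.mean(np.log(beta) - beta * X)

res = minimize_scalar(neg_avg_m, bounds=(1e-6, 100.0), method="bounded")
print("M-estimate:", res.x, " closed-form MLE:", 1 / X.mean())
```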
Z-estimator
If $m$ is differentiable and $\beta_0$ is an interior maximizer, equation (1) implies the first-order condition
$$\partial_\beta E[m(\beta, X)]\big|_{\beta_0} = E[m'(\beta_0, X)] = 0.$$
If we directly define the estimator via
$$E_n[m'(\hat\beta, X_i)] = 0, \quad (3)$$
then $\hat\beta$ is called a Z-estimator (for "zero").
Practice problem
Find the first-order conditions for MLE and for OLS.
Solution:
1. Least squares:
$$E_n[\hat{e} X] = 0,$$
where $\hat{e} = Y - X'\hat\beta$ is the regression residual.
2. Maximum likelihood:
$$E_n[S(Y, \hat\beta)] = 0,$$
where
$$S(Y, \beta) := \frac{\partial \log(f(Y, \beta))}{\partial \beta}$$
is called the score.
Consistency
Basic requirement for good estimators: that they are close to the population estimand with high probability as the sample size gets large:
$$P(\|\hat\beta - \beta_0\| < \varepsilon) \to 1 \quad \text{for all } \varepsilon > 0.$$
Thus $\hat\beta \to_p \beta_0$. This property is called consistency.
Theorem (Consistency of M-estimators)
M-estimators are consistent if
1. $\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| \to_p 0$, and
2. $\sup_{\beta: \|\beta - \beta_0\| > \varepsilon} E[m(\beta, X)] < E[m(\beta_0, X)]$ for every $\varepsilon > 0$.
The first condition holds in many cases by some uniform law of large numbers. The second condition states that the maximum is well separated.
[Figure: Proof of consistency. Plot of $E[m(\beta, X)]$ against $\beta$, with the maximum at $\beta_0$, an $\varepsilon$-neighborhood around $\beta_0$, and a $\delta$-band below the maximum.]
Sketch of proof:
By assumption (2), for every $\varepsilon$ there is a $\delta$ such that if
$$\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| < \delta,$$
then $\|\hat\beta - \beta_0\| < \varepsilon$. By assumption (1),
$$\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| < \delta$$
happens with probability going to 1 as $n \to \infty$.
Asymptotic normality
What is the (approximate) distribution of M-estimators? Consistency just states that they converge to a point. But what if we blow up the scale appropriately, for instance by $\sqrt{n}$? Then we get convergence to a normal distribution!
Theorem
Under suitable differentiability conditions, M-estimators and Z-estimators are asymptotically normal,
$$\sqrt{n}(\hat\beta - \beta_0) \to_d N(0, V)$$
for some $V$.
[Figure: Proof of asymptotic normality. Plot of $E[m'(\beta, X)]$ and $E_n[m'(\beta, X)]$ against $\beta$, marking $\hat\beta$, $\beta_0$, and $E_n[m'(\beta_0, X)]$.]
Sketch of proof:
Asymptotic normality follows by arguments similar to our derivation of the delta method. If $m$ is twice differentiable, by the mean value theorem
$$0 = E_n[m'(\hat\beta, X)] = E_n[m'(\beta_0, X)] + E_n[m''(\tilde\beta, X)] \cdot (\hat\beta - \beta_0)$$
for some $\tilde\beta$ between $\hat\beta$ and $\beta_0$. Rearranging yields
$$\sqrt{n}(\hat\beta - \beta_0) = \left(-E_n[m''(\tilde\beta, X)]\right)^{-1} \sqrt{n}\, E_n[m'(\beta_0, X)].$$
Consistency of $\hat\beta$ and a uniform law of large numbers for $m''$ imply
$$\left(-E_n[m''(\tilde\beta, X)]\right)^{-1} \to_p \left(-E[m''(\beta_0, X)]\right)^{-1}.$$
The central limit theorem implies
$$\sqrt{n}\, E_n[m'(\beta_0, X)] \to_d N(0, \operatorname{Var}(m'(\beta_0, X))).$$
Slutsky's theorem then yields the asymptotic distribution of $\hat\beta$ as
$$\sqrt{n}(\hat\beta - \beta_0) \to_d N(0, V),$$
where
$$V = \left(E[m''(\beta_0, X)]\right)^{-1} \operatorname{Var}(m'(\beta_0, X)) \left(E[m''(\beta_0, X)]\right)^{-1}. \quad (4)$$
Estimators of the asymptotic variance
- Asymptotic variance: sandwich form.
- Estimators for this variance: sample analogs of both components. For instance,
$$\hat{V} = \left(E_n[m''(\hat\beta, X)]\right)^{-1} E_n[(m'(\hat\beta, X))^2] \left(E_n[m''(\hat\beta, X)]\right)^{-1}.$$
- This is the kind of variance estimator you get when you type ", robust" after some estimation commands in Stata.
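Continuing the Exponential-MLE sketch from above (my illustration, not from the slides): for that model $m'(\beta, x) = 1/\beta - x$ and $m''(\beta, x) = -1/\beta^2$, so the plug-in sandwich estimator can be computed directly and compared with the model-based standard error $\hat\beta/\sqrt{n}$:

```python
# Plug-in sandwich variance for the Exponential(rate = beta) MLE.
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=1 / 2.0, size=500)    # same setup as the sketch above
beta_hat = 1 / X.mean()                         # MLE

m1 = 1 / beta_hat - X                           # m'(beta_hat, X_i)
Em2 = -1 / beta_hat**2                          # E_n[m''(beta_hat, X)], constant in X
V_hat = (1 / Em2) * np.mean(m1**2) * (1 / Em2)  # sandwich formula

print("sandwich se:   ", np.sqrt(V_hat / len(X)))
print("model-based se:", beta_hat / np.sqrt(len(X)))
```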
Least squares
Recall OLS:
$$\hat\beta = \operatorname{argmin}_\beta E_n[(Y - X'\beta)^2]$$
First-order condition:
$$E_n[\hat{e} X] = 0, \quad \text{where } \hat{e} := Y - X'\hat\beta.$$
In our general notation:
$$m(Y, X, \beta) = -e^2 = -(Y - X'\beta)^2$$
$$m'(Y, X, \beta) = 2 e X$$
$$m''(Y, X, \beta) = -2 X X'$$
Apply the asymptotic results for general M-estimators:
- $\hat\beta$ is consistent for $\beta_0$, the coefficient vector of the best linear predictor,
$$\beta_0 = \operatorname{argmin}_\beta E[(Y - X'\beta)^2].$$
- $\hat\beta$ is asymptotically normal,
$$\sqrt{n}(\hat\beta - \beta_0) \to_d N(0, V).$$
Asymptotic variance:
$$V = \left(E[m'']\right)^{-1} \operatorname{Var}(m') \left(E[m'']\right)^{-1} = E[XX']^{-1}\, E[e^2 XX']\, E[XX']^{-1}$$
Heteroskedasticity-robust variance estimator for ordinary least squares:
$$\frac{1}{n}\, E_n[XX']^{-1}\, E_n[\hat{e}^2 XX']\, E_n[XX']^{-1} \quad (5)$$
The factor of $1/n$ gives the variance of $\hat\beta$ rather than of $\sqrt{n}\,\hat\beta$.
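A minimal sketch (my construction, not from the slides) of the robust variance estimator in equation (5), on simulated heteroskedastic data:

```python
# Heteroskedasticity-robust ("HC0") variance estimator, equation (5).
import numpy as np

rng = np.random.default_rng(3)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + regressor
e = rng.normal(size=n) * (1 + np.abs(X[:, 1]))         # heteroskedastic errors
y = X @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)           # OLS
e_hat = y - X @ beta_hat
Sxx_inv = np.linalg.inv(X.T @ X / n)                   # E_n[XX']^{-1}
meat = (X * e_hat[:, None] ** 2).T @ X / n             # E_n[e_hat^2 XX']
V_hat = Sxx_inv @ meat @ Sxx_inv / n                   # equation (5)
print("robust standard errors:", np.sqrt(np.diag(V_hat)))
```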
Maximum likelihood
Lemma
Suppose $Y \sim f(y, \beta_0)$, where $f$ denotes a family of densities indexed by $\beta$. Then
$$E[\log(f(Y, \beta_0))] \geq E[\log(f(Y, \beta))]. \quad (6)$$
The inequality is strict if $f(Y, \beta_0) \neq f(Y, \beta)$ with positive probability.
Sketch of proof:
Want to show:
$$0 \geq \int \log(f(y, \beta)) f(y, \beta_0)\,dy - \int \log(f(y, \beta_0)) f(y, \beta_0)\,dy = \int \log\left(\frac{f(y, \beta)}{f(y, \beta_0)}\right) f(y, \beta_0)\,dy.$$
Jensen's inequality, applied to the concave function $\log$:
$$\int \log\left(\frac{f(y, \beta)}{f(y, \beta_0)}\right) f(y, \beta_0)\,dy \leq \log\left(\int \frac{f(y, \beta)}{f(y, \beta_0)}\, f(y, \beta_0)\,dy\right) = \log(1) = 0.$$
Terminology for maximum likelihood
Log likelihood:
$$L_n(\beta) = n \cdot E_n[m(Y, \beta)] = \sum_i \log(f(Y_i, \beta))$$
Score:
$$S_i(\beta) = m'(Y_i, \beta) = \partial_\beta \log(f(Y_i, \beta))$$
Information:
$$I(\beta) = -E[m''(Y, \beta)] = E[-\partial S / \partial \beta]$$
Lemma
If $Y_i \sim f(y, \beta_0)$, then
$$\operatorname{Var}(S(\beta_0)) = I(\beta_0) = E[-\partial S(\beta_0) / \partial \beta].$$
Proof: Differentiate
$$0 = E[S] = \int S(y, \beta_0) f(y, \beta_0)\,dy$$
with respect to $\beta_0$ to get
$$0 = \int S' f\,dy + \int S f'\,dy = E[S'] + E[S^2],$$
using $\partial_\beta f = S f$.
But: parametric models are usually wrong, so don't trust this equality. If it holds, the asymptotic variance of the MLE simplifies to
$$V = E[S']^{-1} E[S^2] E[S']^{-1} = I(\beta_0)^{-1}.$$
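A small numeric sketch (my illustration) of the information equality in the Exponential(rate $\beta_0$) model, where $S = 1/\beta - Y$ and $\partial S/\partial\beta = -1/\beta^2$, so $\operatorname{Var}(S(\beta_0)) = \operatorname{Var}(Y) = 1/\beta_0^2 = I(\beta_0)$:

```python
# Check Var(S(beta_0)) = -E[dS/dbeta] = I(beta_0) by simulation.
import numpy as np

rng = np.random.default_rng(4)
beta0 = 2.0
Y = rng.exponential(scale=1 / beta0, size=1_000_000)

S = 1 / beta0 - Y                        # score at the true parameter
print("Var(S):      ", S.var())          # simulated
print("-E[dS/dbeta]:", 1 / beta0**2)     # analytic, = I(beta_0) = 0.25
```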
Confidence sets
- Confidence set $C$: a set of $\beta$'s, calculated as a function of the data $Y$.
- Confidence set $C$ for $\beta$ of level $\alpha$:
$$P(\beta_0 \in C) \geq 1 - \alpha \quad (7)$$
for all distributions of $Y$ (i.e., all $\theta$) and corresponding $\beta_0$. In this expression, $\beta_0$ is fixed and $C$ is random.
- Confidence set $C_n$ for $\beta$ of asymptotic level $\alpha$:
$$\lim_{n \to \infty} P(\beta_0 \in C_n) \geq 1 - \alpha. \quad (8)$$
Confidence sets for M-estimators
We can use asymptotic normality to get asymptotic confidence sets. Suppose
$$\sqrt{n}(\hat\beta - \beta_0) \to_d N(0, V), \qquad \hat{V} \to_p V.$$
Define
$$\tilde\beta := \sqrt{n}\, \hat{V}^{-1/2} (\hat\beta - \beta_0).$$
By Slutsky's theorem,
$$\tilde\beta \to_d N(0, I),$$
and therefore
$$\|\tilde\beta\|^2 \to_d \chi^2_k,$$
where $k = \dim(\beta)$.
Let $\chi^2_{k, 1-\alpha}$ be the $1-\alpha$ quantile of the $\chi^2_k$ distribution. Define
$$C_n = \left\{\beta : \left\|\sqrt{n}\, \hat{V}^{-1/2} (\hat\beta - \beta)\right\|^2 \leq \chi^2_{k, 1-\alpha}\right\}. \quad (9)$$
We get $P(\beta_0 \in C_n) \to 1 - \alpha$. $C_n$ is a confidence set for $\beta$ of asymptotic level $\alpha$. $C_n$ is an ellipsoid.
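A hedged sketch (my construction, not from the slides) of the confidence set in equation (9): the helper `in_confidence_set` is a hypothetical name, and the coverage check uses the sample mean, whose asymptotic variance is estimated by the sample variance:

```python
# Wald confidence set from equation (9), plus a quick coverage check.
import numpy as np
from scipy.stats import chi2

def in_confidence_set(beta, beta_hat, V_hat, n, alpha=0.05):
    """Check beta in C_n; V_hat estimates Var of sqrt(n)(beta_hat - beta_0)."""
    diff = beta_hat - beta
    stat = n * diff @ np.linalg.solve(V_hat, diff)
    return stat <= chi2.ppf(1 - alpha, df=beta.size)

rng = np.random.default_rng(5)
n, reps, beta0 = 200, 2_000, np.array([1.0])
cover = 0
for _ in range(reps):
    X = rng.exponential(scale=beta0[0], size=n)
    cover += in_confidence_set(beta0, np.array([X.mean()]),
                               np.atleast_2d(X.var()), n)
print("coverage:", cover / reps)  # should be close to 0.95
```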