δ-method and M-estimation


Econ 2110, fall 2016, Part IVb
Asymptotic Theory: δ-method and M-estimation
Maximilian Kasy, Department of Economics, Harvard University

Example
Suppose we estimate the average effect of class size on student exam grades, using the Project STAR data. What is the variance of our estimator? Can we form a confidence set for the size of the effect? Can we reject the null hypothesis of a zero average effect, even if exam scores are not normally distributed?

Example
Suppose we estimate the top 1% income share using data on the number of individuals in different tax brackets, assuming that top incomes are Pareto distributed. Suppose we calculate the implied optimal top tax rate. Can we form a 95% confidence interval for this optimal tax rate?

Takeaways for this part of class
- How we get our formulas for standard deviations in many settings.
- When and why we can expect asymptotic normality for many estimators (and what that means).
- When we might expect problems to arise for asymptotic approximations.

Roadmap
Part IVa: Types of convergence; laws of large numbers (LLN) and central limit theorems (CLT).
Part IVb: The delta method; M- and Z-estimators; special M-estimators (ordinary least squares (OLS), maximum likelihood estimation (MLE)); confidence sets.

Part IVb
- The delta method
- M- and Z-estimators: consistency, asymptotic normality
- Special M-estimators: least squares, maximum likelihood
- Confidence sets

The delta method
Suppose we know the asymptotic behavior of a sequence $X_n$, we are interested in $Y_n = g(X_n)$, and $g$ is smooth. Often a Taylor expansion of $g$ around the probability limit $\beta$ of $X_n$ yields the answer, where we can ignore higher-order terms in the limit:
$$Y_n = g(\beta) + g'(\beta)\,(X_n - \beta) + o_P(\|X_n - \beta\|).$$
This idea is called the delta method.

Theorem (Delta method)
Assume that $r_n (X_n - \beta) \to_d X$ for some sequence $r_n \to \infty$ and some random variable $X$. Let $Y_n = g(X_n)$ for a function $g$ which is differentiable at $\beta$. Then
$$r_n (Y_n - g(\beta)) \to_d g'(\beta)\, X.$$

Proof: By differentiability of $g$,
$$Y_n = g(\beta) + g'(\beta)\,(X_n - \beta) + o_P(\|X_n - \beta\|).$$
Rearranging gives
$$r_n (Y_n - g(\beta)) = r_n\, g'(\beta)\,(X_n - \beta) + r_n\, o_P(\|X_n - \beta\|).$$
The second term vanishes asymptotically, since $r_n (X_n - \beta)$ converges in distribution. The continuous mapping theorem, applied to matrix multiplication by $g'(\beta)$, now yields the claim.

Leading special case
Let $X_n$ be a sequence of random variables such that $\sqrt{n}(X_n - b) \to_d N(0, \sigma^2)$. Let $g: \mathbb{R} \to \mathbb{R}$ be continuously differentiable at $b$. Then
$$\sqrt{n}(g(X_n) - g(b)) \to_d N(0, (g'(b))^2\, \sigma^2).$$
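A minimal simulation sketch (not from the slides; assumes numpy is available) checking this leading special case numerically, with the illustrative choice $g(x) = e^x$: the simulated standard deviation of $\sqrt{n}(g(\bar{X}_n) - g(b))$ should be close to $|g'(b)|\,\sigma$.

```python
# Simulation sketch of the delta method's leading special case (illustrative choices throughout):
# X_i iid with mean b and standard deviation sigma, g(x) = exp(x),
# so sqrt(n)*(g(Xbar_n) - g(b)) should be approximately N(0, (g'(b))^2 * sigma^2) = N(0, exp(2b) * sigma^2).
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
b, sigma = 1.0, 2.0

Xbar = rng.normal(b, sigma, size=(reps, n)).mean(axis=1)   # sample means across simulated datasets
Z = np.sqrt(n) * (np.exp(Xbar) - np.exp(b))                # delta-method statistic

print("simulated sd :", Z.std())
print("delta method :", np.exp(b) * sigma)                 # |g'(b)| * sigma
```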

Attention
There are important cases where the delta method provides poor approximations. Examples, near $\beta = 0$:
1. $g(X) = X^2$
2. $g(X) = 1/X$
3. $g(X) = |X|$
Relevant for:
1. weak instruments
2. inference under partial identification / moment inequalities

Practice problem
Suppose the $X_i$ are i.i.d. with mean 1 and variance 2, and $n = 25$. Let $Y = \bar{X}^2$, where $\bar{X}$ is the sample mean. Provide an approximation for the distribution of $Y$. Now suppose $X_i$ has mean 0 and variance 2. Provide an approximation for the distribution of $Y$.
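A sketch of the practice problem by simulation (illustrative; normal $X_i$ assumed for concreteness). In the first case the delta method with $g(x) = x^2$, $g'(1) = 2$, suggests $\sqrt{n}(\bar{X}^2 - 1) \approx N(0, 8)$; in the second case $g'(0) = 0$, the first-order approximation degenerates, and $n\bar{X}^2$ converges to $\sigma^2 \chi^2_1$ instead.

```python
# Sketch of the practice problem by simulation (normal X_i assumed for concreteness).
import numpy as np

rng = np.random.default_rng(1)
n, reps = 25, 100_000

# Case 1: mean 1, variance 2. Delta method: sqrt(n)*(Xbar^2 - 1) approx N(0, (2*1)^2 * 2) = N(0, 8).
Xbar = rng.normal(1.0, np.sqrt(2.0), size=(reps, n)).mean(axis=1)
print("case 1: sd of sqrt(n)*(Y - 1):", (np.sqrt(n) * (Xbar**2 - 1)).std(), " vs sqrt(8) =", np.sqrt(8))

# Case 2: mean 0, variance 2. Here g'(0) = 0, so the first-order delta method degenerates;
# instead n * Xbar^2 converges to sigma^2 * chi^2_1, which is skewed, not normal.
Xbar0 = rng.normal(0.0, np.sqrt(2.0), size=(reps, n)).mean(axis=1)
print("case 2: mean of n*Y:", (n * Xbar0**2).mean(), " vs sigma^2 =", 2.0)
```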

M- and Z-Estimators
Many interesting objects $\beta$ can be written in the form
$$\beta_0 = \operatorname{argmax}_\beta E[m(\beta, X)]. \qquad (1)$$
This defines a mapping from the probability distribution of $X$ to a parameter $\beta$. In our decision theory notation: $\beta_0 = \beta(\theta)$.

Example - Least squares
The coefficients $\beta_0$ of the best linear predictor $\hat{Y} = X'\beta_0$ minimize the average squared prediction error,
$$\beta_0 = \operatorname{argmin}_\beta E[(Y - X'\beta)^2].$$
Thus $m(\beta, X, Y) = (Y - X'\beta)^2$.

Example - Maximum likelihood
Suppose $Y$ is distributed according to the density $f(y, \beta_0)$, i.e. $Y \sim f(y, \beta_0)$. Then $\beta_0$ maximizes the expected log likelihood,
$$\beta_0 = \operatorname{argmax}_\beta E[\log(f(Y, \beta))].$$
We will show this later. Thus $m(\beta, Y) = \log(f(Y, \beta))$.

M-Estimator
Use $E_n$ to denote the sample average, e.g.
$$E_n[X] = \frac{1}{n} \sum_{i=1}^{n} X_i.$$
We can define an estimator for $\beta$ which solves the analogous conditions, replacing the population expectation by a sample average, that is,
$$\hat{\beta} = \operatorname{argmax}_\beta E_n[m(\beta, X)]. \qquad (2)$$
Such an estimator is called an M-estimator (for "maximizer").

Examples continued
1. Least squares: the ordinary least squares (OLS) estimator
$$\hat{\beta} = \operatorname{argmin}_\beta E_n[(Y - X'\beta)^2].$$
2. Maximum likelihood: the maximum likelihood estimator (MLE)
$$\hat{\beta} = \operatorname{argmax}_\beta E_n[\log(f(Y, \beta))].$$
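A minimal sketch of an M-estimator computed by direct numerical maximization of $E_n[m(\beta, X)]$, here for the illustrative exponential model $f(y, \beta) = \beta e^{-\beta y}$ (not from the slides; assumes numpy and scipy are available).

```python
# MLE as an M-estimator: maximize the sample average of m(beta, Y) = log f(Y, beta) numerically.
# Illustrative model: exponential density f(y, beta) = beta * exp(-beta * y) with rate beta.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
beta0 = 2.0
Y = rng.exponential(scale=1 / beta0, size=500)             # data drawn with true rate beta0

def neg_avg_loglik(beta):
    # -E_n[log f(Y, beta)] = -mean(log(beta) - beta * Y); minimize the negative to maximize.
    return -np.mean(np.log(beta) - beta * Y)

beta_hat = minimize_scalar(neg_avg_loglik, bounds=(1e-6, 50), method="bounded").x
print("M-estimator (MLE):", beta_hat, " closed form 1/Ybar:", 1 / Y.mean())
```

Here the numerical maximizer agrees with the closed-form MLE $1/\bar{Y}$, which is the point of treating MLE as an M-estimator.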

Z-Estimator
If $m$ is differentiable and $\beta_0$ is an interior maximizer, equation (1) implies the first-order conditions
$$\partial_\beta E[m(\beta, X)]\big|_{\beta = \beta_0} = E[m'(\beta_0, X)] = 0.$$
If we directly define the estimator via
$$E_n[m'(\hat{\beta}, X_i)] = 0, \qquad (3)$$
then $\hat{\beta}$ is called a Z-estimator (for "zero").

Practice problem
Find the first-order conditions for MLE and for OLS.

Solution:
1. Least squares:
$$E_n[\hat{e}\, X] = 0,$$
where $\hat{e} = Y - X'\hat{\beta}$ is the regression residual.
2. Maximum likelihood:
$$E_n[S(Y, \hat{\beta})] = 0,$$
where
$$S(Y, \beta) := \frac{\partial \log(f(Y, \beta))}{\partial \beta}$$
is called the score.

Consistency
Basic requirement for good estimators: that they are close to the population estimand with large probability as sample sizes get large:
$$P(\|\hat{\beta} - \beta_0\| < \varepsilon) \to 1 \quad \text{for all } \varepsilon > 0.$$
Thus $\hat{\beta} \to_p \beta_0$. This property is called consistency.

Theorem (Consistency of M-Estimators)
M-estimators are consistent if
1. $\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| \to_p 0$, and
2. $\sup_{\beta:\, \|\beta - \beta_0\| > \varepsilon} E[m(\beta, X)] < E[m(\beta_0, X)]$ for every $\varepsilon > 0$.
The first condition holds in many cases by some uniform law of large numbers. The second condition states that the maximum is well separated.

Figure: Proof of consistency. [Plot of $E[m(\beta, X)]$ against $\beta$, with a well-separated maximum at $\beta_0$; the bands of width $\delta$ around the maximal value and of width $\varepsilon$ around $\beta_0$ illustrate the argument below.]

Sketch of proof: By assumption (2), for every $\varepsilon$ there is a $\delta$ such that if
$$\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| < \delta,$$
then $\|\hat{\beta} - \beta_0\| < \varepsilon$. By assumption (1),
$$\sup_\beta \left| E_n[m(\beta, X)] - E[m(\beta, X)] \right| < \delta$$
happens with probability going to 1 as $n \to \infty$.

Asymptotic normality
What is the (approximate) distribution of M-estimators? Consistency just states that they converge to a point. But what if we blow up the scale appropriately, for instance by $\sqrt{n}$? Then we get convergence to a normal distribution!

Theorem
Under suitable differentiability conditions, M-estimators and Z-estimators are asymptotically normal,
$$\sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, V)$$
for some $V$.

Figure: Proof of asymptotic normality. [Plot of $E[m'(\beta, X)]$ and $E_n[m'(\beta, X)]$ against $\beta$, with zeros at $\beta_0$ and $\hat{\beta}$ respectively, and the value $E_n[m'(\beta_0, X)]$ marked.]

Sketch of proof: This follows by arguments similar to our derivation of the delta method. If $m$ is twice differentiable, then by the intermediate value theorem
$$0 = E_n[m'(\hat{\beta}, X)] = E_n[m'(\beta_0, X)] + E_n[m''(\bar{\beta}, X)]\,(\hat{\beta} - \beta_0)$$
for some $\bar{\beta}$ between $\hat{\beta}$ and $\beta_0$. Rearranging yields
$$\sqrt{n}(\hat{\beta} - \beta_0) = -\left(E_n[m''(\bar{\beta}, X)]\right)^{-1} \sqrt{n}\, E_n[m'(\beta_0, X)].$$
Consistency of $\hat{\beta}$ and a uniform law of large numbers for $m''$ imply
$$\left(E_n[m''(\bar{\beta}, X)]\right)^{-1} \to_p \left(E[m''(\beta_0, X)]\right)^{-1}.$$

The central limit theorem implies
$$\sqrt{n}\, E_n[m'(\beta_0, X)] \to_d N(0, \operatorname{Var}(m'(\beta_0, X))).$$
Slutsky's theorem then yields the asymptotic distribution of $\hat{\beta}$ as
$$\sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, V),$$
where
$$V = \left(E[m''(\beta_0, X)]\right)^{-1} \operatorname{Var}(m'(\beta_0, X)) \left(E[m''(\beta_0, X)]\right)^{-1}. \qquad (4)$$
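A Monte Carlo sketch of this result for one concrete M-estimator, the exponential-rate MLE $\hat{\beta} = 1/\bar{Y}$, for which $V = I(\beta_0)^{-1} = \beta_0^2$ (illustrative choice, not from the slides; assumes numpy).

```python
# Monte Carlo sketch of asymptotic normality of an M-estimator: the exponential-rate MLE beta_hat = 1/Ybar.
# Here V = I(beta0)^{-1} = beta0^2, so sqrt(n)*(beta_hat - beta0) should be approximately N(0, beta0^2).
import numpy as np

rng = np.random.default_rng(3)
n, reps, beta0 = 200, 20_000, 2.0

Y = rng.exponential(scale=1 / beta0, size=(reps, n))
beta_hat = 1 / Y.mean(axis=1)
Z = np.sqrt(n) * (beta_hat - beta0)

print("simulated sd :", Z.std(), "  asymptotic sd:", beta0)
# Fraction of simulations where beta0 lies within +/- 1.96 * beta0 / sqrt(n) of beta_hat (should be about 0.95):
print("coverage     :", np.mean(np.abs(beta_hat - beta0) <= 1.96 * beta0 / np.sqrt(n)))
```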

Estimators of the asymptotic variance
The asymptotic variance has a sandwich form. Estimators for this variance use sample analogs of both components, for instance
$$\hat{V} = \left(E_n[m''(\hat{\beta}, X)]\right)^{-1} E_n\!\left[(m'(\hat{\beta}, X))^2\right] \left(E_n[m''(\hat{\beta}, X)]\right)^{-1}.$$
This is the kind of variance estimator you get when you type ", robust" after some estimation commands in Stata.

Special M-Estimators: Least squares
Recall OLS:
$$\hat{\beta} = \operatorname{argmin}_\beta E_n[(Y - X'\beta)^2].$$
First-order condition:
$$E_n[\hat{e}\, X] = 0, \quad \text{where } \hat{e} := Y - X'\hat{\beta}.$$
In our general notation:
$$m(Y, X, \beta) = e^2 = (Y - X'\beta)^2, \qquad m'(Y, X, \beta) = -2\, e\, X, \qquad m''(Y, X, \beta) = 2\, XX'.$$

Apply the asymptotic results for general M-estimators:
$\hat{\beta}$ is consistent for $\beta_0$, the coefficient vector of the best linear predictor,
$$\beta_0 = \operatorname{argmin}_\beta E[(Y - X'\beta)^2].$$
$\hat{\beta}$ is asymptotically normal,
$$\sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, V).$$

Asymptotic variance:
$$V = \left(E[m''(\beta_0, X)]\right)^{-1} \operatorname{Var}(m'(\beta_0, X)) \left(E[m''(\beta_0, X)]\right)^{-1} = E[XX']^{-1}\, E[e^2 XX']\, E[XX']^{-1}.$$
Heteroskedasticity-robust variance estimator for ordinary least squares:
$$\frac{1}{n}\, E_n[XX']^{-1}\, E_n[\hat{e}^2 XX']\, E_n[XX']^{-1}. \qquad (5)$$
The factor of $1/n$ gives the variance of $\hat{\beta}$ rather than of $\sqrt{n}\,\hat{\beta}$.
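A sketch implementing the robust variance estimator in equation (5) on simulated heteroskedastic data (all variable names illustrative; assumes numpy).

```python
# Heteroskedasticity-robust (sandwich) variance estimator for OLS, as in equation (5).
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])        # constant + one regressor
e = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))             # heteroskedastic error
beta0 = np.array([1.0, 2.0])
Y = X @ beta0 + e

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                 # OLS estimate
ehat = Y - X @ beta_hat                                      # residuals

Sxx = (X.T @ X) / n                                          # E_n[X X']
meat = (X * ehat[:, None]**2).T @ X / n                      # E_n[ehat^2 X X']
V_hat = np.linalg.inv(Sxx) @ meat @ np.linalg.inv(Sxx) / n   # equation (5): estimated variance of beta_hat

print("robust standard errors:", np.sqrt(np.diag(V_hat)))
```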

Special M-Estimators: Maximum likelihood
Lemma
Suppose $Y \sim f(y, \beta_0)$, where $f$ denotes a family of densities indexed by $\beta$. Then
$$E[\log(f(Y, \beta_0))] \geq E[\log(f(Y, \beta))]. \qquad (6)$$
The inequality is strict if $f(Y, \beta_0) \neq f(Y, \beta)$ with positive probability.

Sketch of proof: We want to show
$$0 \geq \int \log(f(y, \beta))\, f(y, \beta_0)\, dy - \int \log(f(y, \beta_0))\, f(y, \beta_0)\, dy = \int \log\!\left(\frac{f(y, \beta)}{f(y, \beta_0)}\right) f(y, \beta_0)\, dy.$$
Jensen's inequality, applied to the concave function $\log$:
$$\int \log\!\left(\frac{f(y, \beta)}{f(y, \beta_0)}\right) f(y, \beta_0)\, dy \leq \log\!\left(\int \frac{f(y, \beta)}{f(y, \beta_0)}\, f(y, \beta_0)\, dy\right) = \log(1) = 0.$$

Terminology for maximum likelihood
Log likelihood:
$$L_n(\beta) = n\, E_n[m(Y, \beta)] = \sum_i \log(f(Y_i, \beta)).$$
Score:
$$S_i(\beta) = m'(Y_i, \beta) = \partial_\beta \log(f(Y_i, \beta)).$$
Information:
$$I(\beta) = -E[m''(Y, \beta)] = E[-\partial S / \partial \beta].$$

Lemma
If $Y_i \sim f(y, \beta_0)$, then
$$\operatorname{Var}(S(\beta_0)) = I(\beta_0) = E[-\partial S(\beta_0) / \partial \beta].$$
Proof: Differentiate
$$0 = E[S] = \int S(y, \beta_0)\, f(y, \beta_0)\, dy$$
with respect to $\beta_0$ to get
$$0 = \int S'\, f\, dy + \int S\, f'\, dy = E[S'] + E[S^2].$$
But: parametric models are usually wrong, so don't trust this equality. If it holds, the asymptotic variance of the MLE simplifies to
$$V = E[S']^{-1}\, E[S^2]\, E[S']^{-1} = I(\beta_0)^{-1}.$$
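A simulation sketch checking the information equality $\operatorname{Var}(S(\beta_0)) = E[-\partial S/\partial\beta]$ in the illustrative exponential model, where $S(Y, \beta) = 1/\beta - Y$ (not from the slides; assumes numpy).

```python
# Check Var(S(beta0)) = E[-dS/dbeta] by simulation for f(y, beta) = beta * exp(-beta * y).
import numpy as np

rng = np.random.default_rng(5)
beta0 = 2.0
Y = rng.exponential(scale=1 / beta0, size=1_000_000)

S = 1 / beta0 - Y                    # score at beta0: d/dbeta log f = 1/beta - y
neg_S_prime = 1 / beta0**2           # -dS/dbeta = 1/beta^2 (non-random here)

print("Var(S)       :", S.var())
print("E[-dS/dbeta] :", neg_S_prime)  # both should be close to 1/beta0^2 = 0.25
```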

Confidence sets
A confidence set $C$ is a set of $\beta$'s, calculated as a function of the data $Y$.
Confidence set $C$ for $\beta$ of level $\alpha$:
$$P(\beta_0 \in C) \geq 1 - \alpha \qquad (7)$$
for all distributions of $Y$ (i.e., all $\theta$) and corresponding $\beta_0$. In this expression $\beta_0$ is fixed and $C$ is random.
Confidence set $C_n$ for $\beta$ of asymptotic level $\alpha$:
$$\lim_{n \to \infty} P(\beta_0 \in C_n) \geq 1 - \alpha. \qquad (8)$$

Confidence sets for M-estimators
We can use asymptotic normality to get asymptotic confidence sets. Suppose
$$\sqrt{n}(\hat{\beta} - \beta_0) \to_d N(0, V), \qquad \hat{V} \to_p V.$$
Define $\tilde{\beta} := \sqrt{n}\, \hat{V}^{-1/2} (\hat{\beta} - \beta_0)$. By Slutsky's theorem,
$$\tilde{\beta} \to_d N(0, I),$$
and therefore
$$\|\tilde{\beta}\|^2 \to_d \chi^2_k,$$
where $k = \dim(\beta)$.

Let $\chi^2_{k, 1-\alpha}$ be the $1-\alpha$ quantile of the $\chi^2_k$ distribution. Define
$$C_n = \left\{ \beta : \left\| \sqrt{n}\, \hat{V}^{-1/2} (\hat{\beta} - \beta) \right\|^2 \leq \chi^2_{k, 1-\alpha} \right\}. \qquad (9)$$
We get $P(\beta_0 \in C_n) \to 1 - \alpha$. $C_n$ is a confidence set for $\beta$ of asymptotic level $\alpha$. $C_n$ is an ellipsoid.
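A sketch of the confidence set in equation (9): a small helper (the name in_confidence_set is hypothetical) that checks whether a candidate $\beta$ satisfies the Wald inequality, given $\hat{\beta}$, a variance estimate $\hat{V}$, and $n$ (assumes numpy and scipy).

```python
# Check membership in the Wald/chi-square confidence set of equation (9):
# n * || V_hat^{-1/2} (beta_hat - beta) ||^2 <= chi^2_{k, 1-alpha}.
import numpy as np
from scipy.stats import chi2

def in_confidence_set(beta, beta_hat, V_hat, n, alpha=0.05):
    # Hypothetical helper: Wald statistic against the chi-square critical value.
    k = beta_hat.shape[0]
    diff = beta_hat - beta
    wald = n * diff @ np.linalg.solve(V_hat, diff)
    return wald <= chi2.ppf(1 - alpha, df=k)

# Example with a simulated bivariate estimator: beta_hat approximately N(beta0, V / n).
rng = np.random.default_rng(6)
beta0 = np.array([1.0, -0.5])
V = np.array([[2.0, 0.3], [0.3, 1.0]])
n = 400
beta_hat = rng.multivariate_normal(beta0, V / n)
print("beta0 in C_n:", in_confidence_set(beta0, beta_hat, V, n))
```

Repeating the last two lines over many simulated datasets, the fraction of runs with beta0 in C_n should approach $1 - \alpha$, which is exactly the asymptotic coverage statement above.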