Measurement Error

Often a data set will contain imperfect measures of the data we would ideally like.

Aggregate Data (GDP, Consumption, Investment): these are only best guesses of their theoretical counterparts and are frequently revised by government statisticians (so earlier estimates must have been subject to error).

Survey Data (income, health, age): individuals often lie, forget or round to the nearest large number (102 a week or 100?).

Proxy Data (ability, intelligence, permanent income): difficult to agree on a definition, let alone measure.

Measurement Error in the Dependent Variable

True:     y* = b0 + b1x + u    (1)
Observe:  y  = y* + e          (2)

i.e. the dependent variable is measured with error e. The error e is a random residual term just like u, so E(e) = 0.

Substitute (2) into (1):

    y - e = b0 + b1x + u
    y     = b0 + b1x + u + e
    y     = b0 + b1x + v      where v = u + e    (3)

It is OK to estimate (3) by OLS, since E(u) = E(e) = 0 and Cov(x,u) = Cov(x,e) = 0 (there is nothing to suggest the x variable is correlated with measurement error in the dependent variable). So OLS estimates are unbiased in this case, but the standard errors are larger than they would be in the absence of measurement error:

True:      Var(b̂1) = σ_u² / (N·Var(x))            (A)
Estimate:  Var(b̂1) = (σ_u² + σ_e²) / (N·Var(x))   (B)

since Var(v) = Var(u+e) = Var(u) + Var(e) + 2Cov(e,u), and if we assume that the things that cause measurement error in y are unrelated to the residual u, then Cov(e,u) = 0. (B) shows that the residual variance in the presence of measurement error in the dependent variable now also contains an additional contribution from the error in the y variable, σ_e².
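To see this result concretely, here is a minimal simulation sketch. The sample size, seed, error variances and variable names are illustrative (not the handout's data set); the intercept and slope are set to 25 and 0.6 to match the example used later:

    * Sketch: measurement error in y leaves the slope unbiased
    * but inflates the standard errors (illustrative values)
    clear
    set obs 1000
    set seed 12345
    generate x = 100 + 20*rnormal()
    generate y_true = 25 + 0.6*x + 10*rnormal()    // true model, as in (1)
    generate y_observ = y_true + 10*rnormal()      // add error e, as in (2)
    regress y_true x       // slope close to 0.6
    regress y_observ x     // slope still close to 0.6, larger standard error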

Measurement Error in the Explanatory Variable

True:     y = b0 + b1x* + u    (1)
Observe:  x = x* + w           (2)

i.e. the right-hand-side variable is measured with error w. Substitute (2) into (1):

    y = b0 + b1(x - w) + u
    y = b0 + b1x - b1w + u
    y = b0 + b1x + v      where now v = -b1w + u    (3)

(so the residual term again consists of 2 components). Does this matter? In the model without measurement error, we know that

    b̂1 = Cov(x,y)/Var(x) = Cov(x, b0 + b1x + u)/Var(x) = b1 + Cov(x,u)/Var(x)    (4)

and by assumption Cov(x,u) = 0, so that E(b̂1) = b1. But in the presence of measurement error we estimate (3), and so

    Cov(x,v) = Cov(x* + w, -b1w + u)

Expanding terms (using (2) & (3)):

    Cov(x,v) = Cov(x*,u) + Cov(x*,-b1w) + Cov(w,u) + Cov(w,-b1w)

Since u and w are independent errors (caused by different factors), there is no reason to expect them to be correlated with each other or with the true value of x (any error in x should not depend on the level of x), so Cov(w,u) = Cov(x*,u) = Cov(x*,-b1w) = 0. This leaves

    Cov(x,v) = Cov(w,-b1w) = -b1·Cov(w,w) = -b1·Var(w) ≠ 0

In other words there is now a correlation between the x variable and the error term in (3).

So if we estimate (3) by OLS, this violates the main assumption needed to get an unbiased estimate: we need the covariance between the explanatory variable and the residual to be zero (see earlier notes). This is not the case when the explanatory variable is measured with error. (4) becomes

    b̂1 = Cov(x,y)/Var(x) = Cov(x, b0 + b1x + v)/Var(x) = b1 + Cov(x,v)/Var(x) = b1 - b1·Var(w)/Var(x)    (5)

so E(b̂1) ≠ b1, and OLS gives biased estimates in the presence of measurement error in the explanatory variable. Not only that: using (5) we can show that the OLS estimates are always biased toward zero (Attenuation Bias):

    if b1 > 0 then b1_OLS < b1
    if b1 < 0 then b1_OLS > b1

i.e. closer to zero in both cases (which means it is harder to reject any test that the coefficient is zero).

Example of the Consequences of Measurement Error in Data

This example uses artificially generated measurement error to illustrate the basic issues.

. list    /* list observations in data set */

          y_true   y_observ   x_true   x_observ
  1.     75.4666     67.601       80         60
  2.     74.9801    75.4438      100         80
  3.    102.8242   109.6956      120        100
  4.     125.765    129.459      140        120
  5.    106.5035   104.2388      160        140
  6.     131.438    125.839      180        200
  7.    149.3693   153.9926      200        220
  8.    143.8628   152.9208      220        240
  9.     177.528   176.3344      240        260
 10.    182.2748   174.5252      260        280

The data set measerr.dta gives the true (unobserved) values of y and x and their observed counterparts (the first 5 observations on x underestimate the true value by 20 and the last 5 overestimate it by 20).

First look at the regression of y on x using the true values:

. reg y_true x_true

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =  105.60
       Model |   11879.994     1   11879.994           Prob > F      =  0.0000
    Residual |  900.003252     8  112.500407           R-squared     =  0.9296
-------------+------------------------------           Adj R-squared =  0.9208
       Total |  12779.9974     9  1419.99971           Root MSE      =  10.607

      y_true |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
      x_true |   .5999999   .0583875    10.276    0.000     .4653581    .7346417
       _cons |   25.00002   10.47727     2.386    0.044     .8394026    49.16064

Now look at the consequence of measurement error in the dependent variable:

. reg y_observ x_true

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   77.65
       Model |  11880.0048     1  11880.0048           Prob > F      =  0.0000
    Residual |  1223.99853     8  152.999817           R-squared     =  0.9066
-------------+------------------------------           Adj R-squared =  0.8949
       Total |  13104.0033     9  1456.00037           Root MSE      =  12.369

    y_observ |      Coef.   Std. Err.       t     P>|t|    [95% Conf. Interval]
-------------+------------------------------------------------------------------
      x_true |   .6000001   .0680908     8.812    0.000     .4429824    .7570178
       _cons |   24.99999   12.21846     2.046    0.075    -3.175826     53.1758

Consequence: the coefficients are virtually identical (unbiased), but the standard errors are larger, and hence the t values are smaller and the confidence intervals wider.

Now measurement error in the explanatory variable:

. reg y_true x_observ

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   83.15
       Model |  11658.3625     1  11658.3625           Prob > F      =  0.0000
    Residual |  1121.63494     8  140.204367           R-squared     =  0.9122
-------------+------------------------------           Adj R-squared =  0.9013
       Total |  12779.9974     9  1419.99971           Root MSE      =  11.841

      y_true |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    x_observ |   .4522529   .0495956     9.12   0.000     .3378852    .5666206
       _cons |     50.117    9.22539     5.43   0.001     28.84338    71.39063

Consequence: both coefficients are biased, and the slope coefficient is biased toward zero (0.45 compared with 0.60), i.e. we underestimate the effect by 25%. The intercept is biased upward (compare 50.1 with 25.0). The problem is that Cov(x,u) ≠ 0, i.e. the residual and the right-hand-side variable are correlated. In such situations the x variable is said to be endogenous.
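The attenuation result can also be checked by simulation. A minimal sketch continuing the illustrative simulation above (w and its variance are made up for the illustration; x and y_true are defined there):

    * Sketch: measurement error in the regressor biases the OLS slope toward zero
    generate w = 15*rnormal()          // measurement error in x
    generate x_noisy = x + w
    regress y_true x           // slope close to the true 0.6
    regress y_true x_noisy     // slope attenuated toward zero,
                               // roughly 0.6*Var(x)/(Var(x)+Var(w))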

Solution? Get better data. If that is not possible, do something to get round the problem: replace the variable causing the correlation with the residual with one that is not so correlated, but that at the same time is still related to the original variable. Any variable that has these 2 properties is called an Instrumental Variable.

More formally, an instrument Z for the variable of concern x satisfies:

1. Cov(x,Z) ≠ 0, i.e. it is correlated with the problem variable;
2. Cov(Z,u) = 0, i.e. it is uncorrelated with the residual (so Z does not suffer from measurement error and is also not correlated with any unobservable factors influencing the dependent variable).

Instrumental variable (IV) estimation proceeds as follows. Given a model

    y = b0 + b1x + u    (1)

multiply by the instrument Z and take covariances:

    Cov(Z,y) = Cov(Z,b0) + b1·Cov(Z,x) + Cov(Z,u)
             = 0 + b1·Cov(Z,x) + 0

(using the rules on the covariance with a constant and the assumptions above). So

    b1_IV = Cov(Z,y)/Cov(Z,x)    (compare with b1_OLS = Cov(x,y)/Var(x))

The IV estimate is unbiased in large samples, i.e. consistent (this can be proved using similar steps to the above), which makes it a useful estimation technique to employ. However it can be shown that (in the 2-variable case)

    Var(b̂1_IV) = s² / (N · Var(x) · r²xz)

where r²xz is the square of the correlation coefficient between the endogenous variable and the instrument (compared with OLS: Var(b̂1_OLS) = s² / (N · Var(x))).

So IV estimation is less precise (less efficient) than OLS estimation when the explanatory variables are exogenous, but the greater the correlation between x and Z, the smaller is Var(b̂1_IV): IV estimates are generally more precise when the correlation between instrument and endogenous variable is large. However, if x and Z are perfectly correlated then Z must also be correlated with u, and so the problem is not solved.

Even though IV estimation is unbiased, this property is asymptotic, i.e. it only holds when the sample size is very large, since we can always write the IV estimator as

    b1_IV = Cov(Z,y)/Cov(Z,x) = b1 + Cov(Z,u)/Cov(Z,x)

(just substitute in for y from (1)). In small samples, poor correlation between x and Z can lead to small-sample bias. So: always check the extent of the correlation between x and Z before any IV estimation.

You can have as many instruments as you like, though finding good ones is a different matter. In large samples more is better; in small samples a minimum number of instruments is preferred.

Where to find good instruments? Difficult. The appropriate instrument will vary depending on the issue under study. In the case of measurement error, one could use the rank of x as an instrument (i.e. order the x variable by size and use the position in that ordering rather than the actual value), though this assumes that the measurement error is not so large as to affect the (true) ordering of the variable.

. egen rankx=rank(x_obs)    /* stata command to create the ranking of x_observ */
. list x_obs rankx

       x_observ   rankx
  1.         60       1
  2.         80       2
  3.        100       3
  4.        120       4
  5.        140       5
  6.        200       6
  7.        220       7
  8.        240       8
  9.        260       9
 10.        280      10

(ranks run from the smallest observed x to the largest)
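As a quick sanity check before running ivreg, the IV slope can be computed directly from the sample covariances, b1_IV = Cov(Z,y)/Cov(Z,x). A minimal sketch (r(C) is the covariance matrix saved by correlate; its rows and columns follow the variable order in the command):

    * Sketch: IV slope by hand from sample covariances
    quietly correlate y_true x_observ rankx, covariance
    matrix C = r(C)
    display "b_IV = " C[3,1]/C[3,2]    // Cov(rankx,y_true) / Cov(rankx,x_observ)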

Now do the instrumental variable estimation using rankx as the instrument for x_observ:

. ivreg y_t (x_ob=rankx)

Instrumental variables (2SLS) regression

      Source |       SS       df       MS              Number of obs =      10
-------------+------------------------------           F(  1,     8) =   84.44
       Model |  11654.5184     1  11654.5184           Prob > F      =  0.0000
    Residual |  1125.47895     8  140.684869           R-squared     =  0.9119
-------------+------------------------------           Adj R-squared =  0.9009
       Total |  12779.9974     9  1419.99971           Root MSE      =  11.861

      y_true |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    x_observ |    .460465    .050086     9.19   0.000      .344944     .576056
       _cons |   48.72095   9.307667     5.23   0.001     27.25743    70.18447

Instrumented:  x_observ
Instruments:   rankx

Both estimated coefficients are a little closer to their true values than the estimates from the regression with measurement error (but not by much: in this case the rank of x is not a very good instrument). Note that the standard error in the instrumented regression is larger than the standard error in the regression of y_true on x_observ, as expected with IV estimation.
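The two variance formulas above imply se(b1_IV) ≈ se(b1_OLS)/|r_xz|. A rough check using the standard errors reported above (a sketch; r(rho) is the correlation saved by correlate, and rounding will make the match inexact):

    * Sketch: relate the IV and OLS standard errors via the x-Z correlation
    quietly correlate x_observ rankx
    display "r_xz = " r(rho)
    display "se_OLS/|r_xz| = " .0495956/abs(r(rho))   // compare with the IV s.e. of .050086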

Testing for Endogeneity

It is good practice to compare OLS and IV estimates. If the estimates are very different, this may be a sign that things are amiss. Using the idea that IV estimation will always be (asymptotically) unbiased, whereas OLS will only be unbiased if Cov(x,u) = 0, we can do the following:

Wu-Hausman Test for Endogeneity

1. Given the model
       y = b0 + b1x + u    (A)
   regress x on the instrument(s) Z,
       x = d0 + d1Z + v    (B)
   and save the residuals v̂.

2. Include this residual as an extra term in the original model, i.e. estimate
       y = b0 + b1x + b2v̂ + e
   and test whether b2 = 0 (using a t test).

3. If b2 = 0, conclude there is no correlation between x and u. If b2 ≠ 0, conclude there is correlation between x and u. (Intuitively, since Z is assumed to be uncorrelated with u, i.e. the Z variables are exogenous, the only way x can be correlated with u is through v in (B).)

Example: The data set ivdat.dta contains information on the number of GCSE passes of a sample of 16 year olds and the total income of the household in which they live. Income tends to be measured with error: individuals tend to mis-report incomes, particularly third-party incomes and non-labour income. The following regression may therefore be subject to measurement error in one of the right-hand-side variables (the gender dummy variable is much less subject to error).

. reg nqfede inc female

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  2,   249) =   14.55
       Model |  274.029395     2  137.014698           Prob > F      =  0.0000
    Residual |  2344.97106   249  9.41755263           R-squared     =  0.1046
-------------+------------------------------           Adj R-squared =  0.0974
       Total |  2619.00045   251  10.4342629           Root MSE      =  3.0688

      nqfede |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         inc |   .0396859   .0087786     4.52   0.000     .0223961    .0569758
      female |    1.17235    .387686     3.02   0.003     .4087896    1.935931
       _cons |   4.929297   .4028493    12.24   0.000      4.13587    5.722723

To test for endogeneity, first regress the suspect variable on the instrument (here ranki, the rank of inc, by analogy with rankx above) and any exogenous variables in the original regression:

. reg inc ranki female

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  2,   249) =  247.94
       Model |  81379.4112     2  40689.7056           Prob > F      =  0.0000
    Residual |   40863.626   249  164.110948           R-squared     =  0.6657
-------------+------------------------------           Adj R-squared =  0.6630
       Total |  122243.037   251  487.024053           Root MSE      =  12.811

         inc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       ranki |    .247072   .0110979    22.26   0.000     .2252136    .2689289
      female |   .2342779   1.618777     0.14   0.885    -2.953962    3.422581
       _cons |    .772125   1.855748     0.42   0.678     -2.88272     4.42724

1. Save the residuals:

. predict uhat, resid

2. Include the residuals as an additional regressor in the original equation:

. reg nqfede inc female uhat

      Source |       SS       df       MS              Number of obs =     252
-------------+------------------------------           F(  3,   248) =    9.94
       Model |  281.121189     3  93.7070629           Prob > F      =  0.0000
    Residual |   2337.8788   248  9.42693069           R-squared     =  0.1073
-------------+------------------------------           Adj R-squared =  0.0965
       Total |  2619.00045   251  10.4342629           Root MSE      =  3.0703

      nqfede |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         inc |   .0450854   .0107655     4.19   0.000     .0238819    .0662888
      female |   1.176652    .387907     3.03   0.003     .4126329    1.940672
        uhat |   -.016473    .018669    -0.87   0.387    -.0528471    .0205201
       _cons |   4.753386    .451205    10.53   0.000      3.86471    5.642062

The added residual is not statistically significantly different from zero, so we conclude that there is no endogeneity bias in the OLS estimates, and hence no need to instrument.

Note that you can also get this result by typing the following command after the ivreg command:

. ivendog

Tests of endogeneity of: inc
H0: Regressor is exogenous
  Wu-Hausman F test:               0.75229   F(1,248)    P-value = 0.38659
  Durbin-Wu-Hausman chi-sq test:   0.762     Chi-sq(1)   P-value = 0.38267

(The first test is simply the square of the t value on uhat in the last regression, since t² = F.)

N.B. This test is only as good as the instruments used and is only valid asymptotically. This may be a problem in small samples, so you should generally use this test only with sample sizes well above 100.
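In more recent versions of Stata, the same estimates and endogeneity test are available through ivregress (a sketch; ivregress and estat endogenous replace ivreg and ivendog from Stata 10 onwards):

    * Sketch: modern equivalents of the ivreg and ivendog commands
    ivregress 2sls nqfede female (inc = ranki)
    estat endogenous       // reports Durbin chi2 and Wu-Hausman F tests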