. ========= DEFINITION OF ARTIFICIAL DATA SET =========
. mat m=(,,)                      /* matrix of means of RHS vars: edu, exp, error */
. mat c=(5,-.6, \ -.6,9, \,,.)    /* covariance matrix of RHS vars */
. mat l m                         /* displays matrix of means */
        c1   c2   c3
   r1
. mat l c                         /* displays covariance matrix */
symmetric c[3,3]
        c1   c2   c3
   r1    5
   r2  -.6    9
   r3              .
. drawnorm edu exp e, n(3) means(m) cov(c)
(obs 3)

. * Compare the normal and lognormal distributions
. g Y=exp(logy)
. gr Y, bin(4) norm saving($pathc\e1,replace)
. gr logy, bin(4) norm saving($pathc\e2,replace)
. gr using $pathc\e1 $pathc\e2

[Figure: histograms of Y and logy with normal density overlays; Y is right-skewed, logy is approximately normal]
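The drawnorm step above can be sketched outside Stata. The means, the 5/9/-.6 variance-covariance entries, and the wage-equation coefficients below are illustrative stand-ins (the listed matrices are partly garbled), but the mechanics, a Cholesky factor applied to independent standard normals followed by Y = exp(logy), are the same:

```python
# Sketch (not the original Stata code) of what drawnorm does: draw
# correlated normal RHS variables from a mean vector and covariance
# matrix, then exponentiate to get the lognormal Y.
# mu_edu, mu_exp and the wage-equation coefficients are assumptions.
import math, random

random.seed(1)
n = 10_000
mu_edu, mu_exp = 12.0, 20.0              # illustrative means
var_edu, var_exp, cov = 25.0, 9.0, -0.6  # mirrors the c matrix above

# 2x2 Cholesky factor of [[var_edu, cov], [cov, var_exp]]
l11 = math.sqrt(var_edu)
l21 = cov / l11
l22 = math.sqrt(var_exp - l21 ** 2)

edu, expr, logy = [], [], []
for _ in range(n):
    z1, z2, e = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 0.3)
    x1 = mu_edu + l11 * z1                 # edu
    x2 = mu_exp + l21 * z1 + l22 * z2      # exp, correlated with edu
    edu.append(x1)
    expr.append(x2)
    logy.append(7.6 + 0.07 * x1 + e)       # illustrative wage equation
Y = [math.exp(v) for v in logy]            # lognormal: right-skewed
```

With a large n the sample variance of edu and the edu-exp covariance reproduce the target matrix, and Y shows the mean-above-median skew that the paired histograms display.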
. ============= HETEROSKEDASTICITY =============
References:
  Stata Reference Manual [N-R], regression diagnostics, pp.357-
  Stata Programming [P], _robust, pp.34
  Wooldridge, Heteroskedasticity, pp.57
  Kennedy, ch.8, pp.33-56

. * Original error, without heteroskedasticity
. reg logy edu exp exp2

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  .
       Model |   6.76933       3      .9344         Prob > F      =  .
    Residual |   .746788      36     .949898        R-squared     =  .36
-------------+------------------------------        Adj R-squared =  .35
       Total |   65.4537      39      .4777         Root MSE      =  .389

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |    .675944       .989      .66    .         .67447      .73444
         exp |       .987        .857    3.93    .          .5683       .6789
        exp2 |      -.468         .69   -6.78    .          -.635       -.338
       _cons |    7.63638        .448   7.37    .        7.548485       7.748

. predict res, res
. g res2=res^2
. predict logy_h
(option xb assumed; fitted values)
. gr res2 logy_h, xlab ylab yline() t("no heter") saving($pathc\e3,replace)
. mat se=sqrt(el(e(V),1,1))    /* sqrt of diagonal element of V-C = std. error of the estimator */
. mat l se
symmetric se[1,1]
        c1
   r1  .989

. hettest                      /* test using fitted values of logy */

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: fitted values of logy
        chi2(1)     =     .9
        Prob > chi2 =   .676

. hettest edu                  /* test using edu */

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: edu
        chi2(1)     =     .76
        Prob > chi2 =   .843

. hettest, rhs                 /* test using all RHS vars */
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: edu exp exp2
        chi2(3)     =    3.34
        Prob > chi2 =   .345

. * Heteroskedastic error term: variance is a function of edu
. g e_a=sqrt(edu)*e
. gr e_a edu, xlab ylab yline() t("heter=f(edu)") saving($pathc\e4,replace)
. g logy_a=7.6 + edu*.7 + exp*. - exp2*.5 + e_a
. reg logy_a edu exp exp2

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  5.94
       Model |   53.75964      3    7.998675        Prob > F      =  .
    Residual |     4.999      36      .45989        R-squared     =  .9
-------------+------------------------------        Adj R-squared =  .5
       Total |    454.755     39      .47664        Root MSE      =  .6

      logy_a |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |     .64565         .65     6.4    .          .4396       .84869
         exp |      .9537         .98     .97    .33       -.9765       .87699
        exp2 |      -.398         .375    -.68   .94        -.8637        .677
       _cons |    7.69345       .5444    49.88   .          7.3996     7.995884

. predict logy_ah
(option xb assumed; fitted values)
. predict res_a, res
. g res_a2=res_a^2
. gr res_a2 logy_ah, xlab ylab yline() t("heter=f(edu)") saving($pathc\e5,replace)
. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: fitted values of logy_a
        chi2(1)     =    6.79
        Prob > chi2 =      .

. hettest edu

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: edu
        chi2(1)     =    3.74
        Prob > chi2 =      .

. reg res_a2 edu, noc          /* Note that the coefficient on edu = var(e) */

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(1, 39)      =  7.55
       Model |    749.786      1     749.786        Prob > F      =  .
    Residual |  5494.4973     39   .56855995        R-squared     =  .3336
-------------+------------------------------        Adj R-squared =  .333
       Total |   843.9334      4     3.85339        Root MSE      =  .67

      res_a2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |     .93363       .8534     3.7    .        .877656       .98957

. * Heteroskedastic error term: variance = f(external variable)
. g x=uniform()                /* Generate uniformly distributed variable x */
. g e_b=e*(x+.)                /* Heteroskedastic error: variance = f(external variable x) */
. gr e_b x, xlab ylab yline() t("heter=f(x)") saving($pathc\e6,replace)
. g logy_b=7.6 + edu*.7 + exp*. - exp2*.5 + e_b
. reg logy_b edu exp exp2

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  66.37
       Model |   64.8938       3     .633346        Prob > F      =  .
    Residual |   69.8589      36     .375483        R-squared     =  .486
-------------+------------------------------        Adj R-squared =  .488
       Total |  34.74995      39    .6996688        Root MSE      =  .885

      logy_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |    .683546        .759    39.4    .          .6499       .77884
         exp |        .76       .6733      7.    .           .844         .54
        exp2 |      -.4879         .45    -.4    .          -.5673        -.484
       _cons |    7.64937       .6397    89.8    .         7.57334      7.67653

. hettest

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: fitted values of logy_b
        chi2(1)     =     3.8
        Prob > chi2 =      .5

. hettest, rhs

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: edu exp exp2
        chi2(3)     =    8.93
        Prob > chi2 =      .3

. hettest x

Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
        Ho: Constant variance
        Variables: x
        chi2(1)     =   85.34
        Prob > chi2 =      .
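The hettest calls above are the Breusch-Pagan / Cook-Weisberg procedure: regress the squared OLS residuals on the variables suspected of driving the variance and compare a scaled fit statistic to a chi-squared critical value. A minimal sketch of one common form (Koenker's n*R^2 version; simulated data, all names and coefficients illustrative, not the dataset above):

```python
# Sketch of the Breusch-Pagan idea behind hettest: auxiliary regression
# of squared residuals on the suspect variable, LM = n * R^2 ~ chi2(1).
import random

random.seed(2)
n = 4000
x = [random.uniform(0, 1) for _ in range(n)]
e = [random.gauss(0, 0.1 + xi) for xi in x]   # error sd grows with x
y = [1.0 + 2.0 * xi + ei for xi, ei in zip(x, e)]

def ols(x, y):
    """Simple-regression OLS: returns intercept, slope, residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return a, b, [yi - a - b * xi for xi, yi in zip(x, y)]

_, _, res = ols(x, y)
u2 = [r * r for r in res]                     # squared residuals

# auxiliary regression of u^2 on x; LM = n * R^2
_, _, aux_res = ols(x, u2)
mu = sum(u2) / n
ss_tot = sum((u - mu) ** 2 for u in u2)
ss_res = sum(r * r for r in aux_res)
lm = n * (1 - ss_res / ss_tot)
print(lm)   # far above the chi2(1) 5% critical value of 3.84
```

For homoskedastic errors the same statistic hovers near its chi2(1) mean of 1, which is why the "no heter" regression above is not rejected while the f(edu) and f(x) designs are.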
. gr using $pathc\e3 $pathc\e4 $pathc\e5 $pathc\e6

[Figure: four residual plots - "no heter" (res2 vs. fitted values), "heter=f(edu)" (e_a vs. edu), "heter=f(edu)" (res_a2 vs. fitted values), "heter=f(x)" (e_b vs. x)]
. * Heteroskedasticity-robust estimate of the coefficient V-C matrix: the sandwich estimator
. reg logy_b edu exp exp2, robust     /* Robust estimation of V-C matrix */

Regression with robust standard errors              Number of obs =  63
                                                    F(3, 59)      =  574.8
                                                    Prob > F      =  .
                                                    R-squared     =  .453
                                                    Root MSE      =  .9

             |             Robust
      logy_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |     .67993        .935    35.5    .          .6494         .779
         exp |       .999       .7577     5.3    .         .57439         .638
        exp2 |      -.4475        .433    -.33   .           -.535        -.366
       _cons |    7.65633      .83748    69.8    .         7.6388        7.7678

. mat Vreg=e(V)                /* Robust coef. V-C matrix */
. mat l Vreg
symmetric Vreg[4,4]
                edu        exp       exp2      _cons
  edu     3.735e-6
  exp    -4.4e-8    3.9e-6
 exp2     .573e-9  -7.36e-8   .878e-9
_cons      -.4498      -.56  5.474e-7      .853

. reg logy_b edu exp exp2, mse1     /* OLS without robust V-C; mse1 sets s^2=1, so e(V)=(X'X)^-1 */

      Source |       SS       df       MS           Number of obs =  63
-------------+------------------------------        F(3, 63)      =  .6
       Model |   64.85838      3     .693439        Prob > F      =  .
    Residual |   78.8437      63     .364577        R-squared     =  .453
-------------+------------------------------        Adj R-squared =  .456
       Total |    43.749      6      .664667        Root MSE      =  1

      logy_b |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |     .67993       .9965     6.84    .         .48457      .873858
         exp |       .999        .956       .    .36        -.8765        .7469
        exp2 |      -.4475         .55     -.98   .47        -.8897     -5.38e-6
       _cons |    7.65633       .4683      5.5    .          7.3684      7.94394

. predict res_b, res
. mat D=e(V)                   /* with mse1 this is (X'X)^-1, the "bread" of the sandwich */
. mat l D
symmetric D[4,4]
                edu        exp       exp2      _cons
  edu       .9854
  exp    -.57e-6     .8384
 exp2      5.6e-8  -.995e-6   5.83e-8
_cons      -.8645     -.6989     .473      .5544
. matrix accum M = edu exp exp2 [iweight=res_b^2]    /* Salami for the Sandwich */
. mat l M
symmetric M[4,4]
                edu        exp       exp2      _cons
  edu        .856
  exp     9469.987   3934.94
 exp2      47586.4    56954.     34376
_cons    957.88854   64.5539   3934.94   78.8437

. mat V=e(N)/(e(N)-e(df_m)-1)*D*M*D    /* Sandwich: (X'X)^-1 X'WX (X'X)^-1 with df correction */
. mat l V
symmetric V[4,4]
                edu        exp       exp2      _cons
  edu     3.735e-6
  exp    -4.4e-8    3.9e-6
 exp2     .573e-9  -7.36e-8   .878e-9
_cons      -.4498      -.56  5.474e-7      .853

Compare matrices V and Vreg: they are identical.
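For a single regressor the matrix sandwich collapses to scalars, which makes the construction above easy to see: the "bread" is 1/Sxx and the salami is the sum of (x_i - xbar)^2 * e_i^2. A sketch with simulated heteroskedastic data (illustrative names and design, no small-sample df correction):

```python
# Sketch of the sandwich (robust) variance for a simple regression,
# compared with the classical OLS variance s^2 / Sxx.
import random

random.seed(3)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, abs(xi)) for xi in x]     # variance rises with |x|
y = [0.5 + 1.0 * xi + ei for xi, ei in zip(x, e)]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
a = my - b * mx
res = [yi - a - b * xi for xi, yi in zip(x, y)]

s2 = sum(r * r for r in res) / (n - 2)
var_classic = s2 / sxx                         # classical OLS formula
meat = sum(((xi - mx) ** 2) * (ri ** 2) for xi, ri in zip(x, res))
var_robust = meat / sxx ** 2                   # sandwich: bread*meat*bread
print(var_robust / var_classic)                # well above 1 here
```

Under this design the robust variance is roughly three times the classical one: the classical formula understates the true sampling variance because the largest error variances sit exactly where the leverage is highest.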
MEASUREMENT ERROR

Measurement Error in the Dependent Variable

(1)  y* = β0 + β1 x1 + β2 x2 + ... + βK xK + u
(2)  y  = y* + e0

Suppose equation (1) represents the population model. Instead of observing y*, we observe y, i.e. y* plus measurement error e0, where E[e0] = 0.

To see the implications of measurement error in y, plug eq. (2) into eq. (1):

     y = β0 + β1 x1 + β2 x2 + ... + βK xK + (u + e0),   v = u + e0
     Var(v) = σu² + σe0²        (requires Cov(u, e0) = 0)

The OLS estimators of the βj will be affected to the extent that the composite error v is correlated with the explanatory variables. If the measurement error e0 is correlated with the xj, the OLS estimators will be biased and inconsistent. Under the classical errors-in-variables assumption, Cov(xj, e0) = 0, so v and the xj are uncorrelated: OLS remains consistent, and only the error variance (and hence the standard errors) increases.

Measurement Error in an Explanatory Variable (K = 1)

(3)  y  = β0 + β1 x1* + u
(4)  x1 = x1* + e1

Suppose instead that the explanatory variable is measured with error, that is, we observe x1 instead of x1* in equation (3). (Again, E[e1] = 0.)

To see the implications, plug eq. (4) into eq. (3):

     y = β0 + β1 x1 + (u − β1 e1),   v = u − β1 e1

     plim β̂1 = β1 + Cov(x1, v)/Var(x1)

The OLS estimator of β1 will be affected to the extent that the composite error v is correlated with x1. Under the classical errors-in-variables assumption, Cov(x1*, e1) = 0. Thus

     Cov(x1, v) = E[(x1* + e1)(u − β1 e1)] = −β1 σe1²
     Var(x1) = Var(x1*) + Var(e1) = σx*² + σe1²

     plim β̂1 = β1 · σx*² / (σx*² + σe1²)

so the OLS estimator is inconsistent and (asymptotically) biased toward zero. The factor multiplying β1 is called the attenuation bias (it is always < 1). When K > 1 and x1 is the only mismeasured variable,

     plim β̂1 = β1 · σr*² / (σr*² + σe1²)

where r* is the population error from the regression of x1* on all the other explanatory variables.
. =========== ERRORS IN VARIABLES ===========

. * Case A: Error in logy
. g error=invnorm(uniform())   /* Measurement error */
. g logyx=logy+.*error         /* logy with error */
. dotplot logy logyx, ny(5) saving($pathc\e7,replace)
. gr logy logyx logy, xlab ylab s(op) saving($pathc\e8,replace)
. reg logy edu exp exp2        /* Model without error */

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  .
       Model |   6.76933       3      .9344         Prob > F      =  .
    Residual |   .746788      36     .949898        R-squared     =  .36
-------------+------------------------------        Adj R-squared =  .35
       Total |   65.4537      39      .4777         Root MSE      =  .389

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |    .675944       .989      .66    .         .67447      .73444
         exp |       .987        .857    3.93    .          .5683       .6789
        exp2 |      -.468         .69   -6.78    .          -.635       -.338
       _cons |    7.63638        .448   7.37    .        7.548485       7.748

. reg logyx edu exp exp2       /* Model with error in logy: note the edu coefficient is unchanged; only the std. errors and R2 change */

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  7.44
       Model |   7.6538       3     3.388343        Prob > F      =  .
    Residual |    93.66       36      .37675        R-squared     =  .93
-------------+------------------------------        Adj R-squared =  .9
       Total |   363.886      39    .69836973       Root MSE      =  .3744

       logyx |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |      .737      .35866     9.6    .          .6379      .773463
         exp |       .547       .3476    3.36    .          .4788        .865
        exp2 |       -.55         .83    -6.3    .           -.663       -.3378
       _cons |    7.63564       .5389     4.7    .         7.57877       7.795
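Case A in miniature: a sketch (simulated data, illustrative coefficients, not the dataset above) showing that mean-zero measurement error added to the dependent variable leaves the OLS slope essentially unchanged and only inflates the noise:

```python
# Sketch of Case A: measurement error in y is absorbed into the
# composite error u + e0, so the slope estimate is still consistent.
import random

random.seed(4)
n = 20_000
x = [random.gauss(12, 5) for _ in range(n)]                    # "edu"
y = [7.6 + 0.07 * xi + random.gauss(0, 0.3) for xi in x]       # clean logy
yx = [yi + random.gauss(0, 0.5) for yi in y]                   # logy + error

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

print(slope(x, y), slope(x, yx))   # both close to the true 0.07
```

Only the residual variance (and so the standard errors and R-squared) differs between the two regressions, exactly the pattern in the logy vs. logyx output above.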
. * Case B: Stochastic error in edu
. g edux=edu+error             /* Education years with error */
. dotplot edu edux, ny(5) saving($pathc\e9,replace)
. gr edu edux edu, xlab(,9,3,8) ylab(,9,3,8) s(op) saving($pathc\e10,replace)
. reg logy edu exp exp2

    Residual |   .746788      36     .949898        R-squared     =  .36

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |    .675944       .989      .66    .         .67447      .73444
...

. reg logy edux exp exp2       /* See that the edu coefficient is smaller */

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  4.63
       Model |  43.783947      3     4.59436        Prob > F      =  .
    Residual |   .6737        36      .37787        R-squared     =  .649
-------------+------------------------------        Adj R-squared =  .638
       Total |   65.4537      39      .4777         Root MSE      =  .35

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        edux |    .388487        .99     6.95    .         .34354      .433434
         exp |      .4975        .984     3.5    .          .4658         .634
        exp2 |      -.4398         .7     -6.    .           -.583        -.983
       _cons |   7.979665      .39455      .7    .          7.998       8.5733

. corr edux error, cov         /* Bias ~ COV(edux,error)/VAR(edux) */

             |     edux     error
-------------+------------------
        edux |   9.454
       error |   .5978    .99678

. gr using $pathc\e7 $pathc\e8 $pathc\e9 $pathc\e10
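Case B in miniature: classical measurement error in the regressor shrinks the OLS slope by var(x*)/(var(x*) + var(e1)), the attenuation factor derived in the measurement-error section. A sketch with illustrative variances (not the garbled values printed above):

```python
# Sketch of Case B: attenuation bias from classical error in x.
# var(x*)=25 and var(e1)=9 are assumptions chosen for the illustration.
import random

random.seed(5)
n = 20_000
xs = [random.gauss(12, 5) for _ in range(n)]                   # true edu
y = [7.6 + 0.07 * xi + random.gauss(0, 0.3) for xi in xs]
xo = [xi + random.gauss(0, 3) for xi in xs]                    # observed edu

def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

atten = 25.0 / (25.0 + 9.0)            # theoretical attenuation factor
print(slope(xs, y), slope(xo, y), 0.07 * atten)
```

The slope on the error-ridden regressor lands close to the attenuated value 0.07 * 25/34 rather than the true 0.07, which is the qualitative pattern behind the smaller edux coefficient in the regression above.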
. * Case C: Systematic error = f(edu)
. g eduq=.8*edu                /* Education years with systematic error */
. gr edu eduq edu, xlab(,9,3,8) ylab(,9,3,8) saving($pathc\e11,replace)
. dotplot edu eduq, ny(5) saving($pathc\e12,replace)
. reg logy edu exp exp2        /* Pure regression */

      Source |       SS       df       MS           Number of obs =  4
    Residual |   .746788      36     .949898        R-squared     =  .36

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         edu |    .675944       .989      .66    .         .67447      .73444
...

. reg logy eduq exp exp2       /* See that the edu coefficient is larger */

      Source |       SS       df       MS           Number of obs =  4
-------------+------------------------------        F(3, 36)      =  .
       Model |    6.7695       3       .935         Prob > F      =  .
    Residual |   .746787      36     .949898        R-squared     =  .36
-------------+------------------------------        Adj R-squared =  .35
       Total |   65.4537      39      .4777         Root MSE      =  .389

        logy |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        eduq |     .84499       .3786      .66    .          .7788         .985
         exp |       .987        .857     3.93    .           .5683        .6789
        exp2 |      -.468         .69    -6.78    .           -.635        -.338
       _cons |    7.63638        .448    7.37    .         7.548485        7.748
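Case C is pure algebra rather than sampling: replacing edu by .8*edu divides the slope by .8, since cov(y, .8x)/var(.8x) = (1/.8) * cov(y, x)/var(x), while fit, t-statistics, and all other coefficients are unchanged. A tiny deterministic check (made-up numbers):

```python
# Sketch of Case C: rescaling a regressor by a constant c rescales its
# OLS slope by exactly 1/c. The data points are invented for illustration.
def slope(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

edu = [8, 10, 12, 14, 16, 18]
logy = [8.0, 8.2, 8.3, 8.5, 8.8, 8.9]
eduq = [0.8 * e for e in edu]
print(slope(edu, logy), slope(eduq, logy))   # second = first / 0.8
```

This is why the eduq coefficient in the output above is exactly the edu coefficient divided by .8 while R-squared and the t-statistic are identical.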
. gr using $pathc\e11 $pathc\e12

[Figure: scatter of eduq vs. edu (slope .8) and dotplots of edu and eduq]

. locpoly logy edux, plot(scatter logy edu)

[Figure: local polynomial smooth of logy on edux, overlaid on the scatter of logy vs. edu]
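The locpoly smooth above fits, at each grid point, a kernel-weighted polynomial regression and reports the fitted value there. A sketch of the degree-1 (local linear) case with a Gaussian kernel, on simulated data with an illustrative bandwidth:

```python
# Sketch of local polynomial (degree 1) smoothing, as locpoly does:
# at each grid point x0, run a kernel-weighted linear regression and
# take the fitted value at x0. Bandwidth h = 1.5 is an assumption.
import math, random

random.seed(6)
n = 500
x = [random.uniform(0, 20) for _ in range(n)]
y = [7.6 + 0.07 * xi + random.gauss(0, 0.3) for xi in x]

def locpoly1(x, y, x0, h):
    """Local linear fit at x0 with Gaussian kernel weights."""
    w = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my + b * (x0 - mx)           # fitted value at the grid point

smooth = [locpoly1(x, y, g, 1.5) for g in range(2, 19)]
```

Because the underlying relationship here is linear, the smooth tracks 7.6 + 0.07*x closely; on real data the interest is precisely in the departures from such a global line.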