Lecture notes to Stock and Watson chapter 4

Introductory linear regression
Tore Schweder, Sept 2009 (TS, LN3, 03/09)

Regression
The term "regression" is due to Francis Galton (1822-1911): how is a son's height related to his father's height?

Regression towards the mean
Figure: Galton's original diagram. Parent-child pairs by height.

The geometry of linear regression
Figure: The regression curve is $y = f(x) = E[Y \mid X = x]$.

The linear regression model
Question: how is a response variable Y related to a stimulus variable X (also called an explanatory or control variable)? Assuming a linear relation,
1. what is the slope?
2. is the slope positive?
3. how well does the estimated line fit the observed data?
Example: Y = growth in GDP, X = inflation in the previous year.
D (data): a sample of size n of pairs of explanatory variable X and response variable Y.
M (model): $(X_1, Y_1), \dots, (X_n, Y_n)$ is an iid random sample from an infinite population, with
$Y = \beta_0 + \beta_1 X + u$; $E[Y \mid X = x] = \beta_0 + \beta_1 x$, $E u = 0$, $\mathrm{cov}(u, X) = 0$.
$\beta_0, \beta_1$ are parameters.

OLS
Least squares and regression were known to Laplace and Gauss (80 years before Galton).
$(\hat\beta_0, \hat\beta_1) = \arg\min_{b_0, b_1} \sum_{i=1}^{n} \left( Y_i - (b_0 + b_1 X_i) \right)^2$
$\hat\beta_1 = \dfrac{s_{XY}}{s_X^2} = r \, \dfrac{s_Y}{s_X}, \qquad \hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}$
Predicted response given the stimuli: $\hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i$ is the fitted value.
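The closed-form OLS formulas can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `ols_simple` and the simulated data (coefficients 1.0 and 0.1, noise standard deviation 0.5) are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def ols_simple(x, y):
    """Return (b0, b1) minimizing sum((y - (b0 + b1 * x))**2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance s_XY
    s_xx = np.var(x, ddof=1)            # sample variance s_X^2
    b1 = s_xy / s_xx                    # slope: s_XY / s_X^2
    b0 = y.mean() - b1 * x.mean()       # intercept: Ybar - b1 * Xbar
    return b0, b1

# Hypothetical data: the true line and noise level are made up for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(20.0, 35.0, size=40)
y = 1.0 + 0.1 * x + rng.normal(0.0, 0.5, size=40)

b0, b1 = ols_simple(x, y)

# Equivalent slope formula: r * s_Y / s_X
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * np.std(y, ddof=1) / np.std(x, ddof=1)

fitted = b0 + b1 * x                    # fitted values Y-hat_i
print(b0, b1, b1_alt)
```

Both slope expressions agree up to rounding, and the residuals `y - fitted` sum to zero because the fitted line passes through the point of means.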

Measures of fit
Homoscedasticity: $\mathrm{var}(Y \mid X = x) = \sigma^2$ is independent of x.
Then $\hat\sigma^2 = \dfrac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \dfrac{1}{n-2} RSS$ is unbiased.
$TSS = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = RSS + ESS$
$R^2 = \dfrac{ESS}{TSS} = 1 - \dfrac{RSS}{TSS} = r_{XY}^2$
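The decomposition TSS = RSS + ESS and the identity between R^2 and the squared correlation can be verified on any small data set. This is a minimal sketch, assuming NumPy; the function name `fit_measures` and the six data points are made up for illustration.

```python
import numpy as np

def fit_measures(x, y):
    """Compute RSS, ESS, TSS, the variance estimate and R^2 for simple OLS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
    ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
    tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
    return {
        "RSS": rss,
        "ESS": ess,
        "TSS": tss,
        "sigma2_hat": rss / (n - 2),        # unbiased under homoscedasticity
        "R2": ess / tss,
        "r2_XY": np.corrcoef(x, y)[0, 1] ** 2,
    }

# Hypothetical data, chosen only to illustrate the identities.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2]
m = fit_measures(x, y)
print(m)
```

The divisor n - 2 reflects the two estimated parameters, the intercept and the slope.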

Be aware of extreme observations!
Figure: The OLS line hinges on one point.
Figure: Heavy tails in the conditional distribution of $Y \mid X$ distort the OLS line.
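How strongly a single extreme point can move the OLS line is easy to see numerically. The sketch below assumes NumPy and uses made-up data: ten points lying exactly on y = 2 + 0.5x, plus one hypothetical observation far out in x that lies well below that line.

```python
import numpy as np

def ols_slope(x, y):
    """OLS slope s_XY / s_X^2 for simple regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# Ten made-up points exactly on the line y = 2 + 0.5 x.
x_clean = np.arange(10, dtype=float)
y_clean = 2.0 + 0.5 * x_clean

# One hypothetical extreme observation, far to the right and far below the line.
x_all = np.append(x_clean, 30.0)
y_all = np.append(y_clean, 2.0)

slope_clean = ols_slope(x_clean, y_clean)
slope_with_outlier = ols_slope(x_all, y_all)
print(slope_clean, slope_with_outlier)
```

Because squared deviations in x weight the slope, a point far from the mean of X has high leverage, and here it pulls the fitted slope well below the slope of the other ten points.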

Desired number of children and age for beginning master students

Year   n    Mean children   Mean age   OLS line              ρ       R^2
2008   38   1.92            23.8       C = 1.88 + 0.002 A    0.004   0.00002
2009   29   2.13            25.4       C = 1.61 + 0.021 A    0.06    0.004

Figure: scatter plots of Children against Age, one for 2008 and one for 2009. One vertical axis is labeled Children + rnorm(30, 0, 0.1), i.e. the points are jittered.

Desired number of children and number of siblings, 2009

Figure: STATA scatter diagram of Children vs. Siblings, with OLS line (fitted values). Some points represent several observations.
Is the (slightly) positive relation between Age and Children due to Siblings being an intermediate variable? Age → Siblings → Children, but controlling for Siblings, no effect of Age on Children?

Constructed model and simulated data
Figure (four panels; axes: x = age, y = number of children). UL: $E[Y \mid X = x] = -2.02 + 0.101x$, with $P[Y = y \mid X = x]$ shown by thick horizontal lines for some values of x; UR: the same, but with the axes interchanged; LL: scatter plot of a simulated sample of size n = 43, with slightly jittered points and the OLS line $y = -1.9 + 0.097x$; LR: $\mathrm{var}[Y \mid X = x]$.

Constructed model - repeated simulations
Figure: Approximate densities of $\hat\beta_0$ (UL), $\hat\beta_1$ (UR), $T_0 = (\hat\beta_0 - \beta_0)/SE(\hat\beta_0)$ (LL), and $T_1 = (\hat\beta_1 - \beta_1)/SE(\hat\beta_1)$ (LR), based on 1000 replicates of simulated data of size n = 43 in the constructed model for (X, Y). The dotted curve is the normal density.
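A repeated-simulation experiment of this kind can be sketched as follows. This is not the constructed model of the lecture, whose full specification is not given here: as a stand-in it assumes X uniform on [20, 40], normal errors with standard deviation 1, and the homoscedasticity-only standard error formulas. The coefficients are borrowed from the figure caption, reading the intercept as -2.02. The function name `simulate_t_stats` is hypothetical.

```python
import numpy as np

def simulate_t_stats(beta0, beta1, n, reps, seed=0):
    """Simulate OLS estimates and their standardized versions T0, T1."""
    rng = np.random.default_rng(seed)
    b0_hat = np.empty(reps)
    b1_hat = np.empty(reps)
    t0 = np.empty(reps)
    t1 = np.empty(reps)
    for k in range(reps):
        # Assumed stand-in design: uniform ages, homoscedastic normal errors.
        x = rng.uniform(20.0, 40.0, size=n)
        u = rng.normal(0.0, 1.0, size=n)
        y = beta0 + beta1 * x + u

        sxx = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
        b0 = y.mean() - b1 * x.mean()

        resid = y - (b0 + b1 * x)
        sigma2 = np.sum(resid ** 2) / (n - 2)

        # Homoscedasticity-only standard errors.
        se1 = np.sqrt(sigma2 / sxx)
        se0 = np.sqrt(sigma2 * (1.0 / n + x.mean() ** 2 / sxx))

        b0_hat[k], b1_hat[k] = b0, b1
        t0[k] = (b0 - beta0) / se0
        t1[k] = (b1 - beta1) / se1
    return b0_hat, b1_hat, t0, t1

b0_hat, b1_hat, t0, t1 = simulate_t_stats(-2.02, 0.101, n=43, reps=1000)
print(b0_hat.mean(), b1_hat.mean(), t0.std(), t1.std())
```

Histograms or normal QQ-plots of `t0` and `t1` can then be compared with the standard normal density, as in the figure.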

Constructed model - repeated simulations (cont)
Figure: Normal probability plots (QQ-plots against N(0,1)) of beta0.hat, beta1.hat, T0 and T1 for the same simulated material as in the previous figure.

Problems to be done in class
SW: 4.3