Applied Econometrics Professor Bernard Fingleton
Regression: a quick summary of some key issues
Some key issues
Text book: J.H. Stock & M.W. Watson, Introduction to Econometrics, 2nd Edition
Software: Gretl (gretl.sourceforge.net)
Course outline
Week 5: introduction to regression
Week 6: endogeneity & instrumental variables
Week 7: panel data
Week 8: spurious regression, Dickey-Fuller etc.
Week 9: cointegration and error correction
Week 10: autoregressive distributed lag models
Week 11: vector autoregression (VAR), vector error correction, multiple cointegrating vectors
Week 12: VAR, Johansen etc.
Regression
Regression is used to analyze how a single dependent variable (the Y variable) is affected by the values of one or more independent variables (also called regressors, X variables, or factors).
Multiple regression
Y = b0 + b1 X1 + b2 X2 + ... + b(k-1) X(k-1) + e
E(Y) = b0 + b1 X1 + b2 X2 + ... + b(k-1) X(k-1)
Ŷ = b̂0 + b̂1 X1 + b̂2 X2 + ... + b̂(k-1) X(k-1)
b1 is the change in E(Y) per unit change in X1
b(k-1) is the change in E(Y) per unit change in X(k-1)
Interpreting partial regression coefficients
b_i is the change in E(Y) per unit change in X_i, with all other variables held statistically constant.
E(Y) = b0 + b1 X1 + b2 X2
Assume we change X1 by an amount ΔX1 but keep X2 constant; this changes E(Y) to a new value:
new E(Y) = b0 + b1 (X1 + ΔX1) + b2 X2
new E(Y) - E(Y) = ΔE(Y) = b1 ΔX1
Thus if ΔX1 = 1, ΔE(Y) = b1.
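The algebra above can be checked numerically. The coefficient values below are made up purely for illustration; the point is only that moving X1 by one unit while X2 is held fixed moves E(Y) by exactly b1:

```python
# Numerical check of the partial-coefficient interpretation.
# Coefficient values are invented for illustration only.
b0, b1, b2 = 24.8, 0.94, -0.04

def expected_y(x1, x2):
    """E(Y) = b0 + b1*X1 + b2*X2"""
    return b0 + b1 * x1 + b2 * x2

# Increase X1 by one unit while holding X2 constant at 50
delta = expected_y(11, 50) - expected_y(10, 50)
print(round(delta, 10))  # equals b1 = 0.94 (up to floating-point error)
```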
Theory indicating X variables
output = f(labour, capital)
Adopt a Cobb-Douglas production function: output = A · labour^α · capital^β
ln(output) = ln(A) + α ln(labour) + β ln(capital)
If α + β > 1 we have increasing returns to scale: doubling inputs more than doubles output.
If α + β = 1 we have constant returns to scale.
Elasticity and the log-log model
elasticity = (dY/Y)/(dX/X): the % change in Y per 1% change in X
log Ŷ = b̂0 + b̂1 log X
Ŷ = exp(b̂0) X^b̂1 = B̂0 X^b̂1, where B̂0 = exp(b̂0)
dŶ/dX = b̂1 B̂0 X^(b̂1 - 1) = b̂1 (B̂0 X^b̂1)/X = b̂1 Ŷ/X
hence (dŶ/Ŷ)/(dX/X) = b̂1: the slope in a log-log regression is the elasticity
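A quick numerical sketch of this result, using invented noise-free data generated from Y = 2·X^0.7 so that the true elasticity is known to be 0.7; an OLS fit of log Y on log X should recover it:

```python
import numpy as np

# Hypothetical data from Y = 2 * X^0.7 (true elasticity 0.7, no noise)
X = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
Y = 2.0 * X ** 0.7

logX, logY = np.log(X), np.log(Y)
b1, b0 = np.polyfit(logX, logY, 1)  # slope and intercept of log Y on log X

print(round(b1, 6))          # 0.7 -- the elasticity
print(round(np.exp(b0), 6))  # 2.0 -- the multiplicative constant exp(b0)
```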
Data indicating X variables
Letting the data speak is a good way to obtain a realistic theory: we look at the data to identify the important variables; unimportant variables, indistinguishable from random variation, can be left in the error term.
Methods for choosing X variables
R²
t tests
F tests
R²
R² = 1 - Σê²/S_YY = [corr(Y, Ŷ)]²
indicates, on a scale from 0 to 1 (or 0% to 100%), how much of Y's variation is accounted for by the Xs contained in the regression model [this equation assumes that b0 is present]
R² DISADVANTAGES
R²'s probability distribution is not constant, making it difficult to objectively compare the R² of different models.
R² ALWAYS increases if additional (perhaps unimportant) variables are added to the model. Hence the most complex model always seems the best using R². BUT adjusted R² takes into account the number of Xs.
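The two definitions of R² given above (one minus the residual share of S_YY, and the squared correlation of Y with the fitted values) coincide whenever an intercept is included. A small sketch with invented data:

```python
import numpy as np

# Invented data; the model includes an intercept (b0 present)
X = np.column_stack([np.ones(8), [1, 2, 3, 4, 5, 6, 7, 8]])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3])

b = np.linalg.lstsq(X, Y, rcond=None)[0]  # OLS coefficients
Yhat = X @ b
e = Y - Yhat

r2_sse = 1 - (e @ e) / np.sum((Y - Y.mean()) ** 2)   # 1 - SSE/S_YY
r2_corr = np.corrcoef(Y, Yhat)[0, 1] ** 2            # corr(Y, Yhat)^2
print(round(r2_sse, 6), round(r2_corr, 6))  # the two definitions agree
```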
t-test
Say we wish to test whether a particular variable, X_i, should be included [in practice i could be, say, 2 if we were testing X_2]
H0: b_i = 0 [X_i has no effect on Y]
t = (b̂_i - b_i) / se(b̂_i) = b̂_i / se(b̂_i) under H0
where se(b̂_i) is the estimated standard error of b̂_i (in simple regression, σ̂/√S_XX)
The t ratio ~ t(T-k) when H0 is correct for the population
T = sample size, k = number of regression coefficients
Taiwanese agricultural output
The regression equation is ln output = -3.34 + 1.4988 ln labour + 0.4899 ln capital

Predictor    Coef           Stdev    t-ratio   p
Constant     b0 = -3.338    2.450    -1.36     0.198
ln labour    b1 = 1.4988    0.5398    2.78     0.017
ln capital   b2 = 0.4899    0.1020    4.80     0.000

For ln(labour): t = 1.4988/0.5398 = 2.78
From t tables, t_crit = 2.18 with T - k = 15 - 3 = 12 degrees of freedom; t_crit is the value (ignoring sign) with p-value 0.05 in the t(12) distribution.
Since t > t_crit (i.e. the p-value for t, 0.017, is < 0.05), reject H0 that b1 = 0.
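The decision rule above can be reproduced directly from the numbers reported in the output:

```python
# Reproduce the t-ratio for ln(labour) from the Taiwanese output above
coef, se = 1.4988, 0.5398
t = coef / se
print(round(t, 2))  # 2.78

# T = 15 observations, k = 3 coefficients, so df = T - k = 12;
# the two-sided 5% critical value from t tables is 2.18
T, k = 15, 3
df = T - k
t_crit = 2.18
print(t > t_crit)  # True: reject H0 that b1 = 0
```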
Analysis of Variance (ANOVA)
F test: jointly testing a group of Xs
H0: b1 = b2 = ... = b(k-1) = 0
The test statistic is the F ratio calculated from an ANOVA table. This is calculated automatically whenever a regression model is fitted to a data set.
F test of a group of X variables
H0: b1 = b2 = ... = b(k-1) = 0; H_A is that H0 is untrue
F = [(S_YY - D)/(k - 1)] / [D/(T - k)] ~ F(k-1, T-k) assuming H0 is true
D = Σ(Y_i - Ŷ_i)² = Σê_i²
S_YY = Σ(Y_i - Ȳ)²
The regression equation is cons = 24.8 + 0.942 income - 0.0424 wealth

Predictor   Coef       Stdev     t-ratio   p
Constant    24.775     6.752      3.67     0.008
income      0.9415     0.8229     1.14     0.290
wealth     -0.04243    0.08066   -0.53     0.615

R-sq = 0.964

Analysis of Variance (ANOVA table)
SOURCE       DF   SS       MS       F       p
Regression    2   8565.6   4282.8   92.40   <0.0001
Error         7    324.4     46.3
Total         9   8890.0

F = 92.4; the p-value in F(2,7) is <0.0001
Analysis of Variance (ANOVA table)
SOURCE       DF    SS          MS
Regression   k-1   S_YY - D    (S_YY - D)/(k-1)
Error        T-k   D = Σê²     D/(T-k)
Total        T-1   S_YY
F = [(S_YY - D)/(k-1)] / [D/(T-k)]

Analysis of Variance (ANOVA table)
SOURCE       DF   SS       MS       F       p
Regression    2   8565.6   4282.8   92.40   <0.0001
Error         7    324.4     46.3
Total         9   8890.0
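Plugging the sums of squares from the consumption example into the ANOVA formulas recovers the reported F ratio:

```python
# Rebuild the F ratio from the ANOVA table of the consumption example:
# T = 10 observations, k = 3 coefficients, so df are k-1 = 2 and T-k = 7
S_YY = 8890.0   # total sum of squares
D = 324.4       # residual (error) sum of squares
T, k = 10, 3

explained_ms = (S_YY - D) / (k - 1)  # 8565.6 / 2 = 4282.8
residual_ms = D / (T - k)            # 324.4 / 7, about 46.3
F = explained_ms / residual_ms
print(round(F, 1))  # 92.4, matching the table
```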
Interpreting the F test
Since the p-value for F = 92.4 is < 0.05, H0 is rejected.
This means H0: b1 = b2 = ... = b(k-1) = 0 is rejected, indicating that one or more b_i is unequal to zero.
Multicollinearity
b (and hence t) for any one variable will change as X variable(s) are added or subtracted, EXCEPT when the X variables are not correlated with each other, in which case t is the same in bivariate and multiple regression.
Multicollinearity: consumption, income and wealth example
A) The regression equation is consumption = 24.5 + 0.509 income

Predictor   Coef      Stdev     t-ratio   p
Constant    24.455    6.414      3.81     0.005
income      0.50909   0.03574   14.24     0.000

B) The regression equation is consumption = 24.8 + 0.942 income - 0.0424 wealth

Predictor   Coef       Stdev     t-ratio   p
Constant    24.775     6.752      3.67     0.008
income      0.9415     0.8229     1.14     0.290
wealth     -0.04243    0.08066   -0.53     0.615

Correlation of income and wealth = 0.999
Multicollinearity
In A) income is highly significant; similarly, wealth alone is highly significant. In B) wealth and income are both insignificant, yet paradoxically R² = 0.96. The reason for such extreme changes in the apparent effects is multicollinearity.
Multicollinearity
Multicollinearity means that the X variables are very highly correlated, so that they are not distinct. With severe multicollinearity the estimated b's become very unreliable: if we change X1 (say) slightly, b1 changes a lot, and the standard errors of the b's become very large. Hence we see low t values but a high R².
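The symptom described above (huge standard errors on individually insignificant coefficients despite a good overall fit) can be reproduced with synthetic data. This is an illustration only; the data and seed are invented:

```python
import numpy as np

# Two nearly identical regressors inflate the OLS standard errors
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # x2 is almost a copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
s2 = e @ e / (n - 3)
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))  # coefficient std errors

print(np.corrcoef(x1, x2)[0, 1] > 0.999)  # severe multicollinearity
print(se[1] > 10 * se[0])                 # collinear slopes have huge se's
```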
Multicollinearity Ln Halifax House price index, Greater London and Scotland Model 1: OLS, using observations 1983:2-2007:2 (T = 97) Dependent variable: lns coefficient std. error t-ratio p-value --------------------------------------------------------- const 1.75890 0.171794 10.24 5.12e-017 *** lngl 0.629605 0.0305174 20.63 7.26e-037 *** Mean dependent var 5.287020 S.D. dependent var 0.375992 Sum squared resid 2.476363 S.E. of regression 0.161453 R-squared 0.817532 Adjusted R-squared 0.815612 F(1, 95) 425.6400 P-value(F) 7.26e-37 Model 2: OLS, using observations 1983:2-2007:2 (T = 97) Dependent variable: lns coefficient std. error t-ratio p-value --------------------------------------------------------- const 1.10835 0.0988124 11.22 4.92e-019 *** lngl -0.376925 0.0653108-5.771 1.01e-07 *** lne_ro 1.15071 0.0723979 15.89 2.21e-028 *** Mean dependent var 5.287020 S.D. dependent var 0.375992 Sum squared resid 0.671550 S.E. of regression 0.084523 R-squared 0.950518 Adjusted R-squared 0.949465 F(2, 94) 902.8345 P-value(F) 4.36e-62
lns, lne_ro and lngl 7 lns lngl lne_ro 6.5 6 5.5 5 4.5 1985 1990 1995 2000 2005 42
Solutions to multicollinearity problems
Use less-correlated X variables, e.g. data for a longer/different time period, so that X1, X2, etc. become more separated.
Use the change in Y and X at each point in time rather than the levels of Y and X, since changes tend not to be as strongly correlated as levels.
Solutions to multicollinearity problems: use the change in Y, X at each point in time
The difference in logs equals the exponential (continuously compounded) growth rate:
X(t) = 105, X(t-1) = 100: growth 5%
ln X(t) = 4.65396, ln X(t-1) = 4.60517
ln X(t) - ln X(t-1) = 0.04879
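The 105-vs-100 example can be verified directly:

```python
import math

# X goes from 100 to 105: 5% discrete growth
x_prev, x_now = 100.0, 105.0

log_diff = math.log(x_now) - math.log(x_prev)
print(round(log_diff, 5))                # 0.04879 -- continuous growth rate
print(round(math.exp(log_diff) - 1, 2))  # 0.05 -- back to the 5% discrete rate
```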
Solutions to multicollinearity problems: use differences = growth with logs

Model 3: OLS, using observations 1983:2-2007:2 (T = 97)
Dependent variable: d_lns

             coefficient   std. error   t-ratio   p-value
  const       0.00749513   0.00346701    2.162    0.0332   **
  d_lngl     -0.176427     0.122453     -1.441    0.1530
  d_lne_ro    0.684976     0.138318      4.952    3.22e-06 ***

Mean dependent var 0.017016   S.D. dependent var 0.031070
Sum squared resid  0.069073   S.E. of regression 0.027108
R-squared          0.254643   Adjusted R-squared 0.238785
F(2, 94)           16.05706   P-value(F)         1.00e-06
Fitted values versus Scotland house price growth
[Figure: fitted values (fv) and d_lns plotted quarterly, 1985-2005]
Fitted values versus Scotland house price growth: with quarterly dummies

Model 4: OLS, using observations 1983:2-2007:2 (T = 97)
Dependent variable: d_lns

             coefficient   std. error   t-ratio   p-value
  const       0.00891064   0.00521555    1.708    0.0910   *
  d_lngl     -0.233089     0.112314     -2.075    0.0408   **
  d_lne_ro    0.551339     0.131062      4.207    6.06e-05 ***
  dq1        -0.0122192    0.00710341   -1.720    0.0888   *
  dq2         0.0233635    0.00754418    3.097    0.0026   ***
  dq3        -0.00229161   0.00721425   -0.3177   0.7515

Mean dependent var 0.017016   S.D. dependent var 0.031070
Sum squared resid  0.054784   S.E. of regression 0.024536
R-squared          0.408834   Adjusted R-squared 0.376353
F(5, 91)           12.58663   P-value(F)         2.70e-09
Observed Scotland growth and fitted values: with quarterly dummies
[Figure: fitted values (fv) and d_lns plotted quarterly, 1985-2005]
F test of seasonal effects

Model 5: OLS, using observations 1983:2-2007:2 (T = 97)
Dependent variable: d_lns

             coefficient   std. error   t-ratio   p-value
  const       0.00749513   0.00346701    2.162    0.0332   **
  d_lngl     -0.176427     0.122453     -1.441    0.1530
  d_lne_ro    0.684976     0.138318      4.952    3.22e-06 ***

Mean dependent var 0.017016   S.D. dependent var 0.031070
Sum squared resid  0.069073   S.E. of regression 0.027108
R-squared          0.254643   Adjusted R-squared 0.238785
F(2, 94)           16.05706   P-value(F)         1.00e-06
Log-likelihood     213.8572   Akaike criterion  -421.7143
Schwarz criterion -413.9902   Hannan-Quinn      -418.5911
rho               -0.164960   Durbin-Watson      2.314335

Comparison of Model 5 and Model 4:
Null hypothesis: the regression parameters are zero for the variables dq1, dq2, dq3
Test statistic: F(3, 91) = 7.9117, with p-value = 9.54953e-005
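Gretl's F statistic for the dummies can be rebuilt by hand from the two sums of squared residuals reported above (Model 3/5 restricted, Model 4 unrestricted), using F = [(SSR_r - SSR_u)/q] / [SSR_u/(T - k)]:

```python
# Rebuild the F test of the quarterly dummies from the reported SSRs
ssr_restricted = 0.069073    # Model 3/5: no quarterly dummies
ssr_unrestricted = 0.054784  # Model 4: with dq1, dq2, dq3
q = 3                        # number of restrictions (the three dummies)
df_resid = 91                # T - k for Model 4: 97 - 6

F = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_resid)
print(round(F, 2))  # 7.91, matching Gretl's F(3, 91) = 7.9117
```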
Summary
Statistical criteria:
R² gives the overall % of Y's variation accounted for by the X variables
t test for testing the significance of individual Xs
F test for testing whether groups of Xs should be present in the model
Multicollinearity is a problem that occurs when we have highly correlated X variables (as often occurs in time series). Solve by reducing the correlation, via differencing and/or extra data.