Topic 1: Multivariate Regression, Part I
ARE/ECN 240A Graduate Econometrics
Professor: Òscar Jordà
Outline of this topic
Statement of the objective: we want to explain the behavior of one variable as a function of other variables.
Typical assumptions and why they are needed.
Three approaches to pursue the objective given the assumptions:
Method of Moments
Ordinary Least Squares
Maximum Likelihood Estimation
Objective
We continue our evaluation of how to improve schools using the California data.
We began by asking whether class size affects test scores, and we have data for both.
However, there may be other explanatory factors and/or policy variables (e.g. increasing overall expenditures per student).
Here is a summary of the data.
Policy Evaluation: Test Scores and Class Size
The California Test Score Data Set (caschool.dta STATA data file)
All K-6 and K-8 California school districts (n = 420)
Variables:
5th-grade test scores (Stanford-9 achievement test, combined math and reading), district average (testscr)
Student-teacher ratio = no. of students in the district divided by no. of full-time equivalent teachers (str)
District average parental income in thousands of dollars (avginc)
Average expenditures per student in dollars (expn_stu)
A look at the data
Basic statistics:

. sum testscr str expn_stu avginc

    Variable |   Obs        Mean    Std. Dev.        Min        Max
    ---------+------------------------------------------------------
     testscr |   420    654.1565    19.05335      605.55     706.75
         str |   420    19.64043    1.891812          14       25.8
    expn_stu |   420    5312.408    633.9371     3926.07   7711.507
      avginc |   420    15.31659     7.22589       5.335     55.328

Correlation matrix:

. correlate testscr str expn_stu avginc
(obs=420)

             |  testscr      str  expn_stu   avginc
    ---------+---------------------------------------
     testscr |   1.0000
         str |  -0.2264   1.0000
    expn_stu |   0.1913  -0.6200    1.0000
      avginc |   0.7124  -0.2322    0.3145   1.0000
Statement of the Population Problem
It is natural to postulate that:

$\text{testscr}_i = \text{constant} + f(\text{str}_i, \text{expn\_stu}_i, \text{avginc}_i) + \text{error}_i$ for $i = 1, \dots, 420$.

A good place to begin is to assume this relation is linear. Using the more general notation we will use in the course:

$y_i = x_{i1}\beta_1 + \cdots + x_{iK}\beta_K + \varepsilon_i$ for $i = 1, \dots, n$

The left-hand side is the endogenous or dependent variable, the $x$'s are the regressors or explanatory variables, and the $\varepsilon$'s are the residuals or error terms.
Some Features of the Linear Regression Model
To investigate the problem we collect a random sample of data $\{y_i, x_{i1}, \dots, x_{iK}\}_{i=1}^{n}$.

Some vector and matrix notation:

$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$ is $n \times 1$; $\quad x_j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{pmatrix}$ for $j = 1, \dots, K$;

$X_i = (x_{i1} \ \cdots \ x_{iK})$ is $1 \times K$; $\quad$ and $X = (x_1 \ \cdots \ x_K)$.

We use the convention that $x_1$ contains the constant term. So, in matrix notation, the linear regression model is

$y = X\beta + \varepsilon$
What we want, using statistical concepts
Assuming $y$ and $X$ have a joint distribution, we want to make statements about the conditional mean of $y$ given $X$. Notice that:

. sum testscr if avginc < 15

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |   255    645.1076    15.45333     605.55     683.4

. sum testscr if avginc > 40

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |     9    695.7333    7.055042      681.9    706.75

Mathematically:

$m(X) = E(y \mid X) = \int_{-\infty}^{\infty} y \, f(y \mid X)\, dy$
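A minimal Python sketch of the same calculation, assuming the data have been exported to a hypothetical caschool.csv with columns testscr and avginc (the course itself uses Stata/GAUSS):

import pandas as pd

df = pd.read_csv("caschool.csv")               # hypothetical export of caschool.dta
low = df.loc[df["avginc"] < 15, "testscr"]     # districts with average income below 15
high = df.loc[df["avginc"] > 40, "testscr"]    # districts with average income above 40

# group means are the sample analogue of the conditional mean m(X) = E(y | X)
print(low.count(), low.mean(), low.std())
print(high.count(), high.mean(), high.std())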
The Regression Error
Given the previous definition: $\varepsilon = y - m(X)$

This implies the following properties for the regression error:
1. $E(\varepsilon \mid X) = 0$
2. $E(\varepsilon) = 0$
3. $E(h(X)'\varepsilon) = 0$ for any function $h(\cdot)$
4. $E(X'\varepsilon) = 0$

For example, to prove the first property:

$E(\varepsilon \mid X) = E((y - m(X)) \mid X) = E(y \mid X) - E(m(X) \mid X) = m(X) - m(X) = 0$
Prediction: min MSE
The conditional mean has the property that it minimizes the mean squared error (MSE) out of any function $g(\cdot)$:

$E(y - g(X))^2 = E(\varepsilon + m(X) - g(X))^2$
$\qquad = E(\varepsilon^2) + 2E(\varepsilon(m(X) - g(X))) + E(m(X) - g(X))^2$
$\qquad = E(\varepsilon^2) + E(m(X) - g(X))^2$ (the cross term vanishes by iterated expectations, since $E(\varepsilon \mid X) = 0$)
$\qquad > E(\varepsilon^2)$ if $m(X) \neq g(X)$

Here I abuse notation to indicate that, e.g., $E(\varepsilon^2) = E(\varepsilon\varepsilon')$.
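A small simulation sketch (simulated data, not the California sample) of why the conditional mean wins in MSE terms:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
eps = rng.normal(size=n)
y = 1.0 + 2.0 * x + eps              # true conditional mean: m(x) = 1 + 2x

def mse(pred):
    return np.mean((y - pred) ** 2)

print(mse(1.0 + 2.0 * x))            # m(X): MSE close to Var(eps) = 1
print(mse(1.0 + 1.5 * x))            # any other g(X) does worse
print(mse(np.full(n, y.mean())))     # the unconditional mean does much worse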
Conditional Variance
Just as we consider the conditional mean, we may explore how the variance of $y$ varies with $X$:

$\Sigma(X) = V(y \mid X) = E(\varepsilon\varepsilon' \mid X)$

When it is the case that the variance is constant, so that

$\Sigma(X) = E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n$

we say the error term is homoscedastic; otherwise we say it is heteroscedastic.
Normality
If we assume $y$ and $X$ are jointly normally distributed, life gets easy (clearly a strong assumption).
That is because we can use the projection formula for the joint normal to obtain the conditional mean of $y_i$ given $X_i$. Here is how: if

$\begin{pmatrix} y_i \\ X_i' \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$

then

$E(y_i \mid X_i') = m(X_i') = \mu_y + \Sigma_{12}\Sigma_{22}^{-1}(X_i' - \mu_X)$
$V(y_i \mid X_i') = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$
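A simulation sketch of the projection formula, using a made-up bivariate normal rather than the class data: the implied slope $\Sigma_{12}\Sigma_{22}^{-1}$ matches the slope from regressing $y$ on $x$.

import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 1.0])                       # (mu_y, mu_x), assumed values
Sigma = np.array([[4.0, 1.5],
                  [1.5, 2.0]])                  # [[S11, S12], [S21, S22]], assumed
draws = rng.multivariate_normal(mu, Sigma, size=200_000)
y, x = draws[:, 0], draws[:, 1]

proj_slope = Sigma[0, 1] / Sigma[1, 1]          # Sigma_12 * Sigma_22^{-1}
ols_slope = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
print(proj_slope, ols_slope)                    # essentially identical
print(Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1])   # conditional variance V(y|x)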
Let's make some assumptions
1. Linearity: $y = X\beta + \varepsilon$
2. Full rank: $X$ is an $n \times K$ matrix with rank $K$
3. $E(\varepsilon \mid X) = \begin{pmatrix} E[\varepsilon_1 \mid X] \\ \vdots \\ E[\varepsilon_n \mid X] \end{pmatrix} = 0$; hence $E[\varepsilon] = 0$ and $E[y \mid X] = X\beta$
4. Homoscedasticity: $V(\varepsilon \mid X) = \sigma^2 I_n$, hence $V(\varepsilon_i \mid X) = \sigma^2$ and $\mathrm{Cov}(\varepsilon_i, \varepsilon_j \mid X) = 0$ for all $i \neq j$
5. Normality: $\varepsilon \mid X \sim N(0, \sigma^2 I_n)$
Checking the assumptions in the data
1. Linearity: could be problematic theoretically.

. sum testscr if str <= 17

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |    36    660.4139    23.03868     618.05     704.3

. sum testscr if str > 17 & str <= 20

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |   207    656.6229    18.56458     606.75    706.75

. sum testscr if str >= 22.8

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |    19    647.0395    16.94368     622.05    676.85

. sum testscr if str < 22.8 & str >= 19.8

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |   184    650.3753    17.83076     605.55     694.8

But testscr(14-17) - testscr(17-20) = 3.8 and testscr(20-23) - testscr(23-26) = 3.3: the drop in the average score is roughly constant across class-size bins, so linearity does not look unreasonable.
More Checking
2. X is full rank: this means that one (or more) regressors cannot be exact linear combinations of the others. Easiest is to check the correlation matrix of X:

. correlate str expn_stu avginc
(obs=420)

             |      str  expn_stu   avginc
    ---------+------------------------------
         str |   1.0000
    expn_stu |  -0.6200    1.0000
      avginc |  -0.2322    0.3145   1.0000

Later we will discuss slightly more sophisticated ways of checking this.
Final Checks
Assumptions 3 (residuals have zero conditional mean) and 4 (homoscedasticity) we cannot check just yet.
Assumption 5 is normality. This we can check with Jarque-Bera statistics and also by looking at some histogram/density plots.

[Density/histogram plot of the average test score (= (read_scr + math_scr)/2)]
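One way to run the Jarque-Bera check in Python (a sketch, assuming the same hypothetical caschool.csv; in class this would be done in Stata or GAUSS):

import pandas as pd
from scipy import stats

testscr = pd.read_csv("caschool.csv")["testscr"]   # hypothetical file/column names
res = stats.jarque_bera(testscr)                   # tests skewness/kurtosis against the normal
print(res.statistic, res.pvalue)                   # a small p-value is evidence against normality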
Why do we make these assumptions?
Linearity: not as strict as it sounds. Usual example, a Cobb-Douglas production function:

$Y = A L^{\beta_l} K^{\beta_k} \;\rightarrow\; \log(Y) = \log(A) + \beta_l \log(L) + \beta_k \log(K)$

which fits the linear form

$y_i = \beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \varepsilon_i$

Beyond that, we will discuss what to do with truly nonlinear specifications later. For now, linearity makes derivations very convenient by using projection arguments.
Multicollinearity
X is a full rank matrix: easy, we cannot really identify the parameters otherwise. An example: suppose three regressors such that $x_1 = x_2 + x_3$. Then

$y = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon$
$\quad = (x_2 + x_3)\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon$
$\quad = x_2(\beta_1 + \beta_2) + x_3(\beta_1 + \beta_3) + \varepsilon$
$\quad = x_2\gamma_2 + x_3\gamma_3 + \varepsilon$

which means that $\beta_1$, $\beta_2$ and $\beta_3$ cannot be separately identified. Mechanically, we run into numerical problems, as the sketch below illustrates.
Exact collinearity is easy to detect, but approximate collinearity can affect regression results as well.
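A numpy sketch of the mechanical problem with exact collinearity, using simulated regressors:

import numpy as np

rng = np.random.default_rng(2)
n = 100
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3                                   # exact collinearity: x1 = x2 + x3
X = np.column_stack([x1, x2, x3])

print(np.linalg.matrix_rank(X))                # 2, not 3
print(np.linalg.cond(X.T @ X))                 # enormous condition number
# np.linalg.inv(X.T @ X) would either fail or return numerically meaningless values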
Conditional mean-zero errors
This is a critical assumption: as we will see, it ensures that the model is properly specified and that the parameter estimates tend to their true values.
Reasons why this assumption may not hold in practice have to do with misspecification problems: e.g. omitted-variable bias, errors-in-variables, and endogeneity (which only really matters when we want to emphasize analysis of causal relations as opposed to simple correlations).
Homoscedasticity
This assumption is often violated. However, it is easy to relax. It will not affect the parameter estimates, but it will affect how their standard errors are calculated (i.e., the efficiency of the estimator).

. sum testscr if str <= 17

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |    36    660.4139    23.03868     618.05     704.3

. sum testscr if str > 17 & str <= 20

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |   207    656.6229    18.56458     606.75    706.75

. sum testscr if str >= 22.8

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |    19    647.0395    16.94368     622.05    676.85

. sum testscr if str < 22.8 & str >= 19.8

    Variable |   Obs        Mean    Std. Dev.       Min       Max
    ---------+----------------------------------------------------
     testscr |   184    650.3753    17.83076     605.55     694.8

Notice that the standard deviation of test scores is not the same across the class-size bins (roughly 23.0 down to 16.9), which hints that the error variance may not be constant.
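The same informal check in Python (a sketch, assuming the hypothetical caschool.csv with columns testscr and str, and bins roughly like those on the slide):

import pandas as pd

df = pd.read_csv("caschool.csv")                       # hypothetical file name
bins = [0, 17, 20, 22.8, 30]                           # approximate class-size bins
by_bin = df.groupby(pd.cut(df["str"], bins))["testscr"]
print(by_bin.agg(["count", "mean", "std"]))            # unequal std devs hint at heteroscedasticity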
Normality/Gaussianity
Assuming the data are Gaussian allows us to use well-known projection formulas and to derive finite-sample statistics.
However, the data are often not Gaussian. It turns out that using the thought experiment of increasing the sample size to infinity will allow us to use some probability limit theory under which the estimators will have a normal distribution.
Hence the importance of having a random sample.
Random Sample
Let $\{w_i\}_{i=1}^{n} = \{y_i, X_i\}_{i=1}^{n}$ be i.i.d. Then

$f(w_1, \dots, w_n) = f_1(w_1; \theta_1) \cdots f_i(w_i \mid w_{i-1}, \dots, w_1; \theta_i) \cdots f_n(w_n \mid w_{n-1}, \dots, w_1; \theta_n)$
$\qquad = f_1(w_1; \theta_1) \cdots f_i(w_i; \theta_i) \cdots f_n(w_n; \theta_n)$
$\qquad = f(w_1; \theta) \cdots f(w_n; \theta)$

i.e. notice the independence assumption in the first step (each conditional density collapses to a marginal) and the identical-distribution assumption in the second (each marginal is the same $f$ with the same $\theta$).
In time series, as long as the amount of dependence is limited, one can relax the independence assumption.
Where are we so far?
We have postulated a population model of how $y$ relates to $X$:

$y_i = \beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i, \quad i = 1, \dots, n$

We have a random sample. Now we want to obtain the distribution of the parameter estimates: its mean gives the parameter estimate, and knowing the full distribution is vital to do inference:

$\hat{\beta} \sim D(\beta, \Sigma_{\beta})$
Method of Moments
Let's try to figure out how to estimate $\beta$.
We will use the method of moments approach first. It is based on the analogy principle: translate a population moment condition into its equivalent sample moment condition (think LLN). For example:

$E(y_i - \mu_y) = 0$ and, by the LLN, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \mu_y) \to 0$ as $n \to \infty$; the sample analogue $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{\mu}_y) = 0$ gives

$\hat{\mu}_y = \frac{1}{n}\sum_{i=1}^{n} y_i$
Deriving the MM estimator for linear regression
Recall, one of the key assumptions in the linear regression model is:

$E(\varepsilon \mid X) = 0 \;\rightarrow\; E(X_i'\varepsilon_i) = 0 \;\rightarrow\; \frac{1}{n}\sum_{i=1}^{n} X_i'\varepsilon_i = \frac{X'\varepsilon}{n} = 0$

with $y = X\beta + \varepsilon$. Hence:

$E(X'\varepsilon) = E(X'(y - X\beta)) = 0 \;\rightarrow\; \frac{X'y}{n} - \frac{X'X}{n}\hat{\beta} = 0$

$\hat{\beta} = (X'X)^{-1}X'y$
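A numpy sketch of this moment-based formula on simulated data (the class example instead loads the California data, as in the GAUSS code later in this topic):

import numpy as np

rng = np.random.default_rng(3)
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])   # x_1 = constant
beta_true = np.array([670.0, -1.3, 1.9])                         # made-up coefficients
y = X @ beta_true + rng.normal(scale=13.0, size=n)

Sxx = X.T @ X / n                      # sample analogue of E(X_i' X_i)
Sxy = X.T @ y / n                      # sample analogue of E(X_i' y_i)
beta_hat = np.linalg.solve(Sxx, Sxy)   # solves (X'X/n) b = X'y/n
print(beta_hat)                        # close to beta_true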
Least Squares
[Scatter plot, "Linear Regression: Test Scores and Student-to-Teacher Ratio": average test score (= (read_scr + math_scr)/2) against the student-to-teacher ratio, with fitted values from the OLS line.]
Deriving the OLS estimator
Consider the problem of minimizing the distance of the observations from the regression line.
Since we care about the distance but not the sign of the error, we could use absolute values: this gives rise to the LAD estimator, but it is not convenient because the objective is not differentiable.
Instead, by squaring the distance, the objective function can be optimized using derivative methods.
Derivation of OLS
Objective:

$\min_{\beta} S(\beta) = E(\varepsilon_i^2) \;\rightarrow\; \min_{\beta} \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - X_i\beta)^2$

In matrix algebra:

$\min_{\beta} S(\beta) = \frac{\varepsilon'\varepsilon}{n} = \frac{(y - X\beta)'(y - X\beta)}{n}$

General result: suppose $f(\beta)$ is a real-valued scalar function of $\beta$. A necessary condition for a local optimum is

$\left.\frac{\partial f}{\partial \beta}\right|_{\beta = \hat{\beta}} = 0$
Derivation of OLS (cont.)
If the Hessian is positive semidefinite, then $\hat{\beta}$ is a local minimum.
Rules of matrix differentiation:

$\frac{\partial f}{\partial \beta} = \begin{pmatrix} \frac{\partial f}{\partial \beta_1} \\ \vdots \\ \frac{\partial f}{\partial \beta_K} \end{pmatrix}; \qquad \frac{\partial^2 f}{\partial \beta \partial \beta'} = \begin{pmatrix} \frac{\partial^2 f}{\partial \beta_1 \partial \beta_1'} & \cdots & \frac{\partial^2 f}{\partial \beta_1 \partial \beta_K'} \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial \beta_K \partial \beta_1'} & \cdots & \frac{\partial^2 f}{\partial \beta_K \partial \beta_K'} \end{pmatrix}$

$\frac{\partial A\beta}{\partial \beta'} = A; \qquad \frac{\partial \beta'A'}{\partial \beta} = A'; \qquad \frac{\partial \beta'A\beta}{\partial \beta} = (A + A')\beta; \qquad \frac{\partial \beta'A\beta}{\partial \beta'} = \beta'(A' + A)$
Derivation of OLS (cont.)
Recall:

$\min_{\beta} S(\beta) = \frac{\varepsilon'\varepsilon}{n} = \frac{(y - X\beta)'(y - X\beta)}{n} = \frac{y'y}{n} - \frac{\beta'X'y}{n} - \frac{y'X\beta}{n} + \frac{\beta'X'X\beta}{n}$

Applying the rules of matrix differentiation:

$\frac{\partial S(\beta)}{\partial \beta} = -\frac{X'y}{n} - \frac{X'y}{n} + \left(\frac{X'X}{n} + \frac{X'X}{n}\right)\beta = -2\frac{X'y}{n} + 2\frac{X'X}{n}\beta = 0$

$\hat{\beta} = (X'X)^{-1}X'y$

$\frac{\partial^2 S(\beta)}{\partial \beta \partial \beta'} = 2\frac{X'X}{n}$, which is positive definite.
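As a sanity check on the derivation, here is a sketch that minimizes $S(\beta)$ numerically with a general-purpose optimizer and compares the result with the closed form (simulated data):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)   # made-up coefficients

def S(b):
    e = y - X @ b
    return e @ e / n                   # S(beta) = (y - Xb)'(y - Xb)/n

res = minimize(S, x0=np.zeros(3))
beta_closed_form = np.linalg.solve(X.T @ X, X.T @ y)
print(res.x)                           # numerical minimizer
print(beta_closed_form)                # same values up to optimizer tolerance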
Remarks
The no-multicollinearity assumption ensures $X'X$ is invertible.
$\hat{\varepsilon} = y - X\hat{\beta} = y - X(X'X)^{-1}X'y = My$. $M$ is both symmetric and idempotent ($M = M'$ and $M = M^2$) and $MX = 0$.
$\hat{y} = y - \hat{\varepsilon} = (I - M)y = X(X'X)^{-1}X'y = Py$, where $P$ is called the projection matrix.
$X'\hat{\varepsilon} = X'My = 0$: by construction, the residuals are uncorrelated with the regressors.
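A short numpy sketch verifying these properties of $M$ and $P$ on simulated data:

import numpy as np

rng = np.random.default_rng(5)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)   # made-up coefficients

P = X @ np.linalg.solve(X.T @ X, X.T)                     # projection matrix
M = np.eye(n) - P                                         # annihilator matrix
e_hat = M @ y                                             # residuals
y_hat = P @ y                                             # fitted values

print(np.allclose(M, M.T), np.allclose(M, M @ M))         # symmetric, idempotent
print(np.allclose(M @ X, 0), np.allclose(X.T @ e_hat, 0)) # MX = 0 and X'e_hat = 0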
Maximum Likelihood Estimator
Assuming the random sample $\{y_i, X_i\}_{i=1}^{n}$ is normally distributed, and since the $\hat{\beta}$ are a linear combination of these, they will have a multivariate Gaussian distribution.
Further, we know that the residuals are mean zero. And under the assumption of homoscedasticity, their covariance matrix is $\Sigma = \sigma^2 I$.
The multivariate normal density is

$f(\varepsilon; \theta) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(\varepsilon - \mu)'\Sigma^{-1}(\varepsilon - \mu)\right\}$
MLE
Taking the log (to construct the log-likelihood function) and using the assumptions of the linear regression model:

$L(\varepsilon; \theta) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\varepsilon'\varepsilon = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$

Take derivatives with respect to $\beta$ and $\sigma^2$:

$\frac{\partial L}{\partial \beta} = -\frac{1}{2\sigma^2}\left(-2X'y + 2X'X\beta\right) = 0 \;\rightarrow\; \hat{\beta} = (X'X)^{-1}X'y$

$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\hat{\varepsilon}'\hat{\varepsilon} = 0 \;\rightarrow\; \hat{\sigma}^2 = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}$
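A sketch that maximizes this Gaussian log likelihood numerically (simulated data; $\sigma^2$ is parameterized as $\exp(\cdot)$ to keep it positive) and recovers the closed-form $\hat{\beta}$ and $\hat{\sigma}^2$:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)   # made-up coefficients

def neg_loglik(params):
    b, log_s2 = params[:2], params[2]
    s2 = np.exp(log_s2)                       # ensures sigma^2 > 0
    e = y - X @ b
    return 0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_s2 + e @ e / (2 * s2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0, 0.0]))
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_ols
print(res.x[:2], beta_ols)                    # same beta
print(np.exp(res.x[2]), e_hat @ e_hat / n)    # same sigma^2 (divide by n, not n - K)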
Let's revisit Joint Normality and Linear Regression
Recall: if $y$ and $X$ are jointly normal then

$\begin{pmatrix} y_i \\ X_i' \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_y \\ \mu_X \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$

$E(y_i \mid X_i') = m(X_i') = \mu_y + \Sigma_{12}\Sigma_{22}^{-1}(X_i' - \mu_X), \qquad V(y_i \mid X_i') = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$

Compare to OLS:

$E(y \mid X) \;\rightarrow\; \hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y \;\rightarrow\; \hat{y}_i = X_i\left(\frac{1}{n}\sum_{i=1}^{n} X_i'X_i\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} X_i'y_i\right)$
An example of GAUSS code for OLS
Here is the basic code (a more complete file labeled topic1.prg does more things):

load z[] = topic1.csv;                    /* read the raw data as one long vector */
vars = 4;
z = reshape(z, rows(z)/vars, vars);       /* rearrange into an n x 4 data matrix */
n = rows(z);
y = z[.,1];                               /* first column is the dependent variable */
x = ones(rows(z),1)~z[.,2:cols(z)];       /* prepend a constant to the regressors */
beta = inv(x'x)*x'y;                      /* OLS: (X'X)^(-1) X'y */
beta;

669.74510  -1.3257657  -0.0034947061  1.8943746
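For readers who prefer Python, here is a rough numpy translation (a sketch only: it assumes a hypothetical topic1.csv holding the four columns testscr, str, expn_stu, avginc with no header):

import numpy as np

z = np.loadtxt("topic1.csv", delimiter=",")        # hypothetical file, 4 columns
y = z[:, 0]
X = np.column_stack([np.ones(len(z)), z[:, 1:]])   # prepend the constant term
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # should reproduce: 669.745, -1.3258, -0.003495, 1.8944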
Some Regression Output from STATA

. use "C:\Docs\teaching\140\STATA\caschool.dta"
. reg testscr str expn_stu avginc

      Source |       SS       df       MS              Number of obs =     420
    ---------+------------------------------           F(  3,   416) =  149.86
       Model |  79004.2997     3  26334.7666           Prob > F      =  0.0000
    Residual |   73105.294   416   175.73388           R-squared     =  0.5194
    ---------+------------------------------           Adj R-squared =  0.5159
       Total |  152109.594   419  363.030056           Root MSE      =  13.256

     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    ---------+----------------------------------------------------------------
         str |  -1.325765   .4368463    -3.03   0.003    -2.184466   -.4670634
    expn_stu |  -.0034947   .0013358    -2.62   0.009    -.0061205    -.000869
      avginc |   1.894375   .0945335    20.04   0.000     1.708552    2.080198
       _cons |   669.7451   13.97392    47.93   0.000     642.2768    697.2134
Regression Output from GAUSS

TOPIC 1 - OLS EXAMPLE USING GAUSS' BUILT-IN OLS ROUTINE

Valid cases:            420        Dependent variable:        Y
Missing cases:            0        Deletion method:           None
Total SS:        152109.594        Degrees of freedom:        416
R-squared:            0.519        Rbar-squared:              0.516
Residual SS:      73105.309        Std error of est:          13.256
F(3,416):           149.856        Probability of F:          0.000

                        Standard                 Prob    Standardized   Cor with
Variable    Estimate       Error     t-value    > |t|      Estimate     Dep Var
---------------------------------------------------------------------------------
CONSTANT  669.745098   13.973921   47.928215    0.000         ---          ---
X1         -1.325766    0.436846   -3.034856    0.003    -0.131636    -0.226363
X2         -0.003495    0.001336   -2.616202    0.009    -0.116275     0.191273
X3          1.894375    0.094534   20.039176    0.000     0.718432     0.712431
Measuring Goodness of Fit
Intuition: if the regression is really good, then the residuals will be very close to zero and the predictions of the dependent variable will be close to y most of the time.
R-squared is the standard measure of fit and is based on comparing the residual variance, or the prediction variance, with the variance of the dependent variable.
R-squared
Recall $\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Py$ and $y = Py + (I - P)y = Py + My$.

Definition:

$R^2 = \frac{y'Py}{y'y} = 1 - \frac{y'My}{y'y} \in [0, 1]$

where I use the properties $P' = P$, $PP = P$, $M' = M$ and $MM = M$.
Adjusted R-squared
Takes advantage of the different degrees-of-freedom adjustments in computing sample variances:

$\bar{R}^2 = 1 - \frac{\left(\sum_{i=1}^{n} \hat{\varepsilon}_i^2\right)/(n - k)}{\left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)/(n - 1)}$

so adding a regressor only raises $\bar{R}^2$ if it improves the fit enough to offset the lost degree of freedom (and, unlike $R^2$, it can even turn negative). Generally superior, but most programs still report both.
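A numpy sketch computing both measures on simulated data, using the centered formulas that regression software typically reports:

import numpy as np

rng = np.random.default_rng(7)
n, k = 420, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([650.0, -1.0, 2.0, 0.5]) + rng.normal(scale=13.0, size=n)   # made-up coefficients

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta
ss_res = e @ e                                  # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)            # total (centered) sum of squares

r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (ss_res / (n - k)) / (ss_tot / (n - 1))
print(r2, r2_adj)                               # r2_adj sits slightly below r2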