Multivariate Regression: Part I



1 Topic 1 Multivariate Regression: Part I ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà

2 Outline of this topic
- Statement of the objective: we want to explain the behavior of one variable as a function of other variables.
- Typical assumptions and why they are needed.
- Three approaches to pursue the objective given the assumptions:
  - Method of Moments
  - Ordinary Least Squares
  - Maximum Likelihood Estimation

3 Objective
We continue our evaluation of how to improve schools using the California data. We began by asking whether class size affects test scores, and we have data for both. However, there may be other explanatory factors and/or policy variables (e.g., an increase in overall expenditures per student). Here is a summary of the data.

4 Policy Evaluation: Test Scores and Class Size
The California Test Score Data Set (caschool.dta, STATA data file)
All K-6 and K-8 California school districts (n = 420)
Variables:
- 5th grade test scores (Stanford-9 achievement test, combined math and reading), district average (testscr)
- Student-teacher ratio = number of students in the district divided by the number of full-time equivalent teachers (str)
- District average parental income in thousands of dollars (avginc)
- Average expenditures per student in dollars (expn_stu)

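5 A look at the data
Basic statistics:
. sum testscr str expn_stu avginc
[Output table with columns Variable, Obs, Mean, Std. Dev., Min, Max and rows testscr, str, expn_stu, avginc; the numerical entries are missing from this copy.]
Correlation matrix:
. correlate testscr str expn_stu avginc
(obs=420)
[4 x 4 correlation matrix of testscr, str, expn_stu, avginc; the numerical entries are missing from this copy.]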

6 Statement of the Population Problem
It is natural to postulate that
$$\text{testscr}_i = \text{constant} + f(\text{str}_i, \text{expn\_stu}_i, \text{avginc}_i) + \text{error}_i, \quad i = 1, \ldots, 420.$$
A good place to begin is to assume this relation is linear. Using the more general notation we will use in the course:
$$y_i = x_{i1}\beta_1 + \cdots + x_{iK}\beta_K + \varepsilon_i, \quad i = 1, \ldots, n.$$
The left-hand side is the endogenous or dependent variable, the $x$'s are the regressors or explanatory variables, and the $\varepsilon_i$ are the residuals or error terms.

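7 Some Features of the Linear Regression Model
To investigate the problem we collect a random sample of data $\{y_i, x_{i1}, \ldots, x_{iK}\}_{i=1}^{n}$.
Some vector and matrix notation:
$$\underset{n \times 1}{y} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad \underset{n \times 1}{x_j} = \begin{bmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{bmatrix} \text{ for } j = 1, \ldots, K, \quad \underset{1 \times K}{X_i} = \begin{bmatrix} x_{i1} & \cdots & x_{iK} \end{bmatrix}, \quad \underset{n \times K}{X} = \begin{bmatrix} x_1 & \cdots & x_K \end{bmatrix}.$$
We use the convention that $x_1$ contains the constant term. So, in matrix notation, the linear regression model is
$$y = X\beta + \varepsilon.$$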

8 What we want using statistical concepts
Assuming $y$ and $X$ have a joint distribution, we want to make statements about the conditional mean of $y$ given $X$. Notice that:
. sum testscr if avginc < 15
. sum testscr if avginc > 40
[Each command reports Obs, Mean, Std. Dev., Min, Max of testscr for the selected districts; the numerical entries are missing from this copy.]
Mathematically:
$$m(X) = E(y \mid X) = \int_{-\infty}^{\infty} y\, f(y \mid X)\, dy.$$

9 The Regression Error
Given the previous definition, $\varepsilon = y - m(X)$. This implies the following properties for the regression error:
1. $E(\varepsilon \mid X) = 0$
2. $E(\varepsilon) = 0$
3. $E(h(X)'\varepsilon) = 0$ for any function $h(\cdot)$
4. $E(X'\varepsilon) = 0$
For example, to prove the first property:
$$E(\varepsilon \mid X) = E((y - m(X)) \mid X) = E(y \mid X) - E(m(X) \mid X) = m(X) - m(X) = 0.$$

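10 Prediction: min MSE
The conditional mean has the property that it minimizes the mean squared error (MSE) among all functions $g(\cdot)$:
$$E(y - g(X))^2 = E(\varepsilon + m(X) - g(X))^2$$
$$= E(\varepsilon^2) + 2E(\varepsilon(m(X) - g(X))) + E(m(X) - g(X))^2$$
$$= E(\varepsilon^2) + E(m(X) - g(X))^2 > E(\varepsilon^2) \quad \text{if } m(X) \neq g(X).$$
The cross term vanishes by property 3 of the regression error, with $h(X) = m(X) - g(X)$. Here I abuse notation to indicate that, e.g., $E(\varepsilon^2) = E(\varepsilon\varepsilon')$.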
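The MSE decomposition above can be checked on simulated data. The sketch below is only an illustration: the conditional mean $m(x) = x^2$, the competitor $g(x) = x$, the sample size and the seed are all assumptions chosen for the example, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
eps = rng.normal(size=n)           # error drawn independently of x, so E(eps | x) = 0
m = x ** 2                         # assumed conditional mean m(x), for illustration only
y = m + eps

g = x                              # a competing predictor g(x)
mse_m = np.mean((y - m) ** 2)      # sample analog of E(eps^2)
mse_g = np.mean((y - g) ** 2)      # sample analog of E(y - g(X))^2
gap = np.mean((m - g) ** 2)        # sample analog of E(m(X) - g(X))^2
cross = 2 * np.mean(eps * (m - g)) # cross term, close to zero in large samples
print(mse_m, mse_g, gap, cross)
```

In the sample, `mse_g` equals `mse_m + cross + gap` exactly, and `cross` shrinks toward zero as `n` grows, which is the sample counterpart of the argument on the slide.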

11 Conditional Variance
Just as we consider the conditional mean, we may explore how the variance of $y$ varies with $X$:
$$\Sigma(X) = V(y \mid X) = E(\varepsilon\varepsilon' \mid X).$$
When the variance is constant, so that
$$\Sigma(X) = E(\varepsilon\varepsilon' \mid X) = \sigma^2 I_n,$$
we say the error term is homoscedastic; otherwise we say it is heteroscedastic.

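12 Normality
If we assume $y$ and $X$ are jointly normally distributed, life gets easy (clearly a strong assumption). That is because we can use the projection formula for the joint normal to obtain the conditional mean of $y_i$ given $X_i$. Here is how: if
$$\begin{bmatrix} y_i \\ X_i' \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_y \\ \mu_X \end{bmatrix}, \begin{bmatrix} \sigma_{yy} & \Sigma_{yX} \\ \Sigma_{Xy} & \Sigma_{XX} \end{bmatrix} \right)$$
then
$$E(y_i \mid X_i) = m(X_i) = \mu_y + \Sigma_{yX}\Sigma_{XX}^{-1}(X_i' - \mu_X),$$
$$V(y_i \mid X_i) = \sigma_{yy} - \Sigma_{yX}\Sigma_{XX}^{-1}\Sigma_{Xy}.$$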

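13 Let's make some assumptions
1. Linearity: $y = X\beta + \varepsilon$
2. Full rank: $X$ is an $n \times K$ matrix with rank $K$
3. Zero conditional mean:
$$E(\varepsilon \mid X) = \begin{bmatrix} E[\varepsilon_1 \mid X] \\ \vdots \\ E[\varepsilon_n \mid X] \end{bmatrix} = 0,$$
hence $E[\varepsilon] = 0$ and $E[y \mid X] = X\beta$
4. Homoscedasticity: $V(\varepsilon \mid X) = \sigma^2 I_n$, hence $V(\varepsilon_i \mid X) = \sigma^2$ and $Cov(\varepsilon_i, \varepsilon_j \mid X) = 0$ for all $i \neq j$
5. Normality: $\varepsilon \mid X \sim N(0, \sigma^2 I_n)$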

14 Checking the assumptions in the data
1. Linearity: could be problematic theoretically.
. sum testscr if str <= 17
. sum testscr if str > 17 & str <= 20
. sum testscr if str >= 22.8
. sum testscr if str < 22.8 & str >= 19.8
[Each command reports Obs, Mean, Std. Dev., Min, Max of testscr; the numerical entries are missing from this copy.]
but testscr(14-17) - testscr(17-20) = 3.8 and testscr(20-23) - testscr(23-26) = [value missing]

15 More Checking
2. $X$ is full rank: this means that one (or more) regressors cannot be exact linear combinations of the others. The easiest check is to look at the correlation matrix of $X$:
. correlate str expn_stu avginc
(obs=420)
[3 x 3 correlation matrix of str, expn_stu, avginc; the numerical entries are missing from this copy.]
Later we will discuss slightly more sophisticated ways of checking this.

16 Final Checks
Assumptions 3 (residuals have zero conditional mean) and 4 (homoscedasticity) we cannot check just yet.
Assumption 5 is normality. This we can check with Jarque-Bera statistics and also by looking at some histogram/density plots.
[Figure: density plot of the average test score, (read_scr + math_scr)/2.]

17 Why do we make these assumptions?
Linearity: not as strict as it sounds. The usual example is a Cobb-Douglas production function:
$$Y = A L^{\beta_l} K^{\beta_k} \;\rightarrow\; \log(Y) = \log(A) + \beta_l \log(L) + \beta_k \log(K),$$
which has the form
$$y_i = \beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \varepsilon_i.$$
Beyond that, we will discuss what to do with truly nonlinear specifications later. For now, linearity makes derivations very convenient by using projection arguments.

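18 Multicollinearity
$X$ is a full rank matrix: easy, we cannot really identify the parameters otherwise. An example: suppose there are 3 regressors such that $x_1 = x_2 + x_3$. Then
$$y = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon$$
$$y = (x_2 + x_3)\beta_1 + x_2\beta_2 + x_3\beta_3 + \varepsilon$$
$$y = x_2(\beta_1 + \beta_2) + x_3(\beta_1 + \beta_3) + \varepsilon$$
$$y = x_2\gamma_2 + x_3\gamma_3 + \varepsilon$$
with $\gamma_2 = \beta_1 + \beta_2$ and $\gamma_3 = \beta_1 + \beta_3$, which means that $\beta_1$, $\beta_2$ and $\beta_3$ cannot be separately identified. Mechanically, we run into numerical problems. Exact collinearity is easy to detect, but approximate collinearity can affect regression results as well.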
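The identification failure can be seen numerically. This is a minimal sketch on simulated regressors (the data, seed and the two coefficient vectors are hypothetical): with $x_1 = x_2 + x_3$ the matrix $X$ has rank 2 rather than 3, and two different coefficient vectors that share the same sums $\beta_1 + \beta_2$ and $\beta_1 + \beta_3$ produce identical fitted values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = x2 + x3                       # exact linear combination of the other two
X = np.column_stack([x1, x2, x3])

rank = np.linalg.matrix_rank(X)    # column rank of X
cond = np.linalg.cond(X.T @ X)     # condition number of X'X, huge when X'X is singular

# Two coefficient vectors with the same beta_1 + beta_2 and beta_1 + beta_3
b_a = np.array([1.0, 2.0, 3.0])
b_b = np.array([0.0, 3.0, 4.0])
same_fit = np.allclose(X @ b_a, X @ b_b)
print(rank, cond, same_fit)
```

Because the data cannot distinguish `b_a` from `b_b`, no estimator can recover the three coefficients separately; inverting `X.T @ X` here would either fail or return numerically meaningless values.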

19 Conditional mean-zero errors
This is a critical assumption. As we will see, it ensures that the model is properly specified and that the parameter estimates tend to their true values.
Reasons why this assumption may not hold in practice have to do with misspecification problems: e.g., omitted variable bias, errors-in-variables and endogeneity (which only really applies when we want to emphasize the analysis of causal relations as opposed to simple correlations).

20 Homoscedasticity
This assumption is often violated. However, it is easy to relax. It will not affect parameter estimates but it will affect how their standard errors are calculated (i.e., the efficiency of the estimator).
. sum testscr if str <= 17
. sum testscr if str > 17 & str <= 20
. sum testscr if str >= 22.8
. sum testscr if str < 22.8 & str >= 19.8
[Each command reports Obs, Mean, Std. Dev., Min, Max of testscr; the numerical entries are missing from this copy.]

21 Normality/Gaussianity
Assuming the data are Gaussian allows us to use well-known projection formulas and to derive finite-sample statistics.
However, the data are often not Gaussian. It turns out that the thought experiment of increasing the sample size to infinity will allow us to use some probability limit theory under which the estimators will have a Normal distribution.
Hence the importance of having a random sample.

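22 Random Sample
Let $\{w_i\}_{i=1}^{n} = \{y_i, X_i\}_{i=1}^{n}$ be i.i.d. Then
$$f(w_1, \ldots, w_n) = f_1(w_1; \theta_1) \cdots f_i(w_i \mid w_{i-1}, \ldots, w_1; \theta_i) \cdots f_n(w_n \mid w_{n-1}, \ldots, w_1; \theta_n)$$
$$= f(w_1; \theta) \cdots f(w_n; \theta).$$
The first line is the general factorization of a joint density into conditional densities. Independence removes the conditioning on earlier observations, and identical distribution gives every term the same density $f$ and parameter $\theta$, which yields the second line.
In time series, as long as the amount of dependence is limited, one can relax the independence assumption.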

23 Where are we so far?
We have postulated a population model of how $y$ relates to $X$:
$$y_i = \beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i, \quad i = 1, \ldots, n.$$
We have a random sample: $\{y_i, X_i\}_{i=1}^{n}$.
Now we want to obtain the distribution of the parameter estimates. The mean of the distribution is the parameter estimate, and knowing the distribution is vital to do inference:
$$\hat\beta \sim D(\beta, \Sigma).$$

24 Method of Moments
Let's try to figure out how to estimate $\beta$. We will use the method of moments approach first. It relies on the analogy principle: translate a population moment condition into its equivalent sample moment condition (think LLN). For example:
$$E(y - \mu_y) = 0 \;\rightarrow\; \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu_y) = 0 \;\Rightarrow\; \hat\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i,$$
and, by the LLN, $\frac{1}{n}\sum_{i=1}^{n} y_i \rightarrow \mu_y$ as $n \rightarrow \infty$.

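25 Deriving the MM estimator for linear regression
Recall, one of the key assumptions in the linear regression model is:
$$E(\varepsilon \mid X) = 0 \;\rightarrow\; E(X'\varepsilon) = 0 \;\rightarrow\; \frac{\sum_{i=1}^{n} X_i'\varepsilon_i}{n} = \frac{X'\varepsilon}{n} = 0$$
with $y = X\beta + \varepsilon$. Hence:
$$E(X'\varepsilon) = E(X'(y - X\beta)) = 0 \;\rightarrow\; \frac{X'y}{n} - \frac{X'X}{n}\hat\beta = 0$$
$$\hat\beta = (X'X)^{-1}X'y.$$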
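The sample moment condition can be solved directly as a linear system. The following is a minimal sketch on simulated data; the coefficients in `beta_true`, the sample size and the seed are hypothetical and only serve to show that the estimator sets the sample moments to zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 5000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # x_1 is the constant
beta_true = np.array([2.0, -1.0, 0.5])   # hypothetical population coefficients
y = X @ beta_true + rng.normal(size=n)

# Sample moment condition: X'(y - X b)/n = 0  =>  (X'X/n) b = X'y/n
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)
moments = X.T @ (y - X @ beta_hat) / n   # numerically zero at the solution
print(beta_hat, moments)
```

Solving the system with `np.linalg.solve` avoids forming the explicit inverse of $X'X$, which is a common numerical preference; the result is the same $(X'X)^{-1}X'y$ as on the slide.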

26 Least Squares
[Figure: Linear Regression: Test Scores and Student-to-Teacher Ratio. Scatter plot of the average test score, (read_scr + math_scr)/2, against the student-to-teacher ratio, with the fitted values.]

27 Deriving the OLS estimator
Consider the problem of minimizing the distance of the observations with respect to the regression line.
Since we care about the distance but not the sign of the error, we could use absolute values. This gives rise to the LAD estimator, but it is not convenient because the objective function is not differentiable everywhere.
Instead, by squaring the distance, the objective function can be optimized using derivative methods.

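28 Derivation of OLS
Objective:
$$\min_\beta S(\beta) = E(\varepsilon_i^2) \;\rightarrow\; \min_\beta \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - X_i\beta)^2.$$
In matrix algebra:
$$\min_\beta S(\beta) = \frac{\varepsilon'\varepsilon}{n} = \frac{(y - X\beta)'(y - X\beta)}{n}.$$
General result: suppose $f(\beta)$ is a real-valued scalar function of $\beta$. A necessary condition for a local optimum at $\beta = \hat\beta$ is that the gradient vanishes there:
$$\frac{\partial f(\hat\beta)}{\partial \beta} = 0.$$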

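29 Derivation of OLS (cont.)
If the Hessian
$$\frac{\partial^2 f}{\partial\beta\,\partial\beta'} = \begin{bmatrix} \frac{\partial^2 f}{\partial\beta_1\partial\beta_1} & \cdots & \frac{\partial^2 f}{\partial\beta_1\partial\beta_K} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial\beta_K\partial\beta_1} & \cdots & \frac{\partial^2 f}{\partial\beta_K\partial\beta_K} \end{bmatrix}$$
is positive definite at a stationary point $\hat\beta$, then $\hat\beta$ is a local minimum.
Rules of matrix differentiation:
$$\frac{\partial(\beta'A)}{\partial\beta} = A; \quad \frac{\partial(A\beta)}{\partial\beta} = A'; \quad \frac{\partial(\beta'A\beta)}{\partial\beta} = (A + A')\beta; \quad \frac{\partial(\beta'A\beta)}{\partial\beta'} = \beta'(A + A').$$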

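30 Derivation of OLS (cont.)
Recall:
$$\min_\beta S(\beta) = \frac{\varepsilon'\varepsilon}{n} = \frac{(y - X\beta)'(y - X\beta)}{n} = \frac{y'y}{n} - \frac{\beta'X'y}{n} - \frac{y'X\beta}{n} + \frac{\beta'X'X\beta}{n}.$$
Applying the rules of matrix differentiation:
$$\frac{\partial S(\hat\beta)}{\partial\beta} = 0 \;\rightarrow\; -\frac{X'y}{n} - \left(\frac{y'X}{n}\right)' + \left(\frac{X'X}{n} + \left(\frac{X'X}{n}\right)'\right)\hat\beta = -2\frac{X'y}{n} + 2\frac{X'X}{n}\hat\beta = 0$$
$$\hat\beta = (X'X)^{-1}X'y$$
$$\frac{\partial^2 S(\beta)}{\partial\beta\,\partial\beta'} = 2\frac{X'X}{n},$$
which is positive definite when $X$ has full column rank.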

31 Remarks
- The no-multicollinearity assumption ensures $X'X$ is invertible.
- $\hat\varepsilon = y - X\hat\beta = y - X(X'X)^{-1}X'y = My$. $M$ is both symmetric and idempotent ($M = M'$ and $M = M^2$), and $MX = 0$.
- $\hat{y} = y - \hat\varepsilon = (I - M)y = X(X'X)^{-1}X'y = Py$, where $P$ is called the projection matrix.
- $X'\hat\varepsilon = X'My = 0$: by construction, the residuals are uncorrelated with the regressors.
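These algebraic properties are easy to verify numerically. The sketch below uses a small simulated design (sizes, seed and data are hypothetical) and builds $P$ and $M$ explicitly, which is fine for illustration but would be wasteful for large $n$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix X(X'X)^{-1}X'
M = np.eye(n) - P                       # residual-maker matrix I - P
resid = M @ y                           # OLS residuals
fitted = P @ y                          # OLS fitted values

checks = {
    "M symmetric": np.allclose(M, M.T),
    "M idempotent": np.allclose(M @ M, M),
    "MX = 0": np.allclose(M @ X, 0),
    "X'e = 0": np.allclose(X.T @ resid, 0),
    "y = Py + My": np.allclose(fitted + resid, y),
}
print(checks)
```

Note that none of the checks depends on how `y` was generated: they are properties of the projection, not of the model being correct.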

32 Maximum Likelihood Estimator
Assuming the random sample $\{y_i, X_i\}_{i=1}^{n}$ is normally distributed, and since the $\hat\beta$ are a linear combination of these, they will have a multivariate Gaussian distribution.
Further, we know that the residuals are mean zero. And under the assumption of homoscedasticity, their covariance matrix is $\Sigma = \sigma^2 I$.
The multivariate normal density is
$$f(\varepsilon; \Sigma) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}(\varepsilon - \mu)'\Sigma^{-1}(\varepsilon - \mu)\right\}.$$

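33 MLE
Taking the log (to construct the log-likelihood function) and using the assumptions of the linear regression model:
$$L(\varepsilon; \Sigma) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\varepsilon'\varepsilon$$
$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$
Take derivatives with respect to $\beta$ and $\sigma^2$:
$$\frac{\partial L}{\partial\beta} = -\frac{1}{2\sigma^2}\left(-2X'y + 2X'X\hat\beta\right) = 0 \;\rightarrow\; \hat\beta = (X'X)^{-1}X'y$$
$$\frac{\partial L}{\partial\sigma^2} = -\frac{n}{2\hat\sigma^2} + \frac{\hat\varepsilon'\hat\varepsilon}{2\hat\sigma^4} = 0 \;\rightarrow\; \hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n}.$$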
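A quick numerical illustration of the two first-order conditions follows. It is a sketch under assumed values (the simulated design, coefficients, error scale and seed are all hypothetical), and the helper `loglik` is introduced here only for the example. It also reports the familiar degrees-of-freedom corrected variance estimate for comparison with the MLE, which divides by $n$.

```python
import numpy as np

def loglik(beta, sigma2, y, X):
    """Gaussian log-likelihood of the linear model with homoscedastic errors."""
    n = len(y)
    e = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - (e @ e) / (2 * sigma2)

rng = np.random.default_rng(4)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # MLE of beta coincides with OLS
e_hat = y - X @ beta_hat
sigma2_mle = (e_hat @ e_hat) / n               # MLE of sigma^2 divides by n
s2_corrected = (e_hat @ e_hat) / (n - K)       # degrees-of-freedom corrected version

ll_max = loglik(beta_hat, sigma2_mle, y, X)
ll_other = loglik(beta_hat + 0.1, sigma2_mle * 1.2, y, X)  # any other point is lower
print(sigma2_mle, s2_corrected, ll_max, ll_other)
```

Moving either argument away from the estimates lowers the log-likelihood, as the first-order conditions on the slide imply.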

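34 Let's revisit Joint Normality and Linear Regression
Recall: if $y$ and $X$ are jointly normal,
$$\begin{bmatrix} y_i \\ X_i' \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_y \\ \mu_X \end{bmatrix}, \begin{bmatrix} \sigma_{yy} & \Sigma_{yX} \\ \Sigma_{Xy} & \Sigma_{XX} \end{bmatrix} \right),$$
then
$$E(y_i \mid X_i) = m(X_i) = \mu_y + \Sigma_{yX}\Sigma_{XX}^{-1}(X_i' - \mu_X), \qquad V(y_i \mid X_i) = \sigma_{yy} - \Sigma_{yX}\Sigma_{XX}^{-1}\Sigma_{Xy}.$$
Compare to OLS:
$$E(y \mid X) \;\rightarrow\; \hat{y} = X\hat\beta = X(X'X)^{-1}X'y$$
$$\hat{y}_i = X_i\left(\frac{1}{n}\sum_{j=1}^{n} X_j'X_j\right)^{-1}\left(\frac{1}{n}\sum_{j=1}^{n} X_j'y_j\right).$$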
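The correspondence between the projection formula and OLS can be made concrete by replacing population moments with sample moments. This is an illustrative sketch on simulated data (coefficients, sample size and seed are hypothetical): the slope computed from the sample covariance matrix, together with the implied intercept, matches the OLS coefficients from a regression that includes a constant.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
Xr = rng.normal(size=(n, 2))                    # regressors without the constant
y = 1.0 + Xr @ np.array([0.8, -0.3]) + rng.normal(size=n)

# Sample analog of mu_y + Sigma_yX Sigma_XX^{-1} (X - mu_X)
S = np.cov(np.column_stack([y, Xr]), rowvar=False)
S_Xy, S_XX = S[1:, 0], S[1:, 1:]
slope = np.linalg.solve(S_XX, S_Xy)             # sample Sigma_XX^{-1} Sigma_Xy
intercept = y.mean() - Xr.mean(axis=0) @ slope  # sample mu_y - mu_X' slope

# OLS with a constant in the first column
X = np.column_stack([np.ones(n), Xr])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(intercept, slope, beta_hat)
```

The equality is algebraic, so it holds in any sample with a full-rank design, whether or not the data are actually Gaussian.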

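35 An example of GAUSS code for OLS
Here is the basic code (a more complete file labeled topic1.prg does more things):
load z[] = topic1.csv;               /* read the data as one long vector */
vars = 4;                            /* number of variables (columns) */
z = reshape(z,rows(z)/vars,vars);    /* reshape to an n x vars matrix */
n = rows(z);                         /* sample size */
y = z[.,1];                          /* dependent variable: first column */
x = ones(rows(z),1)~z[.,2:cols(z)];  /* constant plus the remaining columns */
beta = inv(x'x)*x'y;                 /* OLS: (X'X)^{-1} X'y */
beta;                                /* print the estimates */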

36 Some Regression Output from STATA
. use "C:\Docs\teaching\140\STATA\caschool.dta"
. reg testscr str expn_stu avginc
[Regression output: Number of obs = 420; F(3, 416), Prob > F, R-squared, Adj R-squared and Root MSE; Model, Residual and Total sums of squares; and the Coef., Std. Err., t, P>|t| and 95% confidence interval columns for str, expn_stu, avginc and _cons. The numerical entries are missing from this copy.]

37 Regression Output from GAUSS
TOPIC 1 OLS EXAMPLE USING GAUSS' BUILT IN OLS ROUTINE
Valid cases: 420
Dependent variable: Y
Missing cases: 0
Deletion method: None
Degrees of freedom: 416
[The output also reports Total SS, Residual SS, R-squared, Rbar-squared, Std error of est, F(3,416) and Probability of F, followed by a table with columns Variable, Estimate, Standard Error, t-value, Prob > |t|, Standardized Estimate and Cor with Dep Var for CONSTANT and the three regressors. The numerical entries are missing from this copy.]

38 Measuring Goodness of Fit
Intuition: if the regression is really good, then the residuals will be very close to zero and the predictions of the dependent variable will be close to $y$ most of the time.
R-squared is the standard measure of fit. It is based on comparing the residual variance, or the prediction variance, with the variance of the dependent variable.

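39 R-squared
Recall $\hat{y} = X\hat\beta = X(X'X)^{-1}X'y = Py$ and $y = Py + (I - P)y = Py + My$.
Definition:
$$R^2 = \frac{y'Py}{y'y} = 1 - \frac{y'My}{y'y} \in [0, 1],$$
where I use the properties $P = P'$ and $PP = P$; $M = M'$ and $MM = M$. Written this way, with $y$ not measured in deviations from its mean, this is the uncentered version of $R^2$.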

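40 Adjusted R-squared
Takes advantage of the different degrees-of-freedom adjustments in computing sample variances:
$$\bar{R}^2 = 1 - \frac{\left(\sum_{i=1}^{n}\hat\varepsilon_i^2\right)/(n - k)}{\left(\sum_{i=1}^{n}(y_i - \bar{y})^2\right)/(n - 1)},$$
where $k$ is the number of regressors, including the constant. Unlike $R^2$, the adjusted version is bounded above by 1 but can be negative when the fit is very poor.
Generally superior, but most programs still report both.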
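The fit measures can be computed side by side from the OLS residuals. The sketch below uses simulated data (design, coefficients and seed are hypothetical) and also reports the centered $R^2$, which measures $y$ in deviations from its sample mean and is the version most packages print.

```python
import numpy as np

rng = np.random.default_rng(6)
n, K = 120, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, 0.0, -0.2]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat
rss = e_hat @ e_hat                             # residual sum of squares, y'My
tss = np.sum((y - y.mean()) ** 2)               # total sum of squares around the mean

r2_uncentered = 1 - rss / (y @ y)               # 1 - y'My / y'y
r2_centered = 1 - rss / tss                     # uses deviations from the mean
r2_adj = 1 - (rss / (n - K)) / (tss / (n - 1))  # degrees-of-freedom adjusted
print(r2_uncentered, r2_centered, r2_adj)
```

With more than one regressor and an imperfect fit, the adjusted measure is always below the centered $R^2$, because the residual variance is divided by the smaller $n - K$.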
