THE MULTIVARIATE LINEAR REGRESSION MODEL
- Samson Lawrence Lee
1 THE MULTIVARIATE LINEAR REGRESSION MODEL
2 Why multiple regression analysis?
Model with more than one independent variable: y = β0 + β1x1 + β2x2 + u
It allows:
- Controlling for other factors, to get a ceteris paribus effect. Ex: y: wage, x1: education, x2: IQ => IQ is no longer part of u => better job at inferring causality.
- Better predictions: more of the variation in y can be explained.
3 Why multiple regression analysis? (2)
It also allows:
- Estimating non-linear relationships. Ex: quadratic relationship between wage and experience:
wage = β0 + β1·exper + β2·exper² + u
Careful: no ceteris paribus interpretation here!
- Testing joint hypotheses on the parameters.
Key assumption: E(u | x1, x2) = 0
4 Example: Determinants of wage
Source: Wooldridge, WAGE1.dta (data from 1976 Current Population Survey)
Population model: wage = β0 + β1·educ + β2·exper + u
. use WAGE1.dta
. sum wage educ exper
[summary statistics: Obs, Mean, Std. Dev., Min, Max for wage, educ, exper]
. corr educ exper
(obs=526)
[correlation matrix for educ and exper]
5 Example: Determinants of wage (2)
. reg wage educ
[regression output: Number of obs = 526, F(1, 524), R-squared, coefficient table for educ and _cons]
. reg wage educ exper
[regression output: Number of obs = 526, F(2, 523), R-squared, coefficient table for educ, exper and _cons]
6 Example: Determinants of wage (3)
Interpretation: A one-year increase in education is predicted to increase hourly wage by 64 cents, ceteris paribus. An additional year of experience is predicted to increase wage by 7 cents, ceteris paribus.
Compared with the results of the bivariate model, we now obtain a higher estimate of the returns to education. We suspect the results of the bivariate case to be biased, since experience is correlated with education, and experience affects wage too. I.e., the zero conditional mean assumption was likely violated in the bivariate case. In other words: in the bivariate case, the impact of education accounted for the impact of experience as well. As the correlation between the two variables is negative, the estimate of the impact of education on wage was downward biased.
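As a quick numerical illustration of this bias direction (a Python/numpy sketch with made-up data, not the CPS data): when the omitted variable has a positive effect (β2 > 0) and is negatively correlated with the included one, the short regression understates β1.

```python
import numpy as np

# Illustrative simulation: true model y = 1 + 2*x1 + 3*x2,
# with x1 and x2 negatively (but not perfectly) correlated.
x1 = np.arange(10, dtype=float)              # plays the role of "education"
x2 = 9 - x1 + 0.5 * (np.arange(10) % 2)      # plays the role of "experience"
y = 1 + 2 * x1 + 3 * x2                      # no error term, for a clean illustration

# Long regression: y on (1, x1, x2) recovers the true coefficients.
X_long = np.column_stack([np.ones(10), x1, x2])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression: y on (1, x1) only -- x2 is omitted.
X_short = np.column_stack([np.ones(10), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# With beta2 > 0 and Corr(x1, x2) < 0, the short-regression slope
# is biased downward relative to the true beta1 = 2.
print(b_long[1], b_short[1])
```

The short-regression slope comes out well below 2, mirroring the downward bias described above.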
7 Example: introducing quadratics
What if the impact of a variable is not constant?
wage = β0 + β1·exper + β2·exper² + u
Introducing quadratics allows us to:
- Model an increasing or decreasing effect of experience as experience increases:
wage-hat = β̂0 + β̂1·exper + β̂2·exper²
- Determine the turning point of the effect: exper* = |β̂1 / (2β̂2)|
8 . list exper* in 1/10
[first 10 observations of exper and expersq]
. reg wage exper*
[regression output: Number of obs = 526, F(2, 523), R-squared, coefficient table for exper, expersq and _cons]
Interpretation: For low levels of experience, wage is predicted to increase with experience, ceteris paribus. The negative sign on the squared term indicates, however, that as the number of years of experience increases, the returns to an additional year decrease. In fact we can calculate the turning point, i.e. the point where the marginal returns to experience are 0. This happens at .298/(2*.006), i.e. approximately at 25 years of experience.
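The turning-point arithmetic can be checked directly from the coefficients reported on the slide (.298 on exper, −.006 on expersq):

```python
# Turning point of wage-hat = b0 + b1*exper + b2*exper^2:
# d(wage-hat)/d(exper) = b1 + 2*b2*exper = 0  =>  exper* = -b1 / (2*b2).
b1, b2 = 0.298, -0.006          # coefficients reported on the slide
turning_point = -b1 / (2 * b2)
print(turning_point)            # roughly 24.8 years of experience
```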
9 Stata commands:
. scatter wage exper || qfit wage exper, name(multiple)
. scatter wage exper || lfit wage exper, name(simple)
. graph combine multiple simple, saving(simple_multiple)
(file simple_multiple.gph saved)
[two panels: wage and fitted values plotted against exper, with quadratic and linear fits]
10 The model with k independent variables
The general multiple linear regression model (also called the multiple regression model) can be written in the population as:
y = β0 + β1x1 + β2x2 + ... + βkxk + u
Notation: x1, x2, ..., xk are the independent variables, with k the number of independent variables, and xik the value of variable xk for observation i.
Key assumption: E(u | x1, x2, ..., xk) = 0
11 Deriving the OLS estimates
The estimated model is: ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk
We want to estimate β̂0, β̂1, ..., β̂k => k+1 OLS estimates.
Minimize the sum of squared residuals:
Min over (β̂0, β̂1, ..., β̂k) of Σi=1..n (yi − β̂0 − β̂1xi1 − ... − β̂kxik)²
=> First order conditions (using calculus, see Appendix 3A) give k+1 linear equations with k+1 unknowns: β̂0, β̂1, ..., β̂k.
12 Interpretation of OLS estimates
Estimated model: ŷ = β̂0 + β̂1x1 + β̂2x2 + ... + β̂kxk (3.11)
How do we interpret β̂1, β̂2, ..., β̂k?
We can obtain from (3.11) the predicted change in y given changes in the xi:
Δŷ = β̂1Δx1 + β̂2Δx2 + ... + β̂kΔxk
The coefficient on x1 measures the change in ŷ due to a one-unit increase in x1, holding all other independent variables fixed. That is, if we hold x2, x3, ..., xk constant:
Δŷ = β̂1Δx1 => allows ceteris paribus estimation, even if data were not collected this way!!
13 OLS Fitted Values and Residuals
For observation i, the fitted value is simply: ŷi = β̂0 + β̂1xi1 + ... + β̂kxik
The actual value yi will not in general equal the predicted value. Residual: ûi = yi − ŷi
The fitted values and residuals have the same properties as in the simple regression case:
- The sample average of the residuals is zero.
- The sample covariance between each xi and the residuals is zero => between fitted values and residuals also.
- The point of averages (x̄1, ..., x̄k, ȳ) is always on the regression line.
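These algebraic properties can be verified numerically on any dataset; a small numpy sketch with arbitrary toy data:

```python
import numpy as np

# Arbitrary toy data; the residual properties hold by OLS algebra, not by luck.
x1 = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
x2 = np.array([1., 0., 3., 2., 5., 4., 7., 6.])
y = np.array([2., 1., 5., 7., 8., 12., 11., 17.])

X = np.column_stack([np.ones(8), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b
u_hat = y - y_hat

print(u_hat.mean())             # ~0: residuals average to zero
print(x1 @ u_hat, x2 @ u_hat)   # ~0: residuals orthogonal to each regressor
print(y_hat @ u_hat)            # ~0: hence orthogonal to the fitted values too
print(y_hat.mean(), y.mean())   # equal: the point of averages is on the line
```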
14 Simple vs. Multiple regression estimates
Simple regression model: ỹ = β̃0 + β̃1x1
Multiple regression model: ŷ = β̂0 + β̂1x1 + β̂2x2
β̃1 = β̂1 if:
- the partial effect of x2 is zero in the sample, or
- x1 and x2 are uncorrelated in the sample.
β̃1 ≈ β̂1 if:
- the partial effect of x2 is small in the sample, or
- x1 and x2 are weakly correlated in the sample.
15 How good is the estimation at explaining the dependent variable?
Measure of sample variation: Total Sum of Squares: SST = Σi=1..n (yi − ȳ)²
Part that is explained by x: Explained Sum of Squares: SSE = Σi=1..n (ŷi − ȳ)²
Part that is unexplained by x: Residual Sum of Squares: SSR = Σi=1..n ûi²
Just as in the simple regression case, SST = SSE + SSR.
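The decomposition SST = SSE + SSR (which holds whenever the regression includes an intercept) is easy to verify numerically; a small sketch with toy data:

```python
import numpy as np

# Toy bivariate data; the decomposition requires an intercept in the model.
x = np.array([1., 2., 3., 4., 5., 6.])
y = np.array([2., 1., 4., 3., 7., 6.])
X = np.column_stack([np.ones(6), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ b

SST = np.sum((y - y.mean()) ** 2)       # total variation in y
SSE = np.sum((y_hat - y.mean()) ** 2)   # explained variation
SSR = np.sum((y - y_hat) ** 2)          # residual variation
print(SST, SSE + SSR)                   # equal up to rounding
```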
16 Goodness of fit: the R-squared
R² = SSE/SST = 1 − SSR/SST
R² is the proportion of the sample variation in yi that is explained by the OLS regression line.
R² lies between 0 and 1. A higher value indicates a better fit, but: R² never decreases, and it usually increases when another independent variable is added to a regression => a poor tool for deciding which model to choose. We will need another criterion to decide whether to include a variable.
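A sketch of why R² is a poor model-selection tool: adding a regressor can only raise it, even one chosen to be irrelevant (toy data, hypothetical "junk" variable):

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SSR/SST for an OLS fit of y on X (X includes a constant)."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    return 1 - (u @ u) / np.sum((y - y.mean()) ** 2)

x1 = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
junk = np.array([3., 1., 4., 1., 5., 9., 2., 6.])  # unrelated regressor
y = np.array([1., 3., 2., 5., 4., 7., 6., 9.])

r2_small = r_squared(np.column_stack([np.ones(8), x1]), y)
r2_big = r_squared(np.column_stack([np.ones(8), x1, junk]), y)
print(r2_small, r2_big)   # r2_big >= r2_small even though 'junk' is irrelevant
```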
17 Example: explaining arrest records
Population model: narr86 = β0 + β1·pcnv + β2·ptime86 + β3·qemp86 + u
First, we estimate the model without the variable avgsen. We obtain:
. use
. reg narr86 pcnv ptime86 qemp86
[regression output: Number of obs = 2725, F(3, 2721), R-squared, Root MSE = .8416, coefficient table for pcnv, ptime86, qemp86 and _cons]
18 So we obtain the estimated equation:
narr86-hat = β̂0 − .150·pcnv − .034·ptime86 + β̂3·qemp86
n = 2,725, R² = .0413
The three variables pcnv, ptime86, and qemp86 explain about 4.1 percent of the variation in narr86.
What happens if pcnv increases by .5 (i.e., by 50 percentage points)? Δnarr86-hat = −.150(.5) = −.075, so predicted arrests fall by .075.
What happens if ptime86 increases from 0 to 12? Predicted arrests for a particular man fall by .034(12) = .408.
What if we include avgsen in the model?
19 . reg narr86 avgsen pcnv ptime86 qemp86
[regression output: Number of obs = 2725, F(4, 2720), R-squared, coefficient table for avgsen, pcnv, ptime86, qemp86 and _cons]
R² increases from .0413 to .0422, a practically small effect. The sign of the coefficient on avgsen is also unexpected: a longer average sentence length increases criminal activity.
=> What should we conclude about the two models?
20 Unbiasedness of OLS
Remember the assumptions:
- Linearity (in parameters!): y = β0 + β1x1 + ... + βkxk + u
- Random sampling: yi = β0 + β1xi1 + ... + βkxik + ui, i = 1, 2, ..., n
- Zero conditional mean: E(u | x1, x2, ..., xk) = 0
- No perfect collinearity: in the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables.
Using all these assumptions we can prove the first important statistical property of OLS: unbiasedness.
E(β̂j) = βj, j = 1, 2, ..., k
21 Violations of zero conditional mean
The ZCM assumption will not be true if the functional relationship between the explained and explanatory variables is misspecified in the equation:
Ex 1: True model: cons = β0 + β1·inc + β2·inc² + u
Estimated model: cons = β0 + β1·inc + u
Ex 2: True model: log(wage) = β0 + β1·educ + u
Estimated model: wage = β0 + β1·educ + u
It will also fail if we omit a variable that is correlated with the included xj => endogeneity.
22 Violations of no perfect collinearity
The assumption is violated if there exist (a, b) such that x1 = a + b·x2:
- One variable can't be a constant multiple of another. (Ex: inc and inc² are ok, but log(inc) and log(inc²) are not ok, since log(inc²) = 2·log(inc).)
- One variable can't be the sum of some of the others.
- When variables are shares: can't include all the shares (together with a constant).
Practical note: Stata will not estimate models with perfect collinearity. Solution: drop one of the perfectly collinear variables!
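The shares case can be sketched in a few lines of numpy (toy share values, in the spirit of the vote-share example that follows): two shares summing to 100 are perfectly collinear with the constant, so the design matrix loses rank and X'X cannot be inverted.

```python
import numpy as np

# Toy shares: shareb = 100 - sharea, exactly.
sharea = np.array([30., 45., 50., 60., 75.])
shareb = 100 - sharea

# Design matrix with a constant and both shares.
X = np.column_stack([np.ones(5), sharea, shareb])

# Rank is 2, not 3: X'X is singular, so OLS has no unique solution
# (Stata reacts by dropping one of the collinear variables).
print(np.linalg.matrix_rank(X))
```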
23 The no-perfect-collinearity assumption can also fail if n < k+1: to estimate k+1 parameters, we need at least k+1 observations. Bad luck in collecting the sample.
24 Examples of perfect collinearity: Voting outcomes and campaign expenditures
Source: Wooldridge, VOTE1.dta (from M. Barone and G. Ujifusa, The Almanac of American Politics, Washington, DC: National Journal). Two-party races for the US House of Representatives.
. bcuse vote1
. ge shareb = 100 - sharea
Data description:
votea: percent vote for A
expenda: campaign expends. by A, $1000s
expendb: campaign expends. by B, $1000s
sharea: 100*(expendA/(expendA+expendB))
. su votea expenda expendb sharea shareb
[summary statistics: Obs, Mean, Std. Dev., Min, Max for each variable]
25 . reg votea sharea shareb
note: sharea omitted because of collinearity
[regression output: Number of obs = 173, F(1, 171), R-squared, coefficient table: sharea (omitted), shareb, _cons]
. reg votea shareb
[regression output: Number of obs = 173, coefficient table for shareb and _cons]
. reg votea sharea
[regression output: Number of obs = 173, coefficient table for sharea and _cons]
26 Interpretation: The variables sharea and shareb are perfectly collinear (sharea = 100 − shareb). Therefore they cannot both be used as independent variables in the regression. Stata will automatically drop one => the first two sets of estimates are the same.
ShareA as the only explanatory variable and shareB as the only explanatory variable yield the same results: increasing the share of expenditures of B by one percentage point (= a one percentage point decrease in the share of A) is predicted to decrease the share of votes for A by .46 percentage points, ceteris paribus.
27 Omitted variable bias
Let y = β0 + β1x1 + β2x2 + u be the true model, where all 4 assumptions are verified. When estimated, it gives: ŷ = β̂0 + β̂1x1 + β̂2x2
We want the effect of x1 on y. What happens if we regress y on x1 only? The estimated (underspecified) model then is: ỹ = β̃0 + β̃1x1
β̃1 is biased for β1: E(β̃1) = β1 + omitted variable bias, where the bias equals β2·δ̃1 and δ̃1 is the slope from regressing x2 on x1.
28 About the omitted variable bias:
Two cases when β̃1 is not biased:
- When β2 = 0, so that x2 does not appear in the true model.
- When δ̃1 = 0, i.e. if and only if x1 and x2 are uncorrelated in the sample.
Direction of the omitted variable bias (2-variable case):
              Corr(x1,x2) > 0    Corr(x1,x2) < 0
β2 > 0        Positive bias      Negative bias
β2 < 0        Negative bias      Positive bias
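The bias formula β̃1 = β1 + β2·δ̃1 can be verified numerically; a numpy sketch with toy data and no error term, so the identity holds exactly in-sample:

```python
import numpy as np

# True model y = beta0 + beta1*x1 + beta2*x2 (no error term).
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x1 = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
x2 = np.array([1., 3., 2., 5., 4., 7., 6., 9.])  # positively correlated with x1
y = beta0 + beta1 * x1 + beta2 * x2

# Short regression of y on x1 alone (x2 omitted):
Xs = np.column_stack([np.ones(8), x1])
b1_tilde = np.linalg.lstsq(Xs, y, rcond=None)[0][1]

# Auxiliary regression of the omitted x2 on x1 gives delta1:
delta1 = np.linalg.lstsq(Xs, x2, rcond=None)[0][1]

# Bias = beta2 * delta1; here Corr(x1,x2) > 0 and beta2 > 0 => positive bias.
print(b1_tilde, beta1 + beta2 * delta1)
```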
29 Example 3: Impact of IQ on the relationship between wage and education
Source: WAGE2.dta, Wooldridge (data used in M. Blackburn and D. Neumark (1992), Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials, Quarterly Journal of Economics 107).
. use WAGE2.dta
. su wage IQ educ
[summary statistics: Obs, Mean, Std. Dev., Min, Max for wage, IQ, educ]
30 . reg wage educ
[regression output: Number of obs = 935, F(1, 933), coefficient table for educ and _cons]
. reg wage educ IQ
[regression output: Number of obs = 935, F(2, 932), coefficient table for educ, IQ and _cons]
. corr educ IQ
(obs=935)
[correlation matrix for educ and IQ]
31 Interpretation: Intellectual ability is likely to affect both people's wage and their education. Therefore a simple regression of wage on education is likely to be biased, as intellectual ability will be included in the error term, resulting in a violation of the zero conditional mean assumption. If we use IQ (as a proxy for intellectual ability) we correct for this bias.
Given that IQ and education are positively correlated, and IQ and wage are also positively correlated, we suspect the coefficient in the bivariate model to be positively biased (i.e. overestimated). This is confirmed when we run the regression including IQ: the coefficient on education drops from 60 to 42. An increase in the IQ score of 1 is predicted to increase wage by $5 per month, ceteris paribus. Given that the correlation between education and IQ, and between IQ and wage, is strong, the bias in the bivariate model was large.
32 Omitted variable bias: multiple case
What happens with multiple regressors? Correlation between a single explanatory variable and the error generally results in all OLS estimators being biased. If the focus is on a particular explanatory variable, say x1, deriving the sign of the bias from the relationship between x1 and the key omitted factor alone, ignoring all other explanatory variables, is strictly valid only when the other regressors are uncorrelated with x1; still, it is a useful guide.
33 Including irrelevant variables
Overspecifying the model: one (or more) of the independent variables is included in the model even though it has no partial effect on y in the population (that is, its population coefficient is zero).
No bias (when the 4 assumptions hold), but not harmless: it has undesirable effects on the variances of the OLS estimators.
34 Back to the Broad Picture
We are interested in understanding the effect of a variable x on a variable y. We need a coefficient estimate, and we need to know its sign and magnitude. We also need to know how precise this estimate is => we need to find its variance.
4 assumptions give us unbiasedness of the coefficient estimates. We need one more assumption to obtain an unbiased estimate of the variance of the coefficient estimates, and to show that OLS is efficient.
35 5 Gauss-Markov assumptions: 4+1
- Linearity
- Random sampling
- Zero conditional mean
- No perfect collinearity
- Homoskedasticity: the variance of the error term, conditional on the explanatory variables, is constant: Var(u | x1, x2, ..., xk) = σ²
Under these conditions, the OLS estimate of the error variance is unbiased: E(σ̂²) = σ². We can derive a formula for the sampling variance of the OLS coefficients, and OLS is efficient (i.e. its variance is the smallest variance possible).
36 Sampling variance of the OLS coefficients
Under Assumptions 1 through 5, conditional on the sample values of the independent variables:
Var(β̂j) = σ² / (SSTj · (1 − Rj²)), for j = 1, 2, ..., k
where SSTj = Σi=1..n (xij − x̄j)² is the total variation in xj, and Rj² is the R-squared from regressing xj on all other independent variables.
Why should we care about its size?
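A numerical check of this formula (toy data): computed via SSTj and Rj², the variance of the slope on x1 matches the usual matrix expression σ²·[(X'X)⁻¹]jj.

```python
import numpy as np

sigma2 = 1.0   # assume a known error variance for the check
x1 = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
x2 = np.array([1., 3., 2., 5., 4., 7., 6., 9.])
X = np.column_stack([np.ones(8), x1, x2])

# R_1^2: R-squared from regressing x1 on the other regressors (constant and x2).
Z = np.column_stack([np.ones(8), x2])
g = np.linalg.lstsq(Z, x1, rcond=None)[0]
resid = x1 - Z @ g
R1_sq = 1 - (resid @ resid) / np.sum((x1 - x1.mean()) ** 2)

SST1 = np.sum((x1 - x1.mean()) ** 2)
var_formula = sigma2 / (SST1 * (1 - R1_sq))          # slide formula
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]   # matrix formula
print(var_formula, var_matrix)                       # the two agree
```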
37 Unbiased estimator of σ²
We need an unbiased estimator of σ² to get an unbiased estimator of Var(β̂j).
σ² = E(u²) => a logical estimator would be (1/n)·Σ ui². Problem: the errors are not observable! But the residuals are. => use (1/n)·Σ ûi²?
An unbiased estimator of σ² is: σ̂² = (Σi=1..n ûi²) / (n − k − 1) = SSR / (n − k − 1)
Why n − k − 1? Degrees of freedom = number of observations − number of estimated parameters = n − (k+1).
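Unbiasedness of σ̂² can be illustrated with a small Monte Carlo sketch (simulated data with a known σ² = 1, not one of the slide datasets): averaging SSR/(n − k − 1) over many samples lands close to the truth, while dividing by n would be biased downward.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 2
x1 = np.linspace(0, 1, n)
x2 = np.linspace(0, 1, n) ** 2
X = np.column_stack([np.ones(n), x1, x2])

draws = []
for _ in range(2000):
    # True model with standard normal errors, so sigma^2 = 1.
    y = 1 + 2 * x1 - x2 + rng.standard_normal(n)
    u_hat = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    draws.append((u_hat @ u_hat) / (n - k - 1))   # SSR / (n - k - 1)

sigma2_hat_avg = np.mean(draws)
print(sigma2_hat_avg)   # close to 1
```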
38 The estimate of βj is more precise when:
- σ² is lower: more noise in the equation (a larger σ²) makes it more difficult to estimate the partial effect of any of the xj on y. To reduce it, add relevant explanatory variables.
- The total variation in xj, SSTj, is larger: to increase it, increase the sample size. SSTj = 0 is ruled out by assumption 4.
- There is less correlation between the xj, i.e. Rj² is smaller. Two extreme cases: Rj² = 0 gives the smallest variance; Rj² = 1 is perfect collinearity (ruled out by assumption 4).
39 What if Rj² is close to 1? This is called multicollinearity. It does not violate assumption 4, but it is still a problem, as the variance of the estimator increases. How to reduce multicollinearity? Dropping a variable? How big an issue multicollinearity is depends on which variable is your focus.
40 Example of Multicollinearity: Relationship between education and family background
Source: WAGE2.dta, Wooldridge (data used in M. Blackburn and D. Neumark (1992), Unobserved Ability, Efficiency Wages, and Interindustry Wage Differentials, Quarterly Journal of Economics 107).
. use WAGE2.dta
. su educ sibs meduc feduc
[summary statistics: Obs, Mean, Std. Dev., Min, Max for educ, sibs, meduc, feduc]
In order to predict educational attainment, should we include all of these variables?
41 Omitted variable bias vs multicollinearity
What happens if we omit father's education?
. reg educ sibs meduc
[regression output: Number of obs = 857, F(2, 854), coefficient table for sibs, meduc and _cons]
These estimates are biased if feduc affects educ and is correlated with meduc and/or sibs.
. corr feduc meduc sibs
(obs=722)
[correlation matrix for feduc, meduc, sibs]
. corr educ feduc
(obs=741)
[correlation matrix for educ and feduc]
Corr(feduc, meduc) is high => there is also a problem of multicollinearity if both are in the model.
42 Should we still include the omitted variable?
. reg educ sibs meduc feduc
[regression output: Number of obs = 722, F(3, 718), coefficient table for sibs, meduc, feduc and _cons]
Because of the high correlation between meduc and feduc, the standard error of the coefficient on meduc increases substantially. But given that multicollinearity is not a violation of any assumption, while omitting feduc causes bias, we prefer the second estimation (with feduc) over the first.
43 Try to redefine the research question: create a third variable to sum up the information contained in the two variables meduc and feduc.
. gen avpareduc = (feduc + meduc)/2
(213 missing values generated)
. reg educ sibs avpareduc
[regression output: Number of obs = 722, F(2, 719), coefficient table for sibs, avpareduc and _cons]
Note: if x1 is uncorrelated with x2 and x3, and x1 is the variable of interest, then we do not really care whether x2 and x3 are correlated. So include x3: it will make a better case for causality, and only the variances of the estimators of the coefficients on x2 and x3 will increase.
44 Misspecification
Let y = β0 + β1x1 + β2x2 + u be the true model, where all Gauss-Markov assumptions hold. We consider two estimators of β1:
- β̂1 from ŷ = β̂0 + β̂1x1 + β̂2x2, and
- β̃1 from the estimated (underspecified) model ỹ = β̃0 + β̃1x1
Which one is the best? If bias is the criterion: β̂1 will be better. If variance is the criterion? Var(β̃1) ≤ Var(β̂1).
45 Trade-off: variance vs bias
Var(β̃1) ≤ Var(β̂1), with equality if x1 and x2 are uncorrelated in the sample. If not:
- When β2 ≠ 0: β̃1 is biased, β̂1 is not, and Var(β̃1) < Var(β̂1).
- When β2 = 0: β̃1 and β̂1 are both unbiased, and Var(β̃1) < Var(β̂1).
Why should we nevertheless prefer β̂1?
- The variances decrease when n increases.
- When we omit x2 and β2 ≠ 0, the variance of β̃1 is bigger than it seems, because x2 is in the error term, so σ² is bigger.
46 Example of misspecification
. regress educ sibs meduc feduc brthord
[regression output: Number of obs = 663, F(4, 658), coefficient table for sibs, meduc, feduc, brthord and _cons]
. regress educ sibs meduc feduc if brthord != .
[regression output: Number of obs = 663, F(3, 659), coefficient table for sibs, meduc, feduc and _cons]
The first regression suggests that birth order has no significant effect on education. There has to be, however, a correlation between the number of siblings and the birth order, which causes some multicollinearity. As a result the standard error of the coefficient on sibs in the first model is larger than in the second model.
47 Efficiency of OLS: the Gauss-Markov theorem
Under the first four assumptions, the OLS estimators are unbiased. But maybe there are other estimators with smaller variances?
GM theorem: If assumptions 1 to 5 are satisfied, OLS gives us the Best Linear Unbiased Estimators (BLUE).
- Unbiased: assumptions 1 to 4: E(β̂j) = βj, j = 1, ..., k
- Best: smallest variances => most precise.
- Linear: β̂j can be written as a linear combination of the yi.
48 Stata do-file:
findit bcuse
bcuse vote1
d
su expend*
ge sharea2=(expenda/(expenda+expendb))*100
ge shareb=(expendb/(expenda+expendb))*100
list share*
su share*
ge a=sharea+shareb
list sharea shareb a
rename a sumshare
reg votea sharea shareb
reg votea sharea2 shareb
su votea
reg votea shareb
clear
bcuse wage2
su hours
reg wage educ
su wage
reg wage educ IQ
reg wage educ exper
corr educ exper
corr educ IQ
corr educ exper wage
corr IQ wage
More informationRegression #8: Loose Ends
Regression #8: Loose Ends Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #8 1 / 30 In this lecture we investigate a variety of topics that you are probably familiar with, but need to touch
More informationProblem Set 10: Panel Data
Problem Set 10: Panel Data 1. Read in the data set, e11panel1.dta from the course website. This contains data on a sample or 1252 men and women who were asked about their hourly wage in two years, 2005
More informationHomoskedasticity. Var (u X) = σ 2. (23)
Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This
More informationAnswer all questions from part I. Answer two question from part II.a, and one question from part II.b.
B203: Quantitative Methods Answer all questions from part I. Answer two question from part II.a, and one question from part II.b. Part I: Compulsory Questions. Answer all questions. Each question carries
More informationNonlinear Regression Functions
Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.
More informationSpecification Error: Omitted and Extraneous Variables
Specification Error: Omitted and Extraneous Variables Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 5, 05 Omitted variable bias. Suppose that the correct
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares
Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit
More informationSimultaneous Equations with Error Components. Mike Bronner Marko Ledic Anja Breitwieser
Simultaneous Equations with Error Components Mike Bronner Marko Ledic Anja Breitwieser PRESENTATION OUTLINE Part I: - Simultaneous equation models: overview - Empirical example Part II: - Hausman and Taylor
More informationChapter 6: Linear Regression With Multiple Regressors
Chapter 6: Linear Regression With Multiple Regressors 1-1 Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution
More informationsociology 362 regression
sociology 36 regression Regression is a means of studying how the conditional distribution of a response variable (say, Y) varies for different values of one or more independent explanatory variables (say,
More informationProblem Set #5-Key Sonoma State University Dr. Cuellar Economics 317- Introduction to Econometrics
Problem Set #5-Key Sonoma State University Dr. Cuellar Economics 317- Introduction to Econometrics C1.1 Use the data set Wage1.dta to answer the following questions. Estimate regression equation wage =
More informationsociology 362 regression
sociology 36 regression Regression is a means of modeling how the conditional distribution of a response variable (say, Y) varies for different values of one or more independent explanatory variables (say,
More informationWeek 3: Simple Linear Regression
Week 3: Simple Linear Regression Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED 1 Outline
More informationApplied Statistics and Econometrics
Applied Statistics and Econometrics Lecture 5 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 44 Outline of Lecture 5 Now that we know the sampling distribution
More informationMultivariate Regression: Part I
Topic 1 Multivariate Regression: Part I ARE/ECN 240 A Graduate Econometrics Professor: Òscar Jordà Outline of this topic Statement of the objective: we want to explain the behavior of one variable as a
More informationCHAPTER 6: SPECIFICATION VARIABLES
Recall, we had the following six assumptions required for the Gauss-Markov Theorem: 1. The regression model is linear, correctly specified, and has an additive error term. 2. The error term has a zero
More informationRegression with a Single Regressor: Hypothesis Tests and Confidence Intervals
Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression
More informationProblem Set 1 ANSWERS
Economics 20 Prof. Patricia M. Anderson Problem Set 1 ANSWERS Part I. Multiple Choice Problems 1. If X and Z are two random variables, then E[X-Z] is d. E[X] E[Z] This is just a simple application of one
More information5. Let W follow a normal distribution with mean of μ and the variance of 1. Then, the pdf of W is
Practice Final Exam Last Name:, First Name:. Please write LEGIBLY. Answer all questions on this exam in the space provided (you may use the back of any page if you need more space). Show all work but do
More informationChapter 2: simple regression model
Chapter 2: simple regression model Goal: understand how to estimate and more importantly interpret the simple regression Reading: chapter 2 of the textbook Advice: this chapter is foundation of econometrics.
More informationECON Introductory Econometrics. Lecture 5: OLS with One Regressor: Hypothesis Tests
ECON4150 - Introductory Econometrics Lecture 5: OLS with One Regressor: Hypothesis Tests Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 5 Lecture outline 2 Testing Hypotheses about one
More informationEconomics 113. Simple Regression Assumptions. Simple Regression Derivation. Changing Units of Measurement. Nonlinear effects
Economics 113 Simple Regression Models Simple Regression Assumptions Simple Regression Derivation Changing Units of Measurement Nonlinear effects OLS and unbiased estimates Variance of the OLS estimates
More informationIntroduction to Econometrics. Multiple Regression (2016/2017)
Introduction to Econometrics STAT-S-301 Multiple Regression (016/017) Lecturer: Yves Dominicy Teaching Assistant: Elise Petit 1 OLS estimate of the TS/STR relation: OLS estimate of the Test Score/STR relation:
More informationEconometrics Midterm Examination Answers
Econometrics Midterm Examination Answers March 4, 204. Question (35 points) Answer the following short questions. (i) De ne what is an unbiased estimator. Show that X is an unbiased estimator for E(X i
More informationEconomics 326 Methods of Empirical Research in Economics. Lecture 14: Hypothesis testing in the multiple regression model, Part 2
Economics 326 Methods of Empirical Research in Economics Lecture 14: Hypothesis testing in the multiple regression model, Part 2 Vadim Marmer University of British Columbia May 5, 2010 Multiple restrictions
More informationMultiple Regression. Midterm results: AVG = 26.5 (88%) A = 27+ B = C =
Economics 130 Lecture 6 Midterm Review Next Steps for the Class Multiple Regression Review & Issues Model Specification Issues Launching the Projects!!!!! Midterm results: AVG = 26.5 (88%) A = 27+ B =
More informationProblem Set 4 ANSWERS
Economics 20 Problem Set 4 ANSWERS Prof. Patricia M. Anderson 1. Suppose that our variable for consumption is measured with error, so cons = consumption + e 0, where e 0 is uncorrelated with inc, educ
More informationSection Least Squares Regression
Section 2.3 - Least Squares Regression Statistics 104 Autumn 2004 Copyright c 2004 by Mark E. Irwin Regression Correlation gives us a strength of a linear relationship is, but it doesn t tell us what it
More informationGov 2000: 9. Regression with Two Independent Variables
Gov 2000: 9. Regression with Two Independent Variables Matthew Blackwell Fall 2016 1 / 62 1. Why Add Variables to a Regression? 2. Adding a Binary Covariate 3. Adding a Continuous Covariate 4. OLS Mechanics
More informationECON Introductory Econometrics. Lecture 7: OLS with Multiple Regressors Hypotheses tests
ECON4150 - Introductory Econometrics Lecture 7: OLS with Multiple Regressors Hypotheses tests Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 7 Lecture outline 2 Hypothesis test for single
More information1 The basics of panel data
Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Related materials: Steven Buck Notes to accompany fixed effects material 4-16-14 ˆ Wooldridge 5e, Ch. 1.3: The Structure of Economic Data ˆ Wooldridge
More informationMultiple Linear Regression
Multiple Linear Regression Asymptotics Asymptotics Multiple Linear Regression: Assumptions Assumption MLR. (Linearity in parameters) Assumption MLR. (Random Sampling from the population) We have a random
More informationEconometrics Homework 1
Econometrics Homework Due Date: March, 24. by This problem set includes questions for Lecture -4 covered before midterm exam. Question Let z be a random column vector of size 3 : z = @ (a) Write out z
More informationCourse Econometrics I
Course Econometrics I 4. Heteroskedasticity Martin Halla Johannes Kepler University of Linz Department of Economics Last update: May 6, 2014 Martin Halla CS Econometrics I 4 1/31 Our agenda for today Consequences
More informationEconometrics. 8) Instrumental variables
30C00200 Econometrics 8) Instrumental variables Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Thery of IV regression Overidentification Two-stage least squates
More informationProblem 4.1. Problem 4.3
BOSTON COLLEGE Department of Economics EC 228 01 Econometric Methods Fall 2008, Prof. Baum, Ms. Phillips (tutor), Mr. Dmitriev (grader) Problem Set 3 Due at classtime, Thursday 14 Oct 2008 Problem 4.1
More informationECON Introductory Econometrics. Lecture 13: Internal and external validity
ECON4150 - Introductory Econometrics Lecture 13: Internal and external validity Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 9 Lecture outline 2 Definitions of internal and external
More information(a) Briefly discuss the advantage of using panel data in this situation rather than pure crosssections
Answer Key Fixed Effect and First Difference Models 1. See discussion in class.. David Neumark and William Wascher published a study in 199 of the effect of minimum wages on teenage employment using a
More informationSoc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis
Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Problem 1. The files
More informationECON Introductory Econometrics. Lecture 17: Experiments
ECON4150 - Introductory Econometrics Lecture 17: Experiments Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 13 Lecture outline 2 Why study experiments? The potential outcome framework.
More informationIntroduction to Econometrics. Multiple Regression
Introduction to Econometrics The statistical analysis of economic (and related) data STATS301 Multiple Regression Titulaire: Christopher Bruffaerts Assistant: Lorenzo Ricci 1 OLS estimate of the TS/STR
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model
Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory
More informationLecture 14. More on using dummy variables (deal with seasonality)
Lecture 14. More on using dummy variables (deal with seasonality) More things to worry about: measurement error in variables (can lead to bias in OLS (endogeneity) ) Have seen that dummy variables are
More informationHeteroskedasticity. (In practice this means the spread of observations around any given value of X will not now be constant)
Heteroskedasticity Occurs when the Gauss Markov assumption that the residual variance is constant across all observations in the data set so that E(u 2 i /X i ) σ 2 i (In practice this means the spread
More informationQuantitative Methods Final Exam (2017/1)
Quantitative Methods Final Exam (2017/1) 1. Please write down your name and student ID number. 2. Calculator is allowed during the exam, but DO NOT use a smartphone. 3. List your answers (together with
More informationLecture 7: OLS with qualitative information
Lecture 7: OLS with qualitative information Dummy variables Dummy variable: an indicator that says whether a particular observation is in a category or not Like a light switch: on or off Most useful values:
More informationThe Simple Linear Regression Model
The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate
More informationPractice exam questions
Practice exam questions Nathaniel Higgins nhiggins@jhu.edu, nhiggins@ers.usda.gov 1. The following question is based on the model y = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 + u. Discuss the following two hypotheses.
More informationECON Introductory Econometrics. Lecture 16: Instrumental variables
ECON4150 - Introductory Econometrics Lecture 16: Instrumental variables Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 12 Lecture outline 2 OLS assumptions and when they are violated Instrumental
More informationMultiple Regression Analysis: Heteroskedasticity
Multiple Regression Analysis: Heteroskedasticity y = β 0 + β 1 x 1 + β x +... β k x k + u Read chapter 8. EE45 -Chaiyuth Punyasavatsut 1 topics 8.1 Heteroskedasticity and OLS 8. Robust estimation 8.3 Testing
More informationEconometrics II Censoring & Truncation. May 5, 2011
Econometrics II Censoring & Truncation Måns Söderbom May 5, 2011 1 Censored and Truncated Models Recall that a corner solution is an actual economic outcome, e.g. zero expenditure on health by a household
More informationMaking sense of Econometrics: Basics
Making sense of Econometrics: Basics Lecture 4: Qualitative influences and Heteroskedasticity Egypt Scholars Economic Society November 1, 2014 Assignment & feedback enter classroom at http://b.socrative.com/login/student/
More informationEconometrics Review questions for exam
Econometrics Review questions for exam Nathaniel Higgins nhiggins@jhu.edu, 1. Suppose you have a model: y = β 0 x 1 + u You propose the model above and then estimate the model using OLS to obtain: ŷ =
More information1. The shoe size of five randomly selected men in the class is 7, 7.5, 6, 6.5 the shoe size of 4 randomly selected women is 6, 5.
Economics 3 Introduction to Econometrics Winter 2004 Professor Dobkin Name Final Exam (Sample) You must answer all the questions. The exam is closed book and closed notes you may use calculators. You must
More information