Ordinary Least Squares (OLS): Multiple Linear Regression (MLR) Analytics. What's New? Not Much!
Contents:
OLS: Comparison of SLR and MLR Analytics
1. Interpreting Coefficients I (SRF): Marginal effects ceteris paribus
2. The Collinearity Regression
3. Multicollinearity, R²j and VIFs
4. Omitted Variable Bias/Impact (Endogeneity)
5. Interpreting Coefficients II (SLR): Regressing y on What's New about x

OLS: Comparison of SLR and MLR Analytics

Data Generation Model
  SLR.1 (Linear Model): y_i = β0 + β1 x_i + u_i
  MLR.1 (Linear Model): y_i = β0 + βx x_i + βz z_i + u_i, or
                        y_i = β0 + βx x_i + βz z_i + βw w_i + u_i, etc.

Residuals/Unexplained
  SLR: û_i = y_i − (β̂0 + β̂1 x_i)
  MLR: û_i = y_i − (β̂0 + β̂x x_i + β̂z z_i), etc.

OLS
  SLR: Min SSRs
  MLR: Min SSRs

Estimates
  Intercept:  SLR: β̂0 = ȳ − β̂1 x̄
              MLR: β̂0 = ȳ − β̂x x̄ − β̂z z̄
  Slopes:     SLR: β̂1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
              MLR: complicated

SRF (Sample Regression Function)
  SLR: ŷ = β̂0 + β̂1 x
  MLR: ŷ_i = β̂0 + β̂x x_i + β̂z z_i, etc.

Controlling for the impact of...?
  SLR: no other RHS variables
  MLR: all other RHS variables
Estimated Impact
  SLR (from changing the one RHS variable): dŷ = β̂1 dx, or Δŷ = β̂1 Δx
  MLR (from changing one RHS variable, ceteris paribus): Δŷ = β̂x Δx
  MLR (from changing several RHS variables): Δŷ = β̂x Δx + β̂z Δz, etc.

At the means
  SLR: ȳ = β̂0 + β̂1 x̄
  MLR: ȳ = β̂0 + β̂x x̄ + β̂z z̄

Elasticities (at the means)
  SLR: (dŷ/dx)(x̄/ȳ) = β̂1 (x̄/ȳ)
  MLR: (Δŷ/Δx)(x̄/ȳ) = β̂x (x̄/ȳ)

So What's New?... not much, really!

1. Interpreting Coefficients I (SRF): Marginal effects ceteris paribus

Consider the following MLR model with three RHS variables:

DGM 1: y_i = β0 + βx x_i + βz z_i + βw w_i + v_i

You estimate the unknown parameter values using ordinary least squares, which gives you the following Sample Regression Function:

SRF 1: ŷ = β̂0 + β̂x x + β̂z z + β̂w w

We use SRFs to predict y values as a function of the values of the RHS variables. Since ∂ŷ/∂x = β̂x, the estimated coefficients in the SRF tell us the relationship between changes in RHS variables (holding everything else fixed) and changes in the predicted values, the ŷ's.

The coefficients also tell us something about the impacts of discrete changes in the RHS variables. Consider a discrete change in x, from, say, x1 to x2. If the values of the other RHS variables are held fixed, we have two predicted values:

ŷ1 = β̂0 + β̂x x1 + β̂z z + β̂w w  and  ŷ2 = β̂0 + β̂x x2 + β̂z z + β̂w w.

And the change in the predicted values will be:

Δŷ = ŷ2 − ŷ1 = β̂x x2 − β̂x x1 = β̂x (x2 − x1) = β̂x Δx
And if we have x changing by Δx, z changing by Δz, and w fixed, the change in the predicted value of y will be:

Δŷ = β̂x Δx + β̂z Δz

So one interpretation of the MLR coefficients: they capture average marginal impacts on the predicted y values generated by the SRF.

Example: See the handout on European football and corner kicks.

2. The Collinearity Regression

The collinearity regression features prominently in 1) the R²j measure of multicollinearity, 2) Omitted Variable Bias/Impact, and 3) the second interpretation of coefficients in MLR models (see below). Consider the MLR model above, with three RHS variables, x, z, and w, and dependent variable y:

DGM 1: y_i = β0 + βx x_i + βz z_i + βw w_i + v_i

The collinearity regression looks solely at the RHS (explanatory) variables in the regression: one of the RHS variables is regressed on the remaining RHS variables. So focusing on the RHS variable w, we have:

DGMw: w_i = α0 + αx x_i + αz z_i + u_i, and
SRFw: ŵ = α̂0 + α̂x x + α̂z z  (residual: û_i = w_i − ŵ_i)

By construction, the SRF's predicted values, ŵ = α̂0 + α̂x x + α̂z z, are the part of the w's explained by the other two explanatory variables (the x's and z's), because the predicted ŵ's are a linear function of the x's and z's. The residuals, the û_i's, are What's New about the w's: the part of the w's not explained by the other explanatory variables in the model.

If the residuals are all zero, then the x's and z's perfectly predict the w's, and so the w variable provides no new/additional explanatory power. In this case we say that the w's are perfectly collinear with the x's and z's. And if the residuals are sizable, then the w's are not so collinear with the x's and z's, and accordingly may provide some new and useful explanatory power in predicting the y values.
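The mechanics can be sketched numerically. Here is a minimal Python example on simulated data (not the bodyfat or football data; all variable names are illustrative): it runs the collinearity regression of w on x and z, splits w into the explained part ŵ and the What's-New residuals û, and checks that the residuals carry no linear information about the included regressors.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)
w = 0.5 * x - 0.3 * z + rng.normal(scale=0.8, size=n)  # w partly explained by x and z

# Collinearity regression: w on x and z (with an intercept)
X = np.column_stack([np.ones(n), x, z])
alpha, *_ = np.linalg.lstsq(X, w, rcond=None)

w_hat = X @ alpha    # the part of the w's explained by the x's and z's
u_hat = w - w_hat    # "What's New" about the w's

# The What's-New residuals are uncorrelated with the included regressors
print(np.corrcoef(u_hat, x)[0, 1], np.corrcoef(u_hat, z)[0, 1])  # both ~ 0
```

If û were (nearly) all zeros, w would be (nearly) perfectly collinear with x and z and would add no new explanatory power.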
Here are some collinearity regressions from a model that evaluates the relationship between the Brozek measure of body fat and hgt, wgt and abd (waist size):

. bcuse bodyfat
. reg Brozek hgt wgt abd
. eststo
. reg hgt wgt abd
. eststo
. reg wgt hgt abd
. eststo
. reg abd hgt wgt
. eststo
. esttab, r2 scalar(rmse)

------------------------------------------------------------
                 (1)        (2)        (3)        (4)
              Brozek        hgt        wgt        abd
------------------------------------------------------------
hgt            -.012                             -.0605***
             (-1.43)                  (9.18)     (-7.41)
wgt             -.12***    .136***                .349***
             (-5.41)      (9.18)                 (34.31)
abd              .88***   -.99***     .365***
             (15.19)     (-7.41)     (34.31)
_cons          -3.66***   73.51***   -17.6***     7.53***
             (-5.1)      (39.7)     (-11.3)      (13.3)
------------------------------------------------------------
N
R-sq
rmse

Model (1) is the original regression; Models (2)-(4) are the collinearity regressions.

3. Multicollinearity: R-squared (R²j)

Recall that the sample correlation, ρ̂xy = Sxy/(Sx Sy), captures the extent to which there is a linear relationship between two variables, x and y. By definition, the correlation concept can be applied only to pairs of explanatory variables. But what happens when you want to evaluate the extent to which larger groups of variables are moving together (in a linear fashion)?

In the SLR analysis, we found that R² is also correlation squared: ρ̂²xy = R². And so in that analysis, we could just as easily have used R² to measure the extent to which the two variables moved together in a linear fashion.[1] While the concept of correlation does not extend to sets of more than two variables, R² does. And so we will use the R² of the collinearity regression to measure multicollinearity, the extent to which sets of variables are moving together in a linear fashion.[2]

[1] Note that −1 ≤ ρ̂ ≤ 1 and 0 ≤ R² ≤ 1.
[2] Note that the squared correlation of the predicted y's, the ŷ's, with the y's does extend to MLR models: ρ̂²(ŷ,y) = R², so that's another reason to use R² as a measure of collinearity in MLR models.

In the example above, wgt is the most collinear explanatory variable, since the R² in Model (3) is .842 (which tells us that 84.2% of the variation in the wgt variable can be explained with a linear
function of the other two explanatory variables, hgt and abd). abd, in Model (4), is almost as collinear, with R² = .827. And hgt is the least collinear, with R² = .592.

The R-squared in the collinearity regression is often called R²j, since it is said to be associated with some explanatory variable xj. If it's near 1, then most of the variation in, say, the w's can be explained by the other explanatory variables (the x's and z's). And so in that sense the w's don't bring much new to the model (or offer much new independent explanatory power). In this case we say that w is highly collinear with the other RHS variables. But if R²j is small, then the w's are not so collinear with the other explanatory variables, and in that sense they (the w's) bring a lot of new explanatory power to the RHS of the model.

When we get to inference and precision of estimation, we will see that the presence of multicollinearity will lead to higher standard errors, whatever those are. But the real problem with it is that it can lead to wacky estimated coefficients because of the ceteris paribus condition (in some cases it doesn't make sense to hold everything else fixed while changing just one explanatory variable). (See the European football handout.) So if you have strange estimated coefficients, see if multicollinearity is driving those estimates.

And as for fixes? You can always get more data. But you might also just try re-estimating the model, individually dropping highly collinear RHS variables to see what happens. We'll work through some examples in class.

Another way to generate the R²j's: VIFs (Variance Inflation Factors)

We will cover Variance Inflation Factors (VIFs) in detail when we get to inference, but for now, I just note that they provide an easy way to generate the R²j's, and to get a sense of the degree of multicollinearity in a MLR model. The relationship between VIFs and R²j's:

VIFj = 1/(1 − R²j), or R²j = 1 − 1/VIFj.

So if you know one, you know the other.
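The algebra linking VIFs and R²j's can be checked directly. A minimal Python sketch on simulated data (variable names are illustrative, not from the bodyfat data): compute R²j from a hand-rolled collinearity regression, convert it to a VIF, and convert back.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
z = 0.9 * x + rng.normal(scale=0.5, size=n)  # z is highly collinear with x
w = rng.normal(size=n)                       # w is not collinear with the others

def r2_collinearity(target, others):
    """R^2 from regressing one RHS variable on the remaining RHS variables."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    return 1 - resid.var() / target.var()

r2_z = r2_collinearity(z, [x, w])  # R^2_j for z
vif_z = 1 / (1 - r2_z)             # VIF_j = 1/(1 - R^2_j)

print(r2_z)              # large: z moves with x
print(vif_z)             # correspondingly large
print(1 - 1 / vif_z)     # recovers R^2_j exactly
```

Running `r2_collinearity(w, [x, z])` instead gives a near-zero R²j and a VIF near 1, since w was generated independently of the other regressors.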
Note that the VIFs and R²j's move in the same direction, so larger VIFs are associated with larger R²j's. If you just run the vif command immediately after estimating your MLR model, you'll get the VIFs, and by association, the R²j's. Here's an example using the European Football data:

. reg ptsdiff sdiff stdiff cdiff fdiff rdiff
  (output omitted; Adj R-squared = .817)

. vif

  (output omitted)

Check: Run the collinearity regression

. reg sdiff stdiff cdiff fdiff rdiff

  (output omitted; Adj R-squared = .665)

In this example, shots (sdiff) and shots on target (stdiff) are the most collinear of the RHS variables. No surprise, given that their collinearity is largely driven by their high degree of correlation with one another.[3] The R²j's for the other RHS variables are all fairly small.

[3] Test your knowledge: Show that R²j for shots must be greater than or equal to the square of the correlation between shots and shots on target.
4. Omitted Variable Bias/Impact (Endogeneity)

Estimated coefficients will be biased (or, less pejoratively, impacted) to the extent that those variables are correlated with omitted variables which are themselves correlated with the dependent variable. This is not so much a bias as a misinterpretation: the estimated coefficients reflect the incremental average relationship between changes in the particular variable and changes in the LHS variable, controlling for all the other variables in the model. But of course, the omitted variable is not in the model.

Fixes? If you can't insert the omitted variable into the model, maybe you can include a proxy variable (which might be highly correlated with the omitted variable). And if you can't do that, you might be able to at least sign the bias, and determine whether the estimated model over- or under-estimates the true parameter value(s) (relative to a model in which the omitted variable is included in the analysis). (More about fixes below.)

Example: Dropping RHS variable z from the model

Consider a simple model with two explanatory variables, x and z. Using OLS to estimate the parameter values with the full model, you get:

SRF: ŷ = β̂0 + β̂x x + β̂z z.

Now suppose that z is dropped/omitted and the estimated model is instead:

SRF: ŷ = γ̂0 + γ̂x x.

The omitted variable bias/impact will be the change in the estimated x coefficient when z is dropped from the model: γ̂x − β̂x. We can derive this using the collinearity regression in which the omitted variable z is regressed on the included variable x. If the SRF from that regression is

SRFz: ẑ = α̂0 + α̂x x,

then the omitted variable bias is just β̂z α̂x, the product of:

  β̂z, the estimated coefficient for the omitted variable (when it's in the full model), and
  α̂x, the respective estimated coefficient when the omitted variable is regressed on the other RHS variable(s) in the model.

Here's an example using the bodyfat dataset.
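Before the Stata example, the formula γ̂x − β̂x = β̂z α̂x can be verified numerically. A minimal Python sketch on simulated data (not the bodyfat data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)                 # z is correlated with x
y = 1.0 + 2.0 * x - 1.5 * z + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients (intercept first) from regressing y on cols."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

_, beta_x, beta_z = ols(y, x, z)   # full model
_, gamma_x = ols(y, x)             # z omitted
_, alpha_x = ols(z, x)             # collinearity regression: z on x

print(gamma_x - beta_x)   # omitted variable bias/impact
print(beta_z * alpha_x)   # the same number, via the formula
```

The two printed numbers agree to machine precision: the identity holds exactly in any sample, not just on average.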
------------------------------------------------------------
                 (1)        (2)        (3)
              Brozek     Brozek        hgt
------------------------------------------------------------
wgt             .187***    .162***    .384***
             (14.48)      (1.7)      (5.1)
hgt            -.065***
             (-6.9)
_cons          31.16***              63.7***
              (4.51)     (-4.18)    (46.54)
------------------------------------------------------------
N
R-sq
rmse

t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
In Model (1), Brozek has been regressed on wgt and hgt. In Model (2), hgt has been dropped from the model, and the wgt coefficient drops by .025, from .187 to .162. Model (3) gives the results of the collinearity regression in which hgt, the omitted variable in Model (2), is regressed on wgt, the RHS variable in Model (2).

And so the Omitted Variable Bias associated with excluding hgt from the original model is the product of the wgt coefficient in Model (3), .384, and the hgt coefficient in Model (1), −.065:

(.384) × (−.065) = −.025

which is exactly the amount by which the wgt coefficient dropped when hgt was dropped from the model.

Qualitative Assessment

Return to the case of dropping z from the model, so we have:

Full Model SRF:                ŷ = β̂0 + β̂x x + β̂z z
Estimated Model SRF:           ŷ = γ̂0 + γ̂x x
Collinearity Regression SRFz:  ẑ = α̂0 + α̂x x

The following table summarizes the qualitative effects of omitted variable bias when you omit, say, z, from the model and just regress y on x. Note that since α̂x = ρ̂xz (Sz/Sx), the sign of α̂x will be the same as the sign of the correlation between x and z, ρ̂xz.

Omitted Variable Bias: β̂z α̂x

                               z coeff. in full model
correlation between x and z    β̂z > 0        β̂z < 0
α̂x > 0                        Positive       Negative
α̂x < 0                        Negative       Positive

This table is useful because often, and especially with "favorite coefficient" models, you want to be able to sign the bias. And if you're lucky, you can say something like: "I estimated a positive effect, and I know that I have an issue with omitted variable bias; but since I'm confident that that bias is negative, I know that the true effect is even more positive than I've estimated, and so I'm confident that there really is a positive relationship." But of course, if the omitted variable bias is positive, then you know that you've overestimated the effect, and now maybe you aren't so sure that the actual effect is positive: it could just be the omitted variable bias driving the result.
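The sign logic in the table can also be checked numerically. A Python sketch on simulated data (names illustrative), with a positive z coefficient and x, z negatively correlated, so the table predicts a negative bias:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000
x = rng.normal(size=n)
z = -0.8 * x + rng.normal(size=n)                  # x and z negatively correlated
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # true x effect is 2.0; z coeff > 0

def slope(y, x):
    """OLS slope from a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

alpha_x = slope(z, x)                # collinearity regression slope
rho_xz = np.corrcoef(x, z)[0, 1]

# alpha_x takes the sign of the x-z correlation, as claimed
print(np.sign(alpha_x) == np.sign(rho_xz))

# Positive z coefficient, negative alpha_x: the table says the bias is negative,
# so the SLR slope of y on x should land below the true value of 2.0
print(slope(y, x))
```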
What to do if you fear omitted variable bias:

1. Don't be lazy. Get the data and include it in your model.

2. But maybe you can't get the data. Then maybe use an available proxy variable which is highly correlated with the omitted variable. Or try several proxy variables and see if it matters. (Example: If you don't have data on disposable personal income by MSA, use median per capita income as a proxy, or maybe median housing sales prices, or median monthly rent data.)

3. And if you are really lazy and don't want to find proxies, try the oh-so-sophisticated Instrumental Variables approach, which we'll discuss later in the semester. But only if you are really, really lazy! (Yes, you see my bias!)

So far we've looked at going from two to one explanatory variables. How do things change if we have more RHS variables in the full model? Not much! Suppose that the full model includes a third explanatory variable w, so that the full model SRF is:

SRF 1: ŷ = β̂0 + β̂x x + β̂z z + β̂w w.

But you drop w for the analysis and just regress y on x and z, with the resulting SRF:

SRF: ŷ = b̂0 + b̂x x + b̂z z.

In this case, both of the estimated slope coefficients for x and z can be impacted by the omission of w from the model. To determine the bias, run the collinearity regression, regressing the omitted variable w on x and z, and generating:

SRFw: ŵ = α̂0 + α̂x x + α̂z z.

Then the omitted variable biases/impacts from excluding the w's from the model are:

b̂x − β̂x = α̂x β̂w   (the product of the SRFw x coeff and the SRF 1 w coeff)
b̂z − β̂z = α̂z β̂w   (the product of the SRFw z coeff and the SRF 1 w coeff)

Here's an example, returning to the bodyfat dataset. In Model (1), Brozek has been regressed on wgt, hgt and abd. In Model (2), abd has been dropped from the model. Model (3) gives the results of the collinearity regression in which abd, the omitted variable in Model (2), is regressed on hgt and wgt, the RHS variables in Model (2).
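(Before the Stata output, the two three-regressor formulas can be checked the same way. A Python sketch on simulated data, not the bodyfat data; names illustrative:)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 800
x = rng.normal(size=n)
z = rng.normal(size=n)
w = 0.4 * x - 0.7 * z + rng.normal(size=n)                 # w collinear with x and z
y = 2.0 + 1.0 * x + 0.5 * z + 1.5 * w + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients (intercept first) from regressing y on cols."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

_, beta_x, beta_z, beta_w = ols(y, x, z, w)   # full model
_, b_x, b_z = ols(y, x, z)                    # w omitted
_, alpha_x, alpha_z = ols(w, x, z)            # collinearity regression: w on x and z

# Each slope shifts by (collinearity coefficient) x (omitted w coefficient)
print(b_x - beta_x, alpha_x * beta_w)   # equal
print(b_z - beta_z, alpha_z * beta_w)   # equal
```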
------------------------------------------------------------
                 (1)        (2)        (3)
              Brozek     Brozek        abd
------------------------------------------------------------
wgt             -.12***    .187***    .349***
             (-5.41)     (14.48)     (34.31)
hgt            -.012      -.065***  -.0605***
             (-1.43)     (-6.9)      (-7.41)
abd              .88***
             (15.19)
_cons          -3.66***   31.16***    7.53***
             (-5.1)      (4.51)      (13.3)
------------------------------------------------------------
N
R-sq
rmse

t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

Notice that when abd was dropped from Model (1): the estimated wgt coefficient increased by .307 (= (.187) − (−.12)), and the estimated hgt coefficient dropped by .053, going from −.012 to −.065. Applying the formulas above, we estimate the bias using the product of the abd coefficient in Model (1) and the respective RHS variable coefficients in the collinearity regression (Model (3)):

wgt bias: (.88) × (.349) = .307, as advertised
hgt bias: (.88) × (−.0605) = −.053, also as advertised

5. Interpreting Coefficients II (SLR): Regressing y on What's New about x

Earlier, we saw that the MLR coefficients told you something about the incremental impacts on predicted values of the dependent variable, holding everything else fixed (ceteris paribus). We now turn to a second interpretation: the MLR coefficients capture the SLR relationship between the dependent variable and, in each case, What's New about the specific RHS variable.

To see this, first run the collinearity regression to determine What's New about a particular explanatory variable (What's New = the residuals from that first regression), and then run a SLR model regressing y on What's New. You'll see that the SLR coefficient in the second model is exactly the same as the respective coefficient in the MLR model.

So let's consider the full model, with SRF 1: ŷ = β̂0 + β̂x x + β̂z z + β̂w w. We are interested in better understanding the x coefficient, β̂x, and proceed in two steps:
Step 1: What's New about x

Run the collinearity regression, regressing x on the other RHS variables:

SRF 2: x̂ = α̂0 + α̂z z + α̂w w   (residuals: û_i = x_i − x̂_i)

The residuals in the collinearity regression, the û_i's, are the part of the variable x not explained by the other RHS variables in the model. We call that WhatsNewx: the part of the x's not explained by the z's and w's. So: WhatsNewx = û.

Step 2: Regress y on WhatsNewx

If you run the SLR model regressing y on WhatsNewx, you'll discover that the estimated coefficient for WhatsNewx will be exactly the same as the x coefficient in the full MLR model. So β̂x captures the relationship between the dependent variable and What's New about the x's, the residuals in the collinearity regression (and the part of the x's not explained by the other RHS variables in the model).

Here's an example using the bodyfat dataset and focusing on the abd variable:

. reg Brozek wgt hgt abd

  (output omitted; Adj R-squared = .7177)

. reg abd wgt hgt

  (output omitted; Adj R-squared = .853)

Use the predict ..., res command to capture the residuals:
. predict whatsnew, res

And regress Brozek on What's New using a SLR model:

. reg Brozek whatsnew

  (output omitted; Adj R-squared = .566)

So effectively, the estimated coefficients in MLR models capture the relationship between the dependent variable and What's New with each of the RHS variables.
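The two-step What's-New recipe (the Frisch-Waugh-Lovell result) can be replicated outside Stata as well. A Python sketch on simulated data (not the bodyfat data; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 600
z = rng.normal(size=n)
w = rng.normal(size=n)
x = 0.5 * z - 0.2 * w + rng.normal(size=n)                 # x collinear with z and w
y = 1.0 + 2.0 * x - 1.0 * z + 0.5 * w + rng.normal(size=n)

def ols(y, *cols):
    """OLS coefficients (intercept first) from regressing y on cols."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

_, beta_x, _, _ = ols(y, x, z, w)   # x coefficient in the full MLR model

# Step 1: What's New about x = residuals from the collinearity regression
a = ols(x, z, w)
whats_new = x - np.column_stack([np.ones(n), z, w]) @ a

# Step 2: the SLR of y on What's New reproduces the MLR x coefficient exactly
_, slr_coef = ols(y, whats_new)
print(beta_x, slr_coef)   # identical, to machine precision
```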
Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,
More informationECON Introductory Econometrics. Lecture 7: OLS with Multiple Regressors Hypotheses tests
ECON4150 - Introductory Econometrics Lecture 7: OLS with Multiple Regressors Hypotheses tests Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 7 Lecture outline 2 Hypothesis test for single
More informationECON2228 Notes 7. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 41
ECON2228 Notes 7 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 6 2014 2015 1 / 41 Chapter 8: Heteroskedasticity In laying out the standard regression model, we made
More informationsociology sociology Scatterplots Quantitative Research Methods: Introduction to correlation and regression Age vs Income
Scatterplots Quantitative Research Methods: Introduction to correlation and regression Scatterplots can be considered as interval/ratio analogue of cross-tabs: arbitrarily many values mapped out in -dimensions
More informationLab 07 Introduction to Econometrics
Lab 07 Introduction to Econometrics Learning outcomes for this lab: Introduce the different typologies of data and the econometric models that can be used Understand the rationale behind econometrics Understand
More informationAt this point, if you ve done everything correctly, you should have data that looks something like:
This homework is due on July 19 th. Economics 375: Introduction to Econometrics Homework #4 1. One tool to aid in understanding econometrics is the Monte Carlo experiment. A Monte Carlo experiment allows
More informationLecture 5. In the last lecture, we covered. This lecture introduces you to
Lecture 5 In the last lecture, we covered. homework 2. The linear regression model (4.) 3. Estimating the coefficients (4.2) This lecture introduces you to. Measures of Fit (4.3) 2. The Least Square Assumptions
More informationECON Introductory Econometrics. Lecture 6: OLS with Multiple Regressors
ECON4150 - Introductory Econometrics Lecture 6: OLS with Multiple Regressors Monique de Haan (moniqued@econ.uio.no) Stock and Watson Chapter 6 Lecture outline 2 Violation of first Least Squares assumption
More informationMulticollinearity Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2015
Multicollinearity Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2015 Stata Example (See appendices for full example).. use http://www.nd.edu/~rwilliam/stats2/statafiles/multicoll.dta,
More informationECON3150/4150 Spring 2016
ECON3150/4150 Spring 2016 Lecture 4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo Last updated: January 26, 2016 1 / 49 Overview These lecture slides covers: The linear regression
More informationChapter 16. Simple Linear Regression and dcorrelation
Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationCorrelation Analysis
Simple Regression Correlation Analysis Correlation analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the
More informationMultiple Regression. Midterm results: AVG = 26.5 (88%) A = 27+ B = C =
Economics 130 Lecture 6 Midterm Review Next Steps for the Class Multiple Regression Review & Issues Model Specification Issues Launching the Projects!!!!! Midterm results: AVG = 26.5 (88%) A = 27+ B =
More informationChapter 3 Multiple Regression Complete Example
Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be
More informationSpecification Error: Omitted and Extraneous Variables
Specification Error: Omitted and Extraneous Variables Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 5, 05 Omitted variable bias. Suppose that the correct
More informationLI EAR REGRESSIO A D CORRELATIO
CHAPTER 6 LI EAR REGRESSIO A D CORRELATIO Page Contents 6.1 Introduction 10 6. Curve Fitting 10 6.3 Fitting a Simple Linear Regression Line 103 6.4 Linear Correlation Analysis 107 6.5 Spearman s Rank Correlation
More informationRegression with a Single Regressor: Hypothesis Tests and Confidence Intervals
Regression with a Single Regressor: Hypothesis Tests and Confidence Intervals (SW Chapter 5) Outline. The standard error of ˆ. Hypothesis tests concerning β 3. Confidence intervals for β 4. Regression
More informationECON3150/4150 Spring 2015
ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2
More informationIntermediate Econometrics
Intermediate Econometrics Markus Haas LMU München Summer term 2011 15. Mai 2011 The Simple Linear Regression Model Considering variables x and y in a specific population (e.g., years of education and wage
More informationSTA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007
STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.
More informationLinear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error?
Simple linear regression Linear Regression Nicole Beckage y " = β % + β ' x " + ε so y* " = β+ % + β+ ' x " Method to assess and evaluate the correlation between two (continuous) variables. The slope of
More informationSimple Linear Regression
Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)
More informationLecture 11: Simple Linear Regression
Lecture 11: Simple Linear Regression Readings: Sections 3.1-3.3, 11.1-11.3 Apr 17, 2009 In linear regression, we examine the association between two quantitative variables. Number of beers that you drink
More informationPractice exam questions
Practice exam questions Nathaniel Higgins nhiggins@jhu.edu, nhiggins@ers.usda.gov 1. The following question is based on the model y = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 + u. Discuss the following two hypotheses.
More information1 Multiple Regression
1 Multiple Regression In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves only
More informationStatistical Modelling in Stata 5: Linear Models
Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does
More informationStart with review, some new definitions, and pictures on the white board. Assumptions in the Normal Linear Regression Model
Start with review, some new definitions, and pictures on the white board. Assumptions in the Normal Linear Regression Model A1: There is a linear relationship between X and Y. A2: The error terms (and
More informationImmigration attitudes (opposes immigration or supports it) it may seriously misestimate the magnitude of the effects of IVs
Logistic Regression, Part I: Problems with the Linear Probability Model (LPM) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 22, 2015 This handout steals
More informationMultiple Regression Analysis. Part III. Multiple Regression Analysis
Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant
More informationMotivation for multiple regression
Motivation for multiple regression 1. Simple regression puts all factors other than X in u, and treats them as unobserved. Effectively the simple regression does not account for other factors. 2. The slope
More informationLab 11 - Heteroskedasticity
Lab 11 - Heteroskedasticity Spring 2017 Contents 1 Introduction 2 2 Heteroskedasticity 2 3 Addressing heteroskedasticity in Stata 3 4 Testing for heteroskedasticity 4 5 A simple example 5 1 1 Introduction
More information1 Independent Practice: Hypothesis tests for one parameter:
1 Independent Practice: Hypothesis tests for one parameter: Data from the Indian DHS survey from 2006 includes a measure of autonomy of the women surveyed (a scale from 0-10, 10 being the most autonomous)
More information(ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box.
FINAL EXAM ** Two different ways to submit your answer sheet (i) Use MS-Word and place it in a drop-box. (ii) Scan your answer sheets INTO ONE FILE only, and submit it in the drop-box. Deadline: December
More informationWeek 3: Simple Linear Regression
Week 3: Simple Linear Regression Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ALL RIGHTS RESERVED 1 Outline
More informationLecture notes on Regression & SAS example demonstration
Regression & Correlation (p. 215) When two variables are measured on a single experimental unit, the resulting data are called bivariate data. You can describe each variable individually, and you can also
More informationThe Classical Linear Regression Model
The Classical Linear Regression Model ME104: Linear Regression Analysis Kenneth Benoit August 14, 2012 CLRM: Basic Assumptions 1. Specification: Relationship between X and Y in the population is linear:
More informationMultiple Regression Analysis. Basic Estimation Techniques. Multiple Regression Analysis. Multiple Regression Analysis
Multiple Regression Analysis Basic Estimation Techniques Herbert Stocker herbert.stocker@uibk.ac.at University of Innsbruck & IIS, University of Ramkhamhaeng Regression Analysis: Statistical procedure
More informationAutocorrelation. Think of autocorrelation as signifying a systematic relationship between the residuals measured at different points in time
Autocorrelation Given the model Y t = b 0 + b 1 X t + u t Think of autocorrelation as signifying a systematic relationship between the residuals measured at different points in time This could be caused
More informationThe Simple Regression Model. Simple Regression Model 1
The Simple Regression Model Simple Regression Model 1 Simple regression model: Objectives Given the model: - where y is earnings and x years of education - Or y is sales and x is spending in advertising
More informationLecture 8: Instrumental Variables Estimation
Lecture Notes on Advanced Econometrics Lecture 8: Instrumental Variables Estimation Endogenous Variables Consider a population model: y α y + β + β x + β x +... + β x + u i i i i k ik i Takashi Yamano
More informationChapter 4: Regression Models
Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,
More informationNonrecursive models (Extended Version) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015
Nonrecursive models (Extended Version) Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised April 6, 2015 NOTE: This lecture borrows heavily from Duncan s Introduction
More informationECON 497 Final Exam Page 1 of 12
ECON 497 Final Exam Page of 2 ECON 497: Economic Research and Forecasting Name: Spring 2008 Bellas Final Exam Return this exam to me by 4:00 on Wednesday, April 23. It may be e-mailed to me. It may be
More informationEconometrics. 8) Instrumental variables
30C00200 Econometrics 8) Instrumental variables Timo Kuosmanen Professor, Ph.D. http://nomepre.net/index.php/timokuosmanen Today s topics Thery of IV regression Overidentification Two-stage least squates
More informationRegression Analysis IV... More MLR and Model Building
Regression Analysis IV... More MLR and Model Building This session finishes up presenting the formal methods of inference based on the MLR model and then begins discussion of "model building" (use of regression
More informationGreene, Econometric Analysis (7th ed, 2012)
EC771: Econometrics, Spring 2012 Greene, Econometric Analysis (7th ed, 2012) Chapters 2 3: Classical Linear Regression The classical linear regression model is the single most useful tool in econometrics.
More information