Bivariate and Multiple Linear Regression (SECOND PART)

ACADEMIC YEAR 2013/2014
Università degli Studi di Milano
GRADUATE SCHOOL IN SOCIAL AND POLITICAL SCIENCES
APPLIED MULTIVARIATE ANALYSIS
Luigi Curini
Do not quote without the author's permission

5. Multiple Linear Regressions with continuous variables

There is always the possibility that alternative causes (rival explanations) are at work, affecting the observed relationship between X and Y. It is therefore possible (and advisable) to include more than one explanatory variable in a regression equation, and there are various reasons why we might want to do this. In fact, we rarely believe that there is only one factor influencing the outcome variable of interest. This raises several questions about the association between variables. For instance, we sometimes want to know whether there is an effect of x on y after controlling for the effect of some other variable z. After all, it may be that x is correlated with z and z is correlated with y, but for any fixed level of z there is no correlation between x and y. So an increase in x may be associated with a change in y, but this change depends entirely on what happens to z. For example, it may be that democracies are less likely to go to war with each other because they are more likely to trade with each other, and it is trading, rather than democracy, that reduces the chances of war. So we would say there is no effect of democracy on the chances of war after controlling for trade. In this sense we could have a spurious relationship: once the researcher controls for a rival causal factor, the original relationship becomes weak, or disappears. Another possibility is that we may observe no relationship between x and y because the pattern of association between x and z and between z and y confounds the effect of x on y.
For example, it may be that democracy actually decreases the chances of going to war, but democracies also have higher military capacities, which make them more likely to go to war, and the net result is that democracies are no more or less likely to go to war than other countries. These kinds of issues can sometimes be resolved using multiple linear regression. This is a direct extension of linear regression, except with more than one independent (or explanatory) variable. With p explanatory variables we have:

y = a + b1x1 + b2x2 + ... + bpxp + e

Just as in the case of regression with a single regressor, the factors that determine Y in addition to X1 and X2 are incorporated into the regression equation as an error term e. The error term is the

deviation of a particular observation from the average population relationship (from our estimate according to our model). The intercept and the slope coefficients are estimated once again by minimizing the sum of squared prediction mistakes, that is, by choosing the estimators b0, b1, and so on, so as to minimize:

Σ(i=1..n) (Yi − b0 − b1X1i − b2X2i − ... − bkXki)²

The estimators of the coefficients b0, b1, ..., bk that minimize the sum of squared mistakes are called the ordinary least squares (OLS) estimators of b0, b1, ..., bk. The OLS regression line is the straight line constructed using the OLS estimators b̂0, b̂1, etc. The predicted value based on the OLS regression line is Ŷi = b̂0 + b̂1X1i + ... + b̂kXki. The OLS residual for the ith observation is the difference between Yi and its predicted value, that is, the difference between Yi and Ŷi. In multiple linear regression the bi are partial regression coefficients. So bk is the estimated change in the dependent variable y associated with a unit increase in xk, keeping all the other independent variables in the model constant. What do we mean when we say that a particular coefficient in a multiple regression is the effect on Y of a unit change in X1, "holding X2 constant" or "controlling for X2"? Let's write a hypothetical regression function, Y = b0 + b1X1 + b2X2, and imagine changing X1 by the amount ΔX1 while not changing X2, that is, while holding X2 constant. Because X1 has changed, Y will change by some amount, say ΔY. After this change the new value of Y, Y + ΔY, is:

Y + ΔY = b0 + b1(X1 + ΔX1) + b2X2

An equation for ΔY in terms of ΔX1 is obtained by subtracting Y = b0 + b1X1 + b2X2 from Y + ΔY = b0 + b1(X1 + ΔX1) + b2X2, yielding ΔY = b1ΔX1, that is: b1 = ΔY/ΔX1, holding X2 constant. The coefficient b1 is the effect on Y (the expected change in Y) of a unit change in X1, holding X2 fixed. Another phrase used to describe b1 is the partial effect of X1 on Y, holding X2 fixed.
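The partial-effect algebra above can be checked numerically. A minimal sketch in Python (the coefficient values are made up for illustration):

```python
# With Y = b0 + b1*X1 + b2*X2, changing X1 by dX1 while holding X2 fixed
# changes Y by exactly b1*dX1, so dY/dX1 recovers b1.
b0, b1, b2 = 3.0, 2.0, -0.5  # illustrative coefficients

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

x1, x2, dx1 = 10.0, 4.0, 1.0
dy = predict(x1 + dx1, x2) - predict(x1, x2)
print(dy / dx1)  # 2.0, i.e. b1
```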
Finally, the interpretation of the intercept in the multiple regression model, b0, is similar to the interpretation of the intercept in the single-regressor model: it is the expected value of Y when X1 and X2 are zero. Simply put, the intercept b0 determines how far up the Y axis the population regression line starts. Look at the multiple linear regression of ecogr709 on const45 and federal45 together. Let's assume that your theory states that as the level of certainty of the rules increases (const45) and as the level of intra-region economic competition decreases (federal45), economic growth should increase.
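The OLS machinery described above can be sketched in a few lines, using Python/numpy as a stand-in for Stata. The data are synthetic and the "true" coefficients are chosen by us, with no error term, so the fit recovers them exactly:

```python
import numpy as np

# Simulate two regressors and an outcome that is an exact linear function
# of them: y = 1.0 + 2.0*x1 - 0.5*x2 (illustrative values, no noise).
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 1.0 + 2.0 * x1 - 0.5 * x2

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones(100), x1, x2])
# lstsq minimizes the sum of squared residuals, i.e. it computes OLS.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat, 6))  # ≈ [1.0, 2.0, -0.5]
```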

Let's first look at the scatterplot matrix for the variables in our regression model as a preliminary step.

graph matrix ecogr709 const45 federal45, half mlabel(countryt)
twoway (scatter ecogr709 const45, mlabel(countryt) mlabp(9) mlabs(vsmall)) (lfit ecogr709 const45)
twoway (scatter ecogr709 federal45, mlabel(countryt) mlabp(9) mlabs(vsmall)) (lfit ecogr709 federal45)

Now let's compare the following three equations. The first two are simply bivariate ones, while the third is a multivariate OLS.

reg ecogr709 const45
reg ecogr709 federal45
reg ecogr709 const45 federal45

Has anything changed? Why are the two IVs not significant in the first two equations, while in the third model both coefficients are significant? Now let's add one further variable (judrev45, that is, the existence of a constitution subject to judicial review; after all, this aspect too increases the certainty of rules against political interference):

reg ecogr709 const45 federal45 judrev45

So far, we have concerned ourselves with testing a single variable at a time, for example looking at the coefficient for const45 and determining whether it is significant. We can also test sets of variables, using the test command, to see if a set of variables is jointly significant. First, let's start by testing a single variable, const45, with the test command.

test const45==0
( 1) const45 = 0.0
F( 1, 385) = 12.62
Prob > F =

If you compare this output with the output from the last regression, you can see that the result of the F-test, 12.62, is the same as the square of the result of the t-test in the regression (3.55^2 = 12.62). Note that you could get the same result by typing the following, since Stata defaults to comparing the listed term(s) to 0.

test const45
( 1) const45 = 0.0
F( 1, 385) = 12.62
Prob > F =

Perhaps a more interesting test would be to see whether the contribution of the Constitution variables is significant.
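The F = t² relation quoted above can be verified from first principles. A sketch with simulated data (any dataset would do): for a single restriction, the F statistic built from restricted and unrestricted sums of squared residuals equals the square of the coefficient's t statistic.

```python
import numpy as np

# Simple regression y = 0.5*x + e on made-up data.
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
s2 = resid @ resid / (n - 2)         # residual variance estimate
cov = s2 * np.linalg.inv(X.T @ X)    # covariance matrix of the estimates
t_stat = b[1] / np.sqrt(cov[1, 1])   # t statistic for the slope

ssr_u = resid @ resid                # unrestricted SSR
ssr_r = np.sum((y - y.mean()) ** 2)  # restricted model: intercept only
f_stat = (ssr_r - ssr_u) / (ssr_u / (n - 2))  # F with one restriction
print(np.isclose(f_stat, t_stat ** 2))  # True
```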
Since the information regarding the Constitution is contained in two variables, const45 and judrev45, we include both of these with the test command.

test const45 judrev45
( 1) const45 = 0.0
( 2) judrev45 = 0.0
F( 2, 385) = 3.95
Prob > F =

The significant F-test, 3.95, means that the collective contribution of these two variables is significant. One way to think of this is that there is a significant difference between a model with const45 and judrev45 and a model without them, i.e., there is a significant difference between the "full" model and the "reduced" model. You can also compute some point estimates with the corresponding confidence intervals:

lincom _b[_cons] + _b[const45]*2 + _b[federal45]*1 + _b[judrev45]*3

Finally, we can also estimate the difference between the previous value and another value that differs only in the value of the federal45 variable.

lincom (_b[_cons] + _b[const45]*2 + _b[federal45]*3 + _b[judrev45]*3) - (_b[_cons] + _b[const45]*2 + _b[federal45]*1 + _b[judrev45]*3)

In our case, increasing the value of federalism compared to the previous situation would reduce the predicted level of growth by 1.13 points.

Addendum: Which variable is more important?

Significance testing only tells us how confident we can be that the true effect is not zero, or how confident we can be that the sign of the coefficient is correct. Often we want to know which of several predictor variables is the most important. This is a complex question for which there is no simple answer. We can start by observing that in the regression reg ecogr709 const45 federal45, the absolute coefficient of const45 is actually bigger than that of federal45. But this does not mean that const45 is more important than federal45, since a unit change in const45 does not mean the same as a unit change in federal45. They are not measured in the same units (tab const45 federal45). To address this problem, we can add an option to the regress command called beta, which will give us the standardized regression coefficients. The beta coefficients are used by some researchers to compare the relative strength of the various predictors within the model.
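As a sketch of what the beta option computes: a standardized slope rescales the raw slope into standard-deviation units, beta = b · sd(x)/sd(y). The data and the raw slope below are illustrative, with Python standing in for Stata:

```python
import statistics as st

# Standardized (beta) coefficient: raw slope rescaled by the ratio of the
# regressor's standard deviation to the outcome's standard deviation.
def beta_coefficient(b_raw, x, y):
    return b_raw * st.pstdev(x) / st.pstdev(y)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]  # y = 2x exactly, so the standardized slope is 1
print(beta_coefficient(2.0, x, y))  # 1.0
```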
Because the beta coefficients are all measured in standard deviations, instead of the units of the variables, they can be compared to one another. In other words, the beta coefficients are the coefficients that you would obtain if the outcome and predictor variables were all transformed into standard scores, also called z-scores, before running the regression. The standard score is:

z = (x − μ) / σ

where: x is a raw score to be standardized; μ is the mean of the population;

σ is the standard deviation of the population. The quantity z represents the distance between the raw score and the population mean in units of the standard deviation. z is negative when the raw score is below the mean, positive when above. The standard score indicates how many standard deviations an observation is above or below the mean: the standard deviation is the unit of measurement of the z-score. It allows comparison of observations from different normal distributions, which is done frequently in research.

reg ecogr709 const45 federal45, beta

Because the coefficients in the Beta column are all in the same standardized units, you can compare them to assess the relative strength of each of the predictors. In this example, const45 has the largest Beta coefficient. Thus, a one standard deviation increase in const45 leads to a 1.0 standard deviation increase in predicted ecogr, with the other variables held constant. In interpreting this output, remember that the difference between the numbers listed in the Coef. column and the Beta column lies in the units of measurement. For example, to describe the raw coefficient for const45 you would say: "A one-unit decrease in const45 would yield a .70-unit increase in the predicted ecogrowth." This makes a lot of sense to me! However, for the standardized coefficient (Beta) you would say: "A one-standard-deviation decrease in const45 would yield a 0.7 standard deviation increase in the predicted ecogrowth." Not that easy to understand, after all! Standardizing therefore makes the coefficients substantially more difficult to interpret! Moreover, several critics of standardized regression coefficients argue that the comparison is illusory: there is no reason why a change of one SD in one predictor should be equivalent to a change of one SD in another predictor. Some variables are easy to change, the amount of time watching television, for example. Others are more difficult: weight or cholesterol level.
Others are impossible: height or age. In summary, standardized coefficients are in general (1) more difficult to interpret, and (2) they may add seriously misleading information. The original, unstandardized coefficients are meaningful and are not subject to these problems, although they generally cannot be compared for importance. A more important and final point is that most of the time scholars are not interested in finding out which variable will win the race. Most often it is theoretically "good enough" to say that, even after controlling for a set of variables (i.e., plausible rival hypotheses, possible confounding influences), the variable in which we are interested still seems to have an important influence on the dependent variable. This is precisely the empirical evidence for which we search to substantiate or refute our theoretical expectations. Usually, little (social or political) understanding is gained by hypothesizing a winner in a race of the variables.

6. Multiple Linear Regressions with categorical variables

Suppose that we want to investigate which factors affect the shape of party systems, as estimated by the effective number of elective parties. To answer this question we will employ the Neto and Cox (1997) dataset.

The effective number of elective parties is a continuous measure of the number of parties, defined as:

ENEP = 1 / Σi vi²

where vi is the share of the vote for party i. If vote shares are replaced with shares of seats, we have the effective number of parliamentary parties instead. Suppose further that our theory predicts a relationship between ENEP and the type of government system. In particular, since the president is just one person in a national competition, presidential elections can be like a single-member simple plurality election with only one district: the whole country. So having a presidential system can be a powerful factor reducing the number of parties in the legislature. There is an indicator (or dummy) variable called pres which takes the value 1 if there is a president with executive or legislative powers and zero otherwise. First let's just create a dummy of presidential systems against non-presidential ones:

codebook prestype
recode prestype (0=0 "Parliamentary democracy") (1/2=1 "Presidential democracy"), gen(presidential)
tab prestype presidential
reg enpv presidential
table country, contents(mean enpv)

This test shows that presidential systems actually have more parties on average than parliamentary systems, but we cannot quite be confident of it at the 5% level. However, if you use a one-tailed test (i.e., you predict that the parameter will go in a particular direction), then you can divide the p-value by 2 before comparing it to your preselected alpha level. In this case, the presidential variable turns out to be significant at the 5% level.

Two-sided alternative hypothesis to H0: H0: βj = βj,0 vs. H1: βj ≠ βj,0
One-sided alternative hypothesis to H0: H0: βj = βj,0 vs. H1: βj > βj,0 (or H1: βj < βj,0)

Of course, it may be that countries with a run-off election for president are less likely to see a strong curtailing effect on the effective number of parties in the legislature.
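The ENEP formula above is a one-liner in any language. A sketch in Python with hypothetical vote shares:

```python
# Effective number of elective parties: ENEP = 1 / sum of squared vote shares.
def enep(vote_shares):
    return 1.0 / sum(v * v for v in vote_shares)

print(enep([0.5, 0.5]))    # 2.0: two equal parties count as two
print(enep([0.25] * 4))    # 4.0: four equal parties count as four
print(round(enep([0.7, 0.2, 0.1]), 2))  # 1.85: dominance shrinks the count
```

Note how one dominant party pulls the effective count well below the raw count of three.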
The variable prestype is a classification of different types of presidential system according to whether they have run-off elections for president or not. Consider the table:

table prestype, contents(freq mean enpv)

This shows that there are differences in the effective number of parties between countries with run-off and single-shot presidential elections. Suppose then that you want to test for differences between presidential types. We need to create dummy variables, one for each type of presidential system, and introduce these into the model.

tab prestype, gen(presdummy)
reg enpv presdummy2 presdummy3

A quick way to do this is to use the xi command.

xi: regress enpv i.prestype

The xi: command (that we have already discussed) can be placed before a regression-type command to create indicator variables for any explanatory variables with the i. prefix. This is helpful, especially when there are lots of different categories. Note that Stata creates new variables _Iprestype_1 and _Iprestype_2 for each value of prestype except the first. The first category is dropped because we cannot include indicators for all the different types of system in the model at once: the model would not be identified. Instead, the first category of the categorical variable is treated as the baseline category. So the coefficient of _Iprestype_1 estimates the additional number of parties associated with one-shot presidential electoral systems relative to the number in the baseline category. In this case the baseline is no presidency at all, so the _cons value is the mean for the parliamentary democracies. The coefficient of _Iprestype_1 is the mean effective number of parties for presidential systems with a single election minus the mean of the omitted group, and the coefficient of _Iprestype_2 is the mean of the presidential democracies with run-off minus the mean of the parliamentary democracies. If we are interested in whether there is an effect of the categorical variable prestype as a whole, we need to test the hypothesis that the coefficients of both the _Iprestype_ dummies are simultaneously zero. Do this using:

test _Iprestype_1 _Iprestype_2
test _Iprestype_1 _Iprestype_2, m

In this case we are doing a test of a joint hypothesis on two or more coefficients:

H0: β1 = 0 and β2 = 0 vs. H1: β1 ≠ 0 and/or β2 ≠ 0

Why can't we just test the individual coefficients one at a time? This is problematic every time there is some correlation between the regressors. We should use the F-statistic to test joint hypotheses about regression coefficients. The F-test shows that we cannot reject the hypothesis that both of the prestype dummies are zero at the 90% confidence level.
This is similar to the value we got from reg enpv presidential. We can also test whether the two effects are significantly different from each other. Do presidential democracies with a run-off system present a higher number of parties compared to presidential democracies with a single election?

test _Iprestype_1 = _Iprestype_2

Stata tests the null hypothesis that the two coefficients are equal to each other and returns a p-value for this test. Here we can conclude that the two coefficients are not significantly different from each other. What if we wanted a different group to be the reference group? For example, the run-off presidential democracies?

char prestype [omit] 2
xi: regress enpv i.prestype
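The baseline-category logic above can be checked with a toy calculation: with a full set of dummies (first category omitted), the OLS intercept is the baseline group mean, and each dummy coefficient is that group's mean minus the baseline mean. Group labels and values here are invented for illustration:

```python
import statistics as st

# Three invented groups of ENEP-like values; the first plays the baseline.
groups = {
    "parliamentary": [2.5, 3.0, 3.5],  # baseline: no dummy created
    "pres_single":   [2.0, 2.4, 2.2],
    "pres_runoff":   [3.8, 4.0, 4.2],
}
means = {g: st.mean(v) for g, v in groups.items()}
intercept = means["parliamentary"]              # plays the role of _cons
coef_single = means["pres_single"] - intercept  # first dummy coefficient
coef_runoff = means["pres_runoff"] - intercept  # second dummy coefficient
print(intercept, coef_single, coef_runoff)
```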

7. Multiple Linear Regressions with continuous and categorical variables

Let's go back to the nes2004 dataset, and let's try to explore a bit better the determinants of the popularity of the 2004 presidential candidate Kerry.

use "D:\Pdh08_09\Lezione 1\nes2004.dta", clear
codebook gender
xi: reg kerry_therm welfare_therm i.gender

(Note that Stata automatically created a dummy variable _Igender_2, coded 0 for men, the lowest value on gender, and 1 for women. Otherwise we would have to create it with the tab command!) According to our model, the impact of gender on kerry_therm is the same regardless of the value of welfare_therm (once the value of welfare_therm is fixed at a given level). That is, the impact of gender on kerry_therm is an additive one! What do we mean by that? When do we have an additive relationship between the IV, the DV and the other control variable? Every time the strength and the tendency of the relationship between IV and DV remains similar for all the values of the control variable.

lincom (_b[_cons] + _b[_Igender_2] + _b[welfare_therm]*40) - (_b[_cons] + _b[_Igender_2]*0 + _b[welfare_therm]*40)
lincom (_b[_cons] + _b[_Igender_2] + _b[welfare_therm]*70) - (_b[_cons] + _b[_Igender_2]*0 + _b[welfare_therm]*70)
predict yhat
scatter yhat welfare_therm

The coefficient for welfare_therm gives the predicted increase in kerry_therm for every one-unit increase in welfare_therm; this is the slope of the lines shown in the graph. The graph has two lines, one for males and one for females. The coefficient for gender is 3.65, indicating that being a woman, compared to being a man, is expected to increase the kerry_therm score by about 3.65 units. As you can see in the graph, the top line is about 3.65 units higher than the lower line. Moreover, the intercept is around 39 for the lower line (male: when gender is 0) and around 44 for the upper line (female: when gender is 1).
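The additivity claim (and what the two lincom commands verify) can be illustrated with a toy prediction function. The gender coefficient echoes the 3.65 from the text; the intercept and welfare slope are assumed values for the sketch:

```python
# Additive model: yhat = b0 + b_w*welfare + b_f*female. The female/male gap
# equals b_f at EVERY welfare score, i.e. the two lines are parallel.
b0, b_w, b_f = 39.0, 0.2, 3.65  # b0 and b_w are assumed; b_f is from the text

def yhat(welfare, female):
    return b0 + b_w * welfare + b_f * female

gap_40 = yhat(40, 1) - yhat(40, 0)  # gap at welfare_therm = 40
gap_70 = yhat(70, 1) - yhat(70, 0)  # gap at welfare_therm = 70
print(gap_40, gap_70)  # both equal b_f
```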
Of course, we can also use more than one dummy and more than one continuous variable within the same OLS! Let's say that we suspect that partisanship has a big effect on the Kerry thermometer as well.

codebook partyid3

We want the intercept to have a meaning, so we re-center the welfare_therm variable (this does not affect the coefficient for welfare_therm, just its interpretation):

mean welfare_therm
gen welfaremean = (welfare_therm )
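The claim in parentheses above, that centering changes only the intercept's interpretation and not the slope, can be checked with a closed-form simple regression on toy data:

```python
import statistics as st

# Closed-form simple OLS: slope = Sxy/Sxx, intercept = mean(y) - slope*mean(x).
def ols_simple(x, y):
    mx, my = st.mean(x), st.mean(y)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1  # (intercept, slope)

x = [10.0, 20.0, 30.0, 40.0]  # made-up regressor values
y = [15.0, 24.0, 37.0, 44.0]  # made-up outcome values
a_raw, b_raw = ols_simple(x, y)
a_ctr, b_ctr = ols_simple([xi - st.mean(x) for xi in x], y)  # centered x
# Slopes match; the centered intercept equals the mean of y.
print(b_raw == b_ctr, a_ctr == st.mean(y))  # True True
```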

Alternatively:

egen welfaremean = mean(welfare_therm)
list welfare_therm welfaremean in 1/10

We also want independents (partyid3 = 2) to be the omitted category, therefore:

char partyid3 [omit] 2
xi: reg kerry_therm welfare_therm i.gender i.partyid3

Now gender is significant only at the 90% level once we control for partyid3 (almost a spurious relationship between gender and kerry_therm: from this we can infer that women are more likely than men to be Democrats), and the coefficient for welfare_therm decreases (even if it is still significant after controlling for partyid3). Now the constant, our point of reference, identifies the mean kerry_therm score for a male independent with an average score on welfare_therm. The coefficient on Female tells us how much to adjust the male part of the intercept, controlling for partisanship and welfare_therm. Thus, compared with male independents with an average welfare_therm value/attitude, male Democrats average 17 degrees higher on the Kerry thermometer.

test _Ipartyid3_1 _Ipartyid3_3
test _Ipartyid3_1 = _Ipartyid3_3

SECOND ASSIGNMENT

Using the dataset on satisfaction with democracy (satisfaction_with_democracy_europe.dta), develop two competing models to explain the differences among European countries in the level of satisfaction with democracy. Introduce first your (main) hypotheses, then present the tables of the regression coefficients and describe your results in no more than 500 words. NB: ASSIGNMENTS THAT EXCEED THE WORD LIMIT WILL NOT BE MARKED.

Summing up: some GOLDEN RULES

1. My R² is bigger than yours! So publish me! Sometimes R² is considered to be a measure of the fit between the statistical model and the true model. A high R² is considered to be proof that the correct model has been specified or that the theory being tested is correct. A higher R² in one model is also taken to mean that that model is better. All these interpretations are wrong.
R² is a measure of the spread of points around a regression line. Full stop! There is nothing intrinsically interesting in the spread of points around a regression line. If you are interested in the precision with which you can confidently make inferences, then look at your standard errors (see below)!!!

2. You do not run an OLS to maximize R²! You only run an OLS to test hypotheses (hopefully derived from a theory!!!). Besides, if your R² is close to 1, you (usually) can have problems with your estimated model (such as using as an IV a different version of your DV).

3. Remember: R² does NOT tell you whether: 1) an included variable is statistically significant (to ascertain this you need to perform hypothesis testing using the t-statistics); 2) the regressors are a true cause of the movements of the DV (you can have a high R² but a relationship that is not causal: a spurious one!); 3) you have chosen the most appropriate set of regressors (this question relates only to theory and the nature of the questions being addressed; a high R², or a low one, does not mean that you have the most appropriate, or an inappropriate, set of regressors!); 4) there is no omitted variable bias just because R² is high (you have omitted variable bias in an estimator when a variable that is a determinant of the DV and is correlated with a regressor has been omitted from the regression; omitted variable bias can occur in regressions with a low or a high R²: even in this case, the theory is crucial!).

4. Never drop IVs from your model just because they are not significant! You added them according to the literature and the theory (otherwise why did you add them in the first place?). Therefore, by dropping them, you incur an (at least theoretical) omission-bias problem! Besides the statistical problems: the model estimated when you have included all your (theoretically) relevant variables is a different thing from the one you get when you drop them!!!

5. Never drop influential observations from your dataset! They are influential for a reason!!! Try to understand it to improve your model. And if you cannot, add a dummy for them or use robust standard errors (more on this later).

6. Never select your data according to the values they display on your DV: selection-bias problems! (Example: analyses of extreme-right parties in Europe.)

7. Finally, no data mining! Or if you do it, at least DO NOT SAY IT!

Sources: King, Gary.
"How Not to Lie With Statistics: Avoiding Common Mistakes in Quantitative Political Science," American Journal of Political Science, Vol. 30, No. 3 (August 1986). King, Gary. "Truth is Stranger than Prediction, More Questionable Than Causal Inference," American Journal of Political Science, Vol. 35, No. 4 (November 1991). King, Gary; Michael Tomz; and Jason Wittenberg. "Making the Most of Statistical Analyses: Improving Interpretation and Presentation," American Journal of Political Science, Vol. 44, No. 2 (April 2000).


More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM

Draft Proof - Do not copy, post, or distribute. Chapter Learning Objectives REGRESSION AND CORRELATION THE SCATTER DIAGRAM 1 REGRESSION AND CORRELATION As we learned in Chapter 9 ( Bivariate Tables ), the differential access to the Internet is real and persistent. Celeste Campos-Castillo s (015) research confirmed the impact

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 7 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 68 Outline of Lecture 7 1 Empirical example: Italian labor force

More information

Statistical Modelling in Stata 5: Linear Models

Statistical Modelling in Stata 5: Linear Models Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does

More information

Midterm 2 - Solutions

Midterm 2 - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Lecture 7: OLS with qualitative information

Lecture 7: OLS with qualitative information Lecture 7: OLS with qualitative information Dummy variables Dummy variable: an indicator that says whether a particular observation is in a category or not Like a light switch: on or off Most useful values:

More information

REVIEW 8/2/2017 陈芳华东师大英语系

REVIEW 8/2/2017 陈芳华东师大英语系 REVIEW Hypothesis testing starts with a null hypothesis and a null distribution. We compare what we have to the null distribution, if the result is too extreme to belong to the null distribution (p

More information

S o c i o l o g y E x a m 2 A n s w e r K e y - D R A F T M a r c h 2 7,

S o c i o l o g y E x a m 2 A n s w e r K e y - D R A F T M a r c h 2 7, S o c i o l o g y 63993 E x a m 2 A n s w e r K e y - D R A F T M a r c h 2 7, 2 0 0 9 I. True-False. (20 points) Indicate whether the following statements are true or false. If false, briefly explain

More information

Lab 6 - Simple Regression

Lab 6 - Simple Regression Lab 6 - Simple Regression Spring 2017 Contents 1 Thinking About Regression 2 2 Regression Output 3 3 Fitted Values 5 4 Residuals 6 5 Functional Forms 8 Updated from Stata tutorials provided by Prof. Cichello

More information

ECON3150/4150 Spring 2016

ECON3150/4150 Spring 2016 ECON3150/4150 Spring 2016 Lecture 6 Multiple regression model Siv-Elisabeth Skjelbred University of Oslo February 5th Last updated: February 3, 2016 1 / 49 Outline Multiple linear regression model and

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc.

where Female = 0 for males, = 1 for females Age is measured in years (22, 23, ) GPA is measured in units on a four-point scale (0, 1.22, 3.45, etc. Notes on regression analysis 1. Basics in regression analysis key concepts (actual implementation is more complicated) A. Collect data B. Plot data on graph, draw a line through the middle of the scatter

More information

CHAPTER 4 & 5 Linear Regression with One Regressor. Kazu Matsuda IBEC PHBU 430 Econometrics

CHAPTER 4 & 5 Linear Regression with One Regressor. Kazu Matsuda IBEC PHBU 430 Econometrics CHAPTER 4 & 5 Linear Regression with One Regressor Kazu Matsuda IBEC PHBU 430 Econometrics Introduction Simple linear regression model = Linear model with one independent variable. y = dependent variable

More information

1 A Non-technical Introduction to Regression

1 A Non-technical Introduction to Regression 1 A Non-technical Introduction to Regression Chapters 1 and Chapter 2 of the textbook are reviews of material you should know from your previous study (e.g. in your second year course). They cover, in

More information

Lab 11 - Heteroskedasticity

Lab 11 - Heteroskedasticity Lab 11 - Heteroskedasticity Spring 2017 Contents 1 Introduction 2 2 Heteroskedasticity 2 3 Addressing heteroskedasticity in Stata 3 4 Testing for heteroskedasticity 4 5 A simple example 5 1 1 Introduction

More information

Sociology 593 Exam 2 Answer Key March 28, 2002

Sociology 593 Exam 2 Answer Key March 28, 2002 Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably

More information

Statistical Inference with Regression Analysis

Statistical Inference with Regression Analysis Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Steven Buck Lecture #13 Statistical Inference with Regression Analysis Next we turn to calculating confidence intervals and hypothesis testing

More information

Multiple linear regression

Multiple linear regression Multiple linear regression Course MF 930: Introduction to statistics June 0 Tron Anders Moger Department of biostatistics, IMB University of Oslo Aims for this lecture: Continue where we left off. Repeat

More information

(a) Briefly discuss the advantage of using panel data in this situation rather than pure crosssections

(a) Briefly discuss the advantage of using panel data in this situation rather than pure crosssections Answer Key Fixed Effect and First Difference Models 1. See discussion in class.. David Neumark and William Wascher published a study in 199 of the effect of minimum wages on teenage employment using a

More information

Answer Key: Problem Set 6

Answer Key: Problem Set 6 : Problem Set 6 1. Consider a linear model to explain monthly beer consumption: beer = + inc + price + educ + female + u 0 1 3 4 E ( u inc, price, educ, female ) = 0 ( u inc price educ female) σ inc var,,,

More information

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Inference ME104: Linear Regression Analysis Kenneth Benoit August 15, 2012 August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Stata output resvisited. reg votes1st spend_total incumb minister

More information

Linear Regression with one Regressor

Linear Regression with one Regressor 1 Linear Regression with one Regressor Covering Chapters 4.1 and 4.2. We ve seen the California test score data before. Now we will try to estimate the marginal effect of STR on SCORE. To motivate these

More information

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity R.G. Pierse 1 Omitted Variables Suppose that the true model is Y i β 1 + β X i + β 3 X 3i + u i, i 1,, n (1.1) where β 3 0 but that the

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

sociology sociology Scatterplots Quantitative Research Methods: Introduction to correlation and regression Age vs Income

sociology sociology Scatterplots Quantitative Research Methods: Introduction to correlation and regression Age vs Income Scatterplots Quantitative Research Methods: Introduction to correlation and regression Scatterplots can be considered as interval/ratio analogue of cross-tabs: arbitrarily many values mapped out in -dimensions

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

Regression and Stats Primer

Regression and Stats Primer D. Alex Hughes dhughes@ucsd.edu March 1, 2012 Why Statistics? Theory, Hypotheses & Inference Research Design and Data-Gathering Mechanics of OLS Regression Mechanics Assumptions Practical Regression Interpretation

More information

Practice exam questions

Practice exam questions Practice exam questions Nathaniel Higgins nhiggins@jhu.edu, nhiggins@ers.usda.gov 1. The following question is based on the model y = β 0 + β 1 x 1 + β 2 x 2 + β 3 x 3 + u. Discuss the following two hypotheses.

More information

A Re-Introduction to General Linear Models (GLM)

A Re-Introduction to General Linear Models (GLM) A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing

More information

STA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information.

STA441: Spring Multiple Regression. This slide show is a free open source document. See the last slide for copyright information. STA441: Spring 2018 Multiple Regression This slide show is a free open source document. See the last slide for copyright information. 1 Least Squares Plane 2 Statistical MODEL There are p-1 explanatory

More information

SCATTERPLOTS. We can talk about the correlation or relationship or association between two variables and mean the same thing.

SCATTERPLOTS. We can talk about the correlation or relationship or association between two variables and mean the same thing. SCATTERPLOTS When we want to know if there is some sort of relationship between 2 numerical variables, we can use a scatterplot. It gives a visual display of the relationship between the 2 variables. Graphing

More information

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur

Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Regression Analysis and Forecasting Prof. Shalabh Department of Mathematics and Statistics Indian Institute of Technology-Kanpur Lecture 10 Software Implementation in Simple Linear Regression Model using

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 1: August 22, 2012

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 5 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 44 Outline of Lecture 5 Now that we know the sampling distribution

More information

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS THE ROYAL STATISTICAL SOCIETY 008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS The Society provides these solutions to assist candidates preparing for the examinations

More information

MORE ON MULTIPLE REGRESSION

MORE ON MULTIPLE REGRESSION DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MORE ON MULTIPLE REGRESSION I. AGENDA: A. Multiple regression 1. Categorical variables with more than two categories 2. Interaction

More information

Simple Linear Regression Using Ordinary Least Squares

Simple Linear Regression Using Ordinary Least Squares Simple Linear Regression Using Ordinary Least Squares Purpose: To approximate a linear relationship with a line. Reason: We want to be able to predict Y using X. Definition: The Least Squares Regression

More information

LECTURE 2: SIMPLE REGRESSION I

LECTURE 2: SIMPLE REGRESSION I LECTURE 2: SIMPLE REGRESSION I 2 Introducing Simple Regression Introducing Simple Regression 3 simple regression = regression with 2 variables y dependent variable explained variable response variable

More information

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model

Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model EPSY 905: Multivariate Analysis Lecture 1 20 January 2016 EPSY 905: Lecture 1 -

More information

Linear Regression with Multiple Regressors

Linear Regression with Multiple Regressors Linear Regression with Multiple Regressors (SW Chapter 6) Outline 1. Omitted variable bias 2. Causality and regression analysis 3. Multiple regression and OLS 4. Measures of fit 5. Sampling distribution

More information

Sociology 63993, Exam 2 Answer Key [DRAFT] March 27, 2015 Richard Williams, University of Notre Dame,

Sociology 63993, Exam 2 Answer Key [DRAFT] March 27, 2015 Richard Williams, University of Notre Dame, Sociology 63993, Exam 2 Answer Key [DRAFT] March 27, 2015 Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ I. True-False. (20 points) Indicate whether the following statements

More information

Empirical Application of Simple Regression (Chapter 2)

Empirical Application of Simple Regression (Chapter 2) Empirical Application of Simple Regression (Chapter 2) 1. The data file is House Data, which can be downloaded from my webpage. 2. Use stata menu File Import Excel Spreadsheet to read the data. Don t forget

More information

Econ 1123: Section 2. Review. Binary Regressors. Bivariate. Regression. Omitted Variable Bias

Econ 1123: Section 2. Review. Binary Regressors. Bivariate. Regression. Omitted Variable Bias Contact Information Elena Llaudet Sections are voluntary. My office hours are Thursdays 5pm-7pm in Littauer Mezzanine 34-36 (Note room change) You can email me administrative questions to ellaudet@gmail.com.

More information

ECON 497 Midterm Spring

ECON 497 Midterm Spring ECON 497 Midterm Spring 2009 1 ECON 497: Economic Research and Forecasting Name: Spring 2009 Bellas Midterm You have three hours and twenty minutes to complete this exam. Answer all questions and explain

More information

Nonlinear Regression Functions

Nonlinear Regression Functions Nonlinear Regression Functions (SW Chapter 8) Outline 1. Nonlinear regression functions general comments 2. Nonlinear functions of one variable 3. Nonlinear functions of two variables: interactions 4.

More information

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons

Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons Self-Assessment Weeks 8: Multiple Regression with Qualitative Predictors; Multiple Comparisons 1. Suppose we wish to assess the impact of five treatments while blocking for study participant race (Black,

More information

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

appstats8.notebook October 11, 2016

appstats8.notebook October 11, 2016 Chapter 8 Linear Regression Objective: Students will construct and analyze a linear model for a given set of data. Fat Versus Protein: An Example pg 168 The following is a scatterplot of total fat versus

More information

Chapter 7. Hypothesis Tests and Confidence Intervals in Multiple Regression

Chapter 7. Hypothesis Tests and Confidence Intervals in Multiple Regression Chapter 7 Hypothesis Tests and Confidence Intervals in Multiple Regression Outline 1. Hypothesis tests and confidence intervals for a single coefficie. Joint hypothesis tests on multiple coefficients 3.

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

SIMPLE TWO VARIABLE REGRESSION

SIMPLE TWO VARIABLE REGRESSION DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 SIMPLE TWO VARIABLE REGRESSION I. AGENDA: A. Causal inference and non-experimental research B. Least squares principle C. Regression

More information

THE PEARSON CORRELATION COEFFICIENT

THE PEARSON CORRELATION COEFFICIENT CORRELATION Two variables are said to have a relation if knowing the value of one variable gives you information about the likely value of the second variable this is known as a bivariate relation There

More information

ECO220Y Simple Regression: Testing the Slope

ECO220Y Simple Regression: Testing the Slope ECO220Y Simple Regression: Testing the Slope Readings: Chapter 18 (Sections 18.3-18.5) Winter 2012 Lecture 19 (Winter 2012) Simple Regression Lecture 19 1 / 32 Simple Regression Model y i = β 0 + β 1 x

More information

9. Linear Regression and Correlation

9. Linear Regression and Correlation 9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,

More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression.

PBAF 528 Week 8. B. Regression Residuals These properties have implications for the residuals of the regression. PBAF 528 Week 8 What are some problems with our model? Regression models are used to represent relationships between a dependent variable and one or more predictors. In order to make inference from the

More information

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression Recall, back some time ago, we used a descriptive statistic which allowed us to draw the best fit line through a scatter plot. We

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

1 Correlation and Inference from Regression

1 Correlation and Inference from Regression 1 Correlation and Inference from Regression Reading: Kennedy (1998) A Guide to Econometrics, Chapters 4 and 6 Maddala, G.S. (1992) Introduction to Econometrics p. 170-177 Moore and McCabe, chapter 12 is

More information

Lecture 24: Partial correlation, multiple regression, and correlation

Lecture 24: Partial correlation, multiple regression, and correlation Lecture 24: Partial correlation, multiple regression, and correlation Ernesto F. L. Amaral November 21, 2017 Advanced Methods of Social Research (SOCI 420) Source: Healey, Joseph F. 2015. Statistics: A

More information

LECTURE 9: GENTLE INTRODUCTION TO

LECTURE 9: GENTLE INTRODUCTION TO LECTURE 9: GENTLE INTRODUCTION TO REGRESSION WITH TIME SERIES From random variables to random processes (cont d) 2 in cross-sectional regression, we were making inferences about the whole population based

More information

STA Module 5 Regression and Correlation. Learning Objectives. Learning Objectives (Cont.) Upon completing this module, you should be able to:

STA Module 5 Regression and Correlation. Learning Objectives. Learning Objectives (Cont.) Upon completing this module, you should be able to: STA 2023 Module 5 Regression and Correlation Learning Objectives Upon completing this module, you should be able to: 1. Define and apply the concepts related to linear equations with one independent variable.

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Two-Variable Regression Model: The Problem of Estimation

Two-Variable Regression Model: The Problem of Estimation Two-Variable Regression Model: The Problem of Estimation Introducing the Ordinary Least Squares Estimator Jamie Monogan University of Georgia Intermediate Political Methodology Jamie Monogan (UGA) Two-Variable

More information

Hypothesis Tests and Confidence Intervals in Multiple Regression

Hypothesis Tests and Confidence Intervals in Multiple Regression Hypothesis Tests and Confidence Intervals in Multiple Regression (SW Chapter 7) Outline 1. Hypothesis tests and confidence intervals for one coefficient. Joint hypothesis tests on multiple coefficients

More information

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor 1. The regression equation 2. Estimating the equation 3. Assumptions required for

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Eplained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Classification & Regression. Multicollinearity Intro to Nominal Data

Classification & Regression. Multicollinearity Intro to Nominal Data Multicollinearity Intro to Nominal Let s Start With A Question y = β 0 + β 1 x 1 +β 2 x 2 y = Anxiety Level x 1 = heart rate x 2 = recorded pulse Since we can all agree heart rate and pulse are related,

More information

Basic econometrics. Tutorial 3. Dipl.Kfm. Johannes Metzler

Basic econometrics. Tutorial 3. Dipl.Kfm. Johannes Metzler Basic econometrics Tutorial 3 Dipl.Kfm. Introduction Some of you were asking about material to revise/prepare econometrics fundamentals. First of all, be aware that I will not be too technical, only as

More information

Chapter 4 Regression with Categorical Predictor Variables Page 1. Overview of regression with categorical predictors

Chapter 4 Regression with Categorical Predictor Variables Page 1. Overview of regression with categorical predictors Chapter 4 Regression with Categorical Predictor Variables Page. Overview of regression with categorical predictors 4-. Dummy coding 4-3 4-5 A. Karpinski Regression with Categorical Predictor Variables.

More information