CSSS/SOC/STAT 321: Case-Based Statistics I
MULTIPLE REGRESSION, part 1
Christopher Adolph
Department of Political Science and Center for Statistics and the Social Sciences
University of Washington, Seattle

Motivating Example: Cross-national determinants of fertility
We have cross-national data from several sources:
Fertility: The average number of children born per adult female, in 2000 (United Nations)
Education Ratio: The ratio of girls to boys in primary and secondary education, in 2000 (World Bank Development Indicators)
GDP per capita: Economic activity in thousands of dollars, purchasing power parity, in 2000 (Penn World Tables)
Agricultural Labor: Percentage of the labor force working in agriculture, in 2000 (International Labor Organization)
Note the addition of a fourth variable

Motivating Example: Cross-national determinants of fertility
All three independent variables might cause the fertility rate
More agricultural nations may have more children to bolster the labor force on family farms
Let's look at the univariate summaries & bivariate regression results for this new covariate

Summary of Univariate Distribution: Agricultural Labor
[Figure: histogram of agricultural workers as % of labor force (x-axis, 0 to 80) against frequency (y-axis, 0 to 40). Median = 8.1%, mean = 16.0%, std dev = 17.9%.]
How would you describe this distribution?

Regression of Fertility on Agricultural Labor
Variable             Estimate   (se)     t-stat   p-value
Intercept            1.83       (0.15)   12.34    <0.001
Agricultural Labor   0.02       (0.01)    3.52    <0.001
N                    72
R²                   0.15
RMSE                 0.93
How do we read this table?
Note the reduction in N: lots of cases are missing data on agricultural labor
Any cases missing any covariates need to be deleted from the data before using regression (listwise deletion)
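Listwise deletion can be sketched in a few lines. The rows below are made-up values, not the fertility data, and the helper name `listwise_delete` is hypothetical; the point is only that a case with any missing field is dropped entirely.

```python
# Hypothetical rows: (country, fertility, ag_labor); None marks a missing value.
rows = [
    ("A", 2.1, 5.0),
    ("B", 6.3, None),    # missing covariate
    ("C", 1.8, 12.5),
    ("D", None, 40.0),   # missing outcome
    ("E", 3.4, 30.0),
]

def listwise_delete(records):
    """Keep only records in which no field is missing (None)."""
    return [r for r in records if all(v is not None for v in r)]

complete = listwise_delete(rows)
print(len(rows), "->", len(complete))  # 5 -> 3
```

If missingness is related to the outcome, the surviving cases may no longer resemble the full sample.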

[Figure: scatterplot of Fertility Rate (y-axis, 2 to 8) against agricultural workers as % of labor force (x-axis, 0 to 50).]
What looks different about this scatterplot?
The high fertility cases seem to be missing (deleted due to missing data)

[Figure: scatterplot of Fertility Rate (y-axis, 2 to 8) against agricultural workers as % of labor force (x-axis, 0 to 50), with each point labelled by country name.]

[Figure: scatterplot of Fertility Rate (y-axis, 2 to 8) against agricultural workers as % of labor force (x-axis, 0 to 50).]
Is this a strong relationship?
How many datapoints would have to move to reduce the slope to 0?

[Figure: scatterplot of Fertility Rate (y-axis, 2 to 8) against agricultural workers as % of labor force (x-axis, 0 to 50).]
Which are larger, the residuals or the explained variance?

[Figure: density plot of the residuals from Fertility vs Agriculture (x-axis, -2 to 3; density up to 0.4).]
What is the standard deviation of this distribution called?
The RMSE, or standard error of the regression: how much predictions from this model tend to miss by
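As a rough sketch of how the RMSE is computed from residuals: the residual values below are invented, and dividing by the residual degrees of freedom (N minus the number of estimated coefficients) is an assumption about the convention in use; some software divides by N instead.

```python
import math

def rmse(residuals, n_params):
    """Standard error of the regression: sqrt(SSR / (N - n_params))."""
    dof = len(residuals) - n_params          # residual degrees of freedom
    ssr = sum(e * e for e in residuals)      # sum of squared residuals
    return math.sqrt(ssr / dof)

# Hypothetical residuals from a bivariate fit
# (intercept + one slope = 2 estimated coefficients)
resid = [0.5, -1.0, 0.25, 0.75, -0.5]
print(round(rmse(resid, n_params=2), 3))
```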

[Figure: Fertility Rate (y-axis, 2 to 8) with a fitted line; x-axis (0 to 50) labelled "Residuals from Fertility vs Agriculture".]
How confident are we that this line has a positive slope?
Are we as confident as we were for the other models?

Confounders and Omitted Variable Bias
Which (if any) of the three models we've looked at are right?
Do Education, GDP, and Ag Labor all affect Fertility?
What if Education, GDP, and Ag Labor are correlated?
If we regress Fertility on Education, and Education is correlated with GDP and Ag, might it proxy all three variables?
Yes: if countries which educate women also tend to be rich and have few ag workers, then the bivariate results will blur all three relationships

Confounders and Omitted Variable Bias
Should we be worried?
Correlation between Education & GDP is 0.46
Correlation between GDP & Ag is -0.64
Correlation between Education & Ag is -0.41
(What do these numbers mean?)
Omitted variable bias: Leaving any of these variables out of our model could lead to misleading estimates of the effects of any variables we do include
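A small simulation can make omitted variable bias concrete. Everything here is synthetic and assumed for illustration (the variable names, the true slopes of 1.0 and 2.0, and the 0.6 link between the covariates); it is not the fertility data. The helper `ols` is a hypothetical convenience wrapper around NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data: x2 is correlated with x1, and both affect y.
# True slopes: 1.0 on x1, 2.0 on x2.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

def ols(y, *covariates):
    """Least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

short = ols(y, x1)        # omits x2
full = ols(y, x1, x2)     # includes both covariates

print("slope on x1, x2 omitted: ", round(short[1], 2))
print("slope on x1, x2 included:", round(full[1], 2))
```

With x2 left out, the slope on x1 absorbs part of x2's effect (roughly 1.0 + 2.0 × 0.6 in this setup); with x2 included, it returns to about 1.0.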

The linear regression model, redux
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$
Our dependent variable likely depends on many covariates
For example, $x_1$ might be the education ratio, $x_2$ might be GDP per capita, and so on for as many covariates as we have, up to our $k$th covariate
This leads to the above model, with multiple partial slopes $\beta_1, \beta_2, \dots, \beta_k$

The linear regression model, redux
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$
This model is still a linear regression model. It is sometimes called a multiple regression model to distinguish it from a bivariate regression, but mathematically, they are equivalent
Henceforth, we will assume a linear regression can have many covariates
How many covariates are allowed? Up to $N - 1$, where $N$ is the number of observations
Each covariate added uses up a degree of freedom; once they are gone, there is nothing left for an additional covariate to explain
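To see what "using up" degrees of freedom looks like, here is a sketch with pure noise: six observations and five unrelated random covariates plus an intercept. All values are simulated and the sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                     # observations
y = rng.normal(size=n)                    # outcome: pure noise
covs = rng.normal(size=(n, n - 1))        # N - 1 unrelated noise covariates

X = np.column_stack([np.ones(n), covs])   # intercept + N - 1 covariates = N columns
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# With as many coefficients as observations, the residuals vanish
# (up to floating-point error) even though the covariates are meaningless.
print(np.allclose(resid, 0.0, atol=1e-6))
```

A "perfect" fit of this kind says nothing about the covariates; there is simply no residual variation left to explain.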

The linear regression model, redux
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$
How do we interpret the $\beta$'s? Just as before.
The $\beta$'s are still slopes, or the amount $y_i$ changes on average for a 1 unit increase in $x$, all else held equal
If we increase $x_1$ by 1 unit, and hold $x_2$ fixed at its present level, then $y$ goes up by $\beta_1$
We've finally found a way to control for confounders using observational data!

The linear regression model, redux
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i$$
Aside for calculus-users: The $\beta$'s are partial derivatives with respect to the $x$ they multiply
To see this, imagine a model with three covariates: $x_1$, $x_2$, and $x_3$. What is the effect of a tiny change in $x_2$ on $y$, holding other $x$'s constant?
$$\frac{\partial y}{\partial x_2} = \frac{\partial (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}{\partial x_2} = \beta_2$$
This makes $\beta_k$ a very useful summary of the effect of $x_k$ on $y$

Multiple regression: just like bivariate
1. Our estimates, $\hat\beta_k$, are the $\beta_k$'s that minimize the sum of the squared residuals (least squares).
2. The uncertainty of each $\hat\beta_k$ is given by its standard error.
3. We can still perform t-tests and calculate confidence intervals for each $\hat\beta_k$.
4. We can still calculate the fitted value $\hat y_i$ of any observation $i$: this is the model prediction for that case.
5. We can still summarize goodness of fit using such measures as RMSE and R².
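The five quantities listed above can be sketched with NumPy on simulated data. The data-generating values (3.0, 1.5, -0.5) are arbitrary assumptions, the standard-error formula is the usual textbook one for least squares, and 1.96 is the large-sample critical value rather than the exact t quantile.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 2

# Synthetic covariates and outcome (illustrative values only)
X_cov = rng.normal(size=(n, k))
y = 3.0 + 1.5 * X_cov[:, 0] - 0.5 * X_cov[:, 1] + rng.normal(size=n)

X = np.column_stack([np.ones(n), X_cov])     # add an intercept column

# 1. Least squares estimates
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# 4. Fitted values and residuals
y_hat = X @ beta_hat
resid = y - y_hat

# 5. Goodness of fit
dof = n - (k + 1)                            # residual degrees of freedom
rmse = np.sqrt(resid @ resid / dof)
r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# 2. Standard errors: sqrt of the diagonal of rmse^2 * (X'X)^{-1}
se = np.sqrt(np.diag(rmse**2 * np.linalg.inv(X.T @ X)))

# 3. t-statistics and approximate 95% confidence intervals
t_stat = beta_hat / se
ci = np.column_stack([beta_hat - 1.96 * se, beta_hat + 1.96 * se])

print(np.round(beta_hat, 2), np.round(se, 2), round(rmse, 2), round(r2, 2))
```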

Fertility as a function of Education and GDP per capita
Let's start small: a model with two covariates:
$$\widehat{\text{Fertility}}_i = \hat\beta_0 + \hat\beta_1 \text{EduRatio}_i + \hat\beta_2 \text{GDPpc}_i$$
$$\widehat{\text{Fertility}}_i = 11.25 - 0.08\, \text{EduRatio}_i - 0.05\, \text{GDPpc}_i$$
We can present this result in several ways:
1. In a table by itself
2. In a table compared to other models
3. Through graphics

Regression of Fertility on Education Ratio & GDP
Variable              Estimate   (se)     t-stat   p-value
Intercept             11.25      (0.73)   15.46    <0.001
Education Ratio       -0.08      (0.01)   -9.93    <0.001
GDP per capita ($k)   -0.05      (0.01)   -5.32    <0.001
N                     130
R²                    0.64
RMSE                  1.01
How do we interpret the above?
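One way to read the table is to plug values into the fitted equation. The sketch below uses the table's rounded estimates; the example country (Education Ratio 90, GDP per capita $10k) is hypothetical, and the function name is made up for illustration. By hand: 11.25 - 0.08(90) - 0.05(10) = 3.55.

```python
# Rounded estimates from the table: intercept, Education Ratio, GDP per capita ($k)
B0, B_EDU, B_GDP = 11.25, -0.08, -0.05

def predicted_fertility(edu_ratio, gdp_pc_k):
    """Model prediction of fertility for given covariate values."""
    return B0 + B_EDU * edu_ratio + B_GDP * gdp_pc_k

base = predicted_fertility(90, 10)       # hypothetical country
plus_one = predicted_fertility(91, 10)   # same GDP, Education Ratio one unit higher

print(round(base, 2))
print(round(plus_one - base, 2))         # change equals the Education Ratio slope
```

Raising the Education Ratio by one unit while holding GDP fixed moves the prediction by exactly the Education Ratio coefficient, which is the "all else held equal" interpretation.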

Three regression models of fertility
Variable          Model 1   Model 2   Model 3
Intercept         12.59     4.13      11.25
                  (0.75)    (0.17)    (0.73)
Education Ratio   -0.10               -0.08
                  (0.01)              (0.01)
GDP per capita              -0.10     -0.05
                            (0.01)    (0.01)
N                 130       130       130
R²                0.55      0.35      0.64
RMSE              1.12      1.35      1.01
Standard errors in parentheses
How do we interpret the above table?

Three regression models of fertility
Variable          Model 1          Model 2          Model 3
Intercept         12.59            4.13             11.25
                  [11.11, 14.08]   [3.80, 4.46]     [9.81, 12.69]
Education Ratio   -0.10                             -0.08
                  [-0.12, -0.08]                    [-0.10, -0.06]
GDP per capita                     -0.10            -0.05
                                   [-0.12, -0.08]   [-0.07, -0.03]
N                 130              130              130
R²                0.55             0.35             0.64
RMSE              1.12             1.35             1.01
95% confidence intervals in brackets
This table presents the same information, but is easier to digest
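The bracketed intervals can be approximately recovered from an estimate and its standard error. In this sketch the critical value 1.98 is an assumption (roughly the 97.5th percentile of a t distribution with 130 - 3 = 127 degrees of freedom), and the inputs are the rounded Model 3 intercept and standard error, so the result only matches the table up to rounding.

```python
T_CRIT = 1.98  # assumed: approx. 97.5th percentile of t with 127 degrees of freedom

def conf_int(estimate, se, t_crit=T_CRIT):
    """Two-sided confidence interval: estimate +/- t_crit * se."""
    half_width = t_crit * se
    return (estimate - half_width, estimate + half_width)

# Model 3 intercept (11.25) and its standard error (0.73), both as rounded in the tables
lo, hi = conf_int(11.25, 0.73)
print(f"[{lo:.2f}, {hi:.2f}]")
```

Small discrepancies from the published interval are expected because the table's estimates and standard errors are rounded to two decimals.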

[Figure: actual data, Fertility (y-axis, 2 to 8), plotted against model fitted values, Fertility hat (x-axis, 2 to 8).]
To see the residuals, compare the model fit with reality

[Figure: actual data, Fertility (y-axis, 2 to 8), plotted against model fitted values, Fertility hat (x-axis, 2 to 8).]
Note that in the multivariate case, we need to plot against $\hat y_i$, not $x_i$, because there is more than one $x_i$

[Figure: actual data, Fertility (y-axis, 2 to 8), plotted against model fitted values, Fertility hat (x-axis, 2 to 8), with each point labelled by country name.]
Examining which cases are big outliers may suggest additional variables to include as covariates
Think of what the missing cases have in common

Visualizing the modelled relationship between many variables is tricky
[Figure: 3D plot of the linear predictor against Education Ratio and GDP per capita.]
We can do it with a 3D plot for 2 covariates, but not for 3 or more

[Figure: two panels of predicted Fertility Rate (y-axis, 2 to 8). Left panel, "Vary Education; GDP at mean": against Education Ratio (60 to 110). Right panel, "Vary GDP; Education at mean": against GDP pc $k (10 to 50).]
An alternative that works for any number of covariates:
Plot out the model predictions as a function of each covariate, while holding the other covariates fixed, e.g., at their means
Then predict the Fertility rate we would expect on average if the country had average GDP but variable Education (or vice versa)
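A minimal sketch of the "vary one covariate, hold the other at its mean" calculation, using the rounded two-covariate estimates from the earlier table. The mean GDP value below is a placeholder assumption, not a figure from the slides; substitute the actual sample mean when working with real data.

```python
# Rounded estimates from the two-covariate model in the slides
B0, B_EDU, B_GDP = 11.25, -0.08, -0.05

GDP_MEAN = 10.0  # hypothetical sample mean of GDP per capita ($k); not from the slides

# Grid spanning the plotted Education Ratio axis (60 to 110)
edu_grid = range(60, 111, 10)

# Predicted fertility along the grid, with GDP held fixed at its (assumed) mean
line = [(edu, B0 + B_EDU * edu + B_GDP * GDP_MEAN) for edu in edu_grid]

for edu, fert in line:
    print(edu, round(fert, 2))
```

Plotting these pairs gives one panel of the figure; swapping the roles of the two covariates gives the other.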

[Figure: two panels of predicted Fertility Rate (y-axis, 2 to 8). Left panel, "Vary Education; GDP at mean": against Education Ratio (60 to 110). Right panel, "Vary GDP; Education at mean": against GDP pc $k (10 to 50).]
Let's compare the multiple regression estimates (in color) with the bivariate regression results (in black)
How are they different? Are the bivariate results affected by omitted variable bias?

Regression models including Agricultural Labor
Variable              Model 1   Model 2   Model 3   Model 4
Intercept             11.15     2.76      1.83      8.95
                      (2.64)    (0.18)    (0.15)    (2.79)
Education Ratio       -0.09                         -0.06
                      (0.03)                        (0.03)
GDP per capita ($k)             -0.04               -0.03
                                (0.01)              (0.01)
Agriculture Labor                         0.02      0.004
                                          (0.01)    (0.008)
N                     72        72        72        72
R²                    0.13      0.17      0.14      0.26
RMSE                  0.94      0.92      0.93      0.88
Standard errors in parentheses

Regression models including Agricultural Labor
Variable    Model 1          Model 2          Model 3          Model 4
Intercept   11.15            2.76             1.83             8.95
            [5.90, 16.41]    [2.39, 3.13]     [1.53, 2.12]     [3.38, 14.52]
Edu Ratio   -0.09                                              -0.06
            [-0.14, -0.04]                                     [-0.12, -0.01]
GDP pc                       -0.04                             -0.03
                             [-0.06, -0.02]                    [-0.06, -0.003]
Ag Labor                                      0.02             0.004
                                              [0.01, 0.03]     [-0.01, 0.02]
N           72               72               72               72
R²          0.13             0.17             0.14             0.26
RMSE        0.94             0.92             0.93             0.88
95% confidence intervals in brackets

[Figure: three panels of predicted Fertility Rate (y-axis, 2 to 8). "Vary Edu; GDP & Ag at mean": against Education Ratio (60 to 110). "Vary GDP; Edu & Ag at mean": against GDP pc $k (10 to 50). "Vary Ag; Edu & GDP at mean": against Ag Labor % (0 to 70).]
How do we interpret these plots?
The dashed lines indicate extrapolation: no observed data have these values for the covariates

[Figure: three panels of predicted Fertility Rate (y-axis, 2 to 8). "Vary Edu; GDP & Ag at mean": against Education Ratio (60 to 110). "Vary GDP; Edu & Ag at mean": against GDP pc $k (10 to 50). "Vary Ag; Edu & GDP at mean": against Ag Labor % (0 to 70).]
The black lines show the bivariate results. Was there omitted variable bias?
YES. The apparent effect of Ag Labor was a mirage: just the omitted effect of GDP per capita. If we control for GDP, we see Ag Labor has no effect.

Warning!
Linear regression is powerful, but easy to misuse
We mentioned one assumption last time: that the error term is Normally distributed
To this we now add two additional assumptions:
Correct specification: The model contains all the covariates that produce Y. If any omitted cause of Y is correlated with the included X's, then $\hat\beta$ can no longer be trusted.
No endogeneity of Y: None of the included X's are caused by Y
More on these assumptions next time